An introduction to Reinforcement Learning

Video Statistics and Information

Reddit Comments

Great, now Skynet is going to be into kinky stuff - WHAT HAS SCIENCE DONE?!

👍 1 · u/Dave_The_Slushy · Apr 04 2018 · replies
Captions
From the amazing results on vintage Atari games, DeepMind's victory with AlphaGo, stunning breakthroughs in robotic arm manipulation, and even beating professional players at 1v1 Dota, the field of reinforcement learning has literally exploded in recent years. Ever since the impressive breakthrough on the ImageNet classification challenge in 2012, the successes of supervised deep learning have continued to pile up, and people from many different backgrounds have started using deep neural nets to solve a wide range of new tasks, including how to learn intelligent behavior in complex dynamic environments. So in this episode I will give a general introduction to the field of reinforcement learning, as well as an overview of the most challenging problems we're facing today. If you're looking for a solid introduction to the field of deep reinforcement learning, then this episode is exactly what you're looking for. My name is Xander, and welcome to Arxiv Insights.

In 2017, Pieter Abbeel gave a very inspiring demo in front of a large audience of some of the brightest minds in AI and machine learning. He showed a video where a robot is cleaning a living room, bringing somebody a bottle of beer, and basically doing a whole range of mundane tasks that robots in sci-fi movies can do without question. Then, at the end of the video, Pieter revealed that the robot's actions were actually entirely remote-controlled by a human operator. The takeaway from this demo, I think, is a very important one: the robots we've been building for decades are physically perfectly capable of doing a wide range of useful tasks, but the problem is that we can't embed them with the intelligence needed to do those things. Creating useful state-of-the-art robotics is a software challenge, not a hardware problem. It turns out that having a robot learn how to do something very simple, like picking up a bottle of beer, can be a very challenging task. So in this video I want to introduce you to the whole subfield of machine learning called reinforcement learning, which I think is one of the most promising directions to actually get to very intelligent robotic behavior.

In the most common machine learning applications, people use what we call supervised learning. This means that you give an input to your neural network model, but you also know the output your model should produce, and therefore you can compute gradients using something like the backpropagation algorithm to train that network to produce the desired outputs. So imagine you want to train a neural network to play the game of Pong. In a supervised setting, you would have a good human gamer play Pong for a couple of hours, and you would create a dataset where you log all of the frames the human is seeing on the screen, as well as the actions they take in response to those frames, that is, whether they press the up arrow or the down arrow. We can then feed those input frames through a very simple neural network that produces two possible actions at the output: it either selects the up action or the down action. By simply training on the dataset of human gameplay using something like backpropagation, we can train that neural network to replicate the actions of the human gamer.
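To make this supervised (behavior-cloning) setup concrete, here is a minimal sketch in PyTorch. The video itself shows no code, so the network size, the flattened 80x80 frame representation, and the names `PongPolicy` and `train_on_human_data` are illustrative assumptions rather than the actual setup described above.

```python
import torch
import torch.nn as nn

# Tiny policy network: a flattened 80x80 game frame in, scores for [UP, DOWN] out.
class PongPolicy(nn.Module):
    def __init__(self, n_pixels=80 * 80, n_hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_pixels, n_hidden),
            nn.ReLU(),
            nn.Linear(n_hidden, 2),  # two output actions: up or down
        )

    def forward(self, frames):
        return self.net(frames)

def train_on_human_data(policy, frames, human_actions, epochs=10, lr=1e-3):
    """Behavior cloning: make the network imitate logged human play.

    frames: (N, 6400) float tensor of observed screens.
    human_actions: (N,) long tensor, 0 for UP and 1 for DOWN.
    """
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        logits = policy(frames)
        loss = loss_fn(logits, human_actions)  # penalize disagreeing with the human
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy
```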
There are, however, two significant downsides to this approach. On the one hand, if you want to do supervised learning, you have to create a dataset to train on, which is not always a very easy thing to do. On the other hand, if you train your neural network model to simply imitate the actions of the human player, then by definition your agent can never be better at playing Pong than that human gamer. For example, if you want to train a neural net to play the game of Go better than the best human, then by definition we can't use supervised learning.

So is there a way to have an agent learn to play a game entirely by itself? Fortunately there is, and this is called reinforcement learning. The framework in reinforcement learning is actually surprisingly similar to the normal framework in supervised learning: we still have an input frame, we run it through some neural network model, and the network produces an output action, either up or down. The only difference is that now we don't actually know the target label, so we don't know in any situation whether we should have gone up or down, because we don't have a dataset to train on. In reinforcement learning, the network that transforms input frames into output actions is called the policy network.

One of the simplest ways to train a policy network is a method called policy gradients. The approach in policy gradients is that you start out with a completely random network. You feed that network a frame from the game engine, it produces a random output action, either up or down, you send that action back to the game engine, and the game engine produces the next frame; this is how the loop continues. The network in this case could be a fully connected network, but you can obviously apply convolutions as well. In reality, the output of your network consists of two numbers: the probability of going up and the probability of going down. While training, you actually sample from that distribution, so that you're not always repeating the same exact actions. This allows your agent to explore the environment a bit randomly and hopefully discover better rewards and better behavior.

Importantly, because we want to enable our agent to learn entirely by itself, the only feedback we're going to give it is the scoreboard in the game. Whenever our agent manages to score a goal, it receives a reward of +1, and whenever the opponent scores a goal, our agent receives a penalty of -1. The entire goal of the agent is to optimize its policy to receive as much reward as possible.
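In code, "sampling from the output distribution" might look like the sketch below. It reuses the hypothetical `PongPolicy` from the earlier sketch; the `select_action` helper is an assumption for illustration, not code from the video.

```python
import torch
from torch.distributions import Categorical

def select_action(policy, frame):
    """Sample UP/DOWN from the policy's output distribution.

    Sampling (instead of always taking the most likely action) is what lets
    the agent keep exploring; the log-probability is returned so it can be
    used later in the policy gradient update.
    """
    logits = policy(frame.unsqueeze(0))     # shape (1, 2): scores for [UP, DOWN]
    dist = Categorical(logits=logits)       # turns scores into P(up), P(down)
    action = dist.sample()                  # stochastic choice
    return action.item(), dist.log_prob(action)
```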
So in order to train our policy network, the first thing we do is collect a bunch of experience: we run a whole bunch of game frames through the network, select random actions, feed them back into the engine, and just create a whole bunch of random Pong games. Obviously, since our agent hasn't learned anything useful yet, it's going to lose most of those games, but sometimes it might get lucky: sometimes it will randomly select a whole sequence of actions that actually leads to scoring a goal, and in that case it receives a reward. A key thing to understand is that for every episode, regardless of whether we got a positive or a negative reward, we can already compute the gradients that would make the actions our agent chose more likely in the future, and this is crucial. What policy gradients do is that for every episode with a positive reward, we use the normal gradients to increase the probability of those actions in the future, but whenever we got a negative reward, we apply the same gradients multiplied by minus one. This minus sign ensures that all the actions we took in a very bad episode become less likely in the future. The result is that while training our policy network, the actions that lead to negative rewards are slowly filtered out and the actions that lead to positive rewards become more and more likely, so in a sense our agent is learning how to play the game of Pong. I know this was a very quick introduction to policy gradients, so if you want to spend a bit more time thinking about the details, I really recommend reading Andrej Karpathy's blog post "Pong from Pixels"; it does a phenomenal job of explaining all the details.
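A minimal sketch of that sign-flipping update is shown below: a bare-bones REINFORCE-style step where the whole episode's +1 or -1 reward scales the gradient of the chosen actions' log-probabilities. Real implementations, such as the one in Karpathy's post, additionally discount and normalize the returns; the function name and interface here are assumptions for illustration.

```python
import torch

def policy_gradient_update(optimizer, episode_log_probs, episode_reward):
    """One policy gradient update for a single finished episode.

    episode_log_probs: list of log-probabilities of the actions actually taken
                       (as returned by select_action above).
    episode_reward: +1 if our agent scored, -1 if the opponent scored.

    Multiplying by the reward flips the sign of the gradient for lost episodes,
    so the same machinery makes winning actions more likely and losing actions
    less likely.
    """
    log_probs = torch.stack(episode_log_probs)
    loss = -(log_probs.sum() * episode_reward)  # negative because optimizers minimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```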
Alright, so we can use policy gradients to train a neural network to play the game of Pong. That's amazing, right? Well, yes it is, but as always there are a few very significant downsides to using this method. Let's go back to Pong one more time. Imagine that your agent has been training for a while and it's actually doing a pretty decent job of playing Pong, bouncing the ball back and forth, but then at the end of the episode it makes a mistake: it lets the ball through and gets a negative penalty. The problem is that the policy gradient assumes that, since we lost that episode, all of the actions we took there must have been bad, and it will reduce the likelihood of taking those actions in the future. But remember that for most of that episode we were doing really well, so we don't really want to decrease the likelihood of those actions. In reinforcement learning this is called the credit assignment problem: if you get a reward at the end of an episode, what are the exact actions that led to that specific reward? This problem is entirely related to the fact that we have what we call a sparse reward setting: instead of getting a reward for every single action, we only get a reward after an entire episode, and our agent needs to figure out which part of its action sequence was causing the reward it eventually gets. In the case of Pong, for example, the agent should learn that only the actions right before it hits the ball are truly important; everything else, once the ball is flying off, doesn't really matter for the eventual reward.

The result of this sparse reward setting is that reinforcement learning algorithms are typically very sample inefficient, which means you have to give them a ton of training time before they learn useful behavior. I've made a previous video comparing the sample efficiency of reinforcement learning algorithms with human learning that goes much deeper into why this is the case. It turns out that in some extreme cases the sparse reward setting fails completely. A famous example is the game Montezuma's Revenge, where the goal of the agent is to navigate a bunch of ladders, jump over a skull, grab a key, and then navigate to the door in order to get to the next level. The problem is that by taking random actions, the agent is never going to see a single reward, because the sequence of actions it needs to take to get that reward is just too complicated; it will never get there with random actions, so the policy gradient never sees a single positive reward and has no idea what to do. The same applies to robotic control, where for example you would like to train a robotic arm to pick up an object and stack it onto something else. The typical robot arm has about seven joints it can move, so it's a relatively high-dimensional action space, and if you only give it a positive reward when it has successfully stacked a block, then by doing random exploration it's never going to see any of that reward. It's important to compare this with the traditional supervised deep learning successes in something like computer vision: the reason computer vision works so well is that for every single input frame you have a target label, which lets you do very efficient gradient descent with something like backpropagation, whereas in a reinforcement learning setting you have to deal with this very big problem of sparse rewards. This is why computer vision is showing very impressive results, while something as simple as stacking one block onto another seems very difficult even for state-of-the-art deep learning.

The traditional approach to solving this issue of sparse rewards has been the use of reward shaping. Reward shaping is the process of manually designing a reward function that guides your policy toward some desired behavior. In the case of Montezuma's Revenge, for example, you could give your agent a reward every single time it manages to avoid the skull or reach the key, and these extra rewards will guide your policy toward the desired behavior. While this obviously makes it easier for your policy to converge, there are some significant downsides to reward shaping. Firstly, reward shaping is a custom process that needs to be redone for every new environment you want to train a policy in. If you're looking at the Atari benchmark, for example, you would have to craft a new reward function for every single one of those games; that's just not scalable. The second problem is that reward shaping suffers from what we call the alignment problem. It turns out that shaping a reward function is surprisingly difficult in a lot of cases: when you shape your reward function, your agent will often find some very surprising way to collect a lot of reward while not doing at all what you wanted it to do. In a sense, the policy is just overfitting to the specific reward function you designed, while not generalizing to the intended behavior you had in mind. There are a lot of funny cases where reward shaping goes terribly wrong. Here, for example, an agent was trained to jump, and the reward function was the distance from its feet to the ground; what the agent learned was to simply grow a very tall body and do some kind of backflip to make sure its feet are very far from the ground. To give you one final idea of how hard reward shaping can be, look at this shaped reward function for a robotic control task; I don't even want to know how long the authors of that paper spent on designing this specific reward function to get the behavior they wanted. And finally, in some cases, like AlphaGo for example, you by definition don't want to do any reward shaping, because that would constrain your policy to the behavior of humans, which is not exactly optimal in every situation.
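To make reward shaping concrete, here is a hypothetical hand-shaped reward for the block-stacking task mentioned above. The terms, weights, and thresholds are invented for illustration, not taken from any specific paper; the point is that every term is a manual design choice.

```python
import numpy as np

def shaped_stacking_reward(gripper_pos, block_pos, target_pos, block_grasped):
    """A hand-designed (shaped) reward for 'pick up the block and stack it'.

    Instead of a single sparse +1 for a finished stack, partial progress is
    rewarded: getting the gripper near the block, grasping it, and carrying it
    toward the target location.
    """
    reward = 0.0
    reward -= 0.1 * np.linalg.norm(gripper_pos - block_pos)         # get close to the block
    if block_grasped:
        reward += 0.25                                               # bonus for grasping
        reward -= 0.1 * np.linalg.norm(block_pos - target_pos)      # carry it to the target
        if np.linalg.norm(block_pos - target_pos) < 0.02:
            reward += 1.0                                            # block stacked on target
    return reward
```

Each extra term makes learning easier, but also encodes assumptions the agent may exploit, which is exactly the alignment problem described above.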
So the situation we're in right now is that we know it's really hard to train in a sparse reward setting, but at the same time it's also very tricky to shape a reward function, and we don't always want to do that.

To end this video, I would like to note that a lot of media stories picture reinforcement learning as some kind of magical AI sauce that lets an agent learn on its own or improve upon its previous version, but the reality is that most of these breakthroughs are the work of some of the brightest minds alive today, and there's a lot of very hard engineering going on behind the scenes. I think one of the biggest challenges in navigating our digital landscape is discerning truth from fiction in an ocean of clickbait powered by the advertisement industry, and I think the Atlas robot from Boston Dynamics is a very clear example of what I mean. If you go out on the streets and ask a thousand people what the most advanced robots today are, they would probably point to Atlas from Boston Dynamics, because everybody has seen the video where it does a backflip. But if you think about what Boston Dynamics is actually doing, it's very likely that there's not a lot of deep learning going on there. If you look at their previous papers and research track record, they're doing a lot of very advanced robotics, don't get me wrong, but there's not a lot of self-driven behavior or intelligent decision-making going on in those robots. Boston Dynamics is a very impressive robotics company, but the media image they've created might be a little confusing to people who don't know what's going on behind the scenes.

Nonetheless, if you look at the progress of research, I think we should not be negligent about the potential risks these technologies can bring. I think it's very good that a lot more people are getting involved in AI safety research, because this is going to become very important; threats like autonomous weapons and mass surveillance are to be taken very seriously, and the only hope we have is that international law will be somewhat able to keep up with the rapid progress we see in technology. On the other hand, I also feel the media focuses way too much on the negative side of these technologies, simply because people fear what they don't understand, and fear sells more advertisement than utopias. I personally believe that most, if not all, technological progress is beneficial in the long run, as long as we can make sure there are no monopolies that can maintain or enforce their power through the malignant use of AI. But anyway, enough politics for one video.

This video was an introduction to deep reinforcement learning and an overview of the most challenging problems we're facing in the field. In the next video I will dive into some of the most recent approaches that try to tackle the problems of sample efficiency and the sparse reward setting; specifically, I will cover a few technical papers dealing with approaches like auxiliary rewards, intrinsic curiosity, hindsight experience replay, and so on. I've also seen that a few people have chosen to support me on Patreon, for which I would just like to say thank you very much. It really means a big deal to me: I'm doing these videos completely in my spare time, and knowing that there are people out there who appreciate this content really feels great. So thank you very much.
Thanks for watching, don't forget to subscribe, and I'd love to see you again in the next episode of Arxiv Insights.
Info
Channel: Arxiv Insights
Views: 403,222
Rating: 4.9472108 out of 5
Keywords: Reinforcement Learning, Machine Learning, Policy Gradients, Neural Networks, Robotic arm, Robotic control, learning robots, autonomous robots, policy network, alignment problem, reward shaping, sparse rewards, reward signal, continuous control, deep reinforcement learning
Id: JgvyzIkgxF0
Length: 16min 27sec (987 seconds)
Published: Mon Apr 02 2018