Reinforcement Learning with sparse rewards

Video Statistics and Information

Captions
Hi everybody. In the previous video we talked about reinforcement learning, introduced the general paradigm, and discussed the problems we run into in sparse reward settings. In this video I want to dive a little deeper into some more technical work that tries to solve this sparse reward problem.

As you might remember from the previous video, the thing that makes reinforcement learning so difficult is that we're dealing with sparse rewards: for every input frame we don't actually know the target label that our network should produce, so our agent needs to learn from very sparse feedback and figure out by itself which action sequences led to the eventual reward. One particular solution that has been emerging from research is to augment the sparse extrinsic reward signal coming from the environment with additional dense reward signals that aid the learning of the agent. In particular, in this video I want to dive into some interesting technical work that introduces ideas like auxiliary reward signals, curiosity-driven exploration, and hindsight experience replay. Are you ready for a technical deep dive into state-of-the-art reinforcement learning? My name is Xander, and welcome to Arxiv Insights.

In order to solve some of the most challenging problems in reinforcement learning, we have seen the emergence of a wide range of new ideas in RL research, and one particular trend that has been very popular recently is to augment the sparse extrinsic reward signal coming from the in-game environment with additional feedback signals that help the learning of your agent. Many of these new ideas can be seen as variations of the same general concept: instead of having a very sparse reward signal that your agent only occasionally sees, we want to design additional feedback signals that are very dense; in a sense, we want to create a supervised setting. The goal of those additional feedback signals is that they are in some way related to the task we want our agent to solve, so that whenever the agent succeeds at them, it is probably also acquiring knowledge or feature extractors that are useful for the eventual sparse task we really care about. Obviously there's no way I can give you a complete overview of all the interesting research going on out there, but I'll sketch a few papers that give you an idea of the general directions we're seeing in research right now.

So let's start with auxiliary losses. In most reinforcement learning settings, our agent is presented with some kind of raw input data, like sequences of images. The agent applies some kind of feature extraction pipeline to pull the useful information out of those raw input images, and then a policy network uses those extracted features to perform the task we want it to learn. The problem in reinforcement learning is that the feedback signal can be so sparse that the agent never succeeds in extracting useful features from the input frames. One successful approach here is to add additional learning goals to the agent that leverage the strengths of supervised learning to come up with useful feature extractors on those images.
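[Editor's note: to make the auxiliary-loss idea concrete, here is a minimal sketch of how a shared encoder can feed both a policy head and an auxiliary head, with the two losses summed into one objective. This is an illustrative PyTorch-style sketch, not code from any of the papers discussed; the network sizes, the reward-prediction auxiliary task, and the 0.1 weighting are all assumptions.]

```python
# Minimal sketch: main RL loss plus a weighted auxiliary loss through a shared encoder.
# Shapes, layer sizes and the reward-prediction auxiliary task are illustrative assumptions.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Feature extractor shared by the policy head and the auxiliary head."""
    def __init__(self, obs_dim, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                                 nn.Linear(128, feat_dim), nn.ReLU())
    def forward(self, obs):
        return self.net(obs)

encoder = SharedEncoder(obs_dim=16)
policy_head = nn.Linear(64, 4)      # logits over 4 discrete actions
reward_head = nn.Linear(64, 1)      # auxiliary head: predict the next reward

opt = torch.optim.Adam(list(encoder.parameters()) +
                       list(policy_head.parameters()) +
                       list(reward_head.parameters()), lr=3e-4)

# Dummy batch standing in for sampled transitions.
obs = torch.randn(32, 16)
actions = torch.randint(0, 4, (32,))
returns = torch.randn(32)           # discounted returns (sparse in practice)
next_rewards = torch.randn(32)      # targets for the auxiliary reward-prediction task

feats = encoder(obs)
log_probs = torch.log_softmax(policy_head(feats), dim=-1)[torch.arange(32), actions]
rl_loss = -(log_probs * returns).mean()                       # simple policy-gradient loss
aux_loss = nn.functional.mse_loss(reward_head(feats).squeeze(-1), next_rewards)

loss = rl_loss + 0.1 * aux_loss     # auxiliary weight is a tunable hyperparameter
opt.zero_grad(); loss.backward(); opt.step()
```

The key design point is that the auxiliary loss only exists to shape the shared encoder: the policy never acts on the auxiliary predictions, but it benefits from the richer features they force the encoder to learn.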
So let's take a look at one specific paper from Google DeepMind called "Reinforcement Learning with Unsupervised Auxiliary Tasks". Here the standard sparse reward setting is that our agent walks around in a 3D maze and needs to find specific objects; whenever it encounters one of those objects it gets a reward. But instead of relying on this very sparse feedback signal alone, they augment the whole training process with three additional reward signals.

First, the agent learns to do what they call pixel control: given a frame from the environment, it uses the main feature extraction pipeline and learns a separate policy to maximally change the pixel intensities in parts of the input image. For example, it could learn that looking up at the sky completely changes all the pixel values in the input. In their suggested implementation, the input frame is divided into a small number of grid cells and a visual change score is computed for each cell; the policy is then trained to maximize the total visual change across all cells. The idea is that this forces the feature extractor to become sensitive to the general dynamics of the game environment.

The second auxiliary task is called reward prediction: the agent is given three recent frames from the episode sequence and is tasked to predict the reward it will get in the next step. Again, we are adding a learning goal that is only used to optimize the feature extraction pipeline in a way that we think is generally useful for the eventual goal we care about. Finally, there is a task called value function replay, which tries to estimate the value of the current state by predicting the total future reward the agent will collect from this point onwards; this is basically what any off-policy algorithm like DQN is doing all the time.

It turns out that by adding these relatively simple additional goals to the training pipeline, we can significantly increase the sample efficiency of the learning agent. Especially the pixel control task seems to work really well in three-dimensional environments, where learning to control your gaze direction, and how this influences your own visual input, is crucial for learning any kind of useful behavior.

All right, the second thing I want to look at is something called curiosity-driven exploration, and the general idea here is that you want to somehow incentivize your agent to learn about new things it discovers in its environment. In most default reinforcement learning algorithms, people use what we call epsilon-greedy exploration: most of the time your agent just selects the best possible move according to its current policy, but with a small probability epsilon it takes a random action. This epsilon starts out at 100% at the beginning of training, so the agent is doing completely random things, and as training progresses the epsilon value is gradually decayed until in the end you completely follow your policy. The idea is that by taking these random actions your agent will explore the environment.
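[Editor's note: as a concrete reference point, here is a minimal sketch of epsilon-greedy action selection with a linear decay schedule. The schedule (1.0 down to 0.05 over 100k steps) and the dummy Q-values are illustrative assumptions, not taken from a specific implementation.]

```python
# Minimal sketch of epsilon-greedy exploration with a linearly decaying epsilon.
import random

NUM_ACTIONS = 4
EPS_START, EPS_END, DECAY_STEPS = 1.0, 0.05, 100_000

def epsilon_at(step):
    """Linearly anneal epsilon from EPS_START to EPS_END over DECAY_STEPS."""
    frac = min(step / DECAY_STEPS, 1.0)
    return EPS_START + frac * (EPS_END - EPS_START)

def select_action(q_values, step):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon_at(step):
        return random.randrange(NUM_ACTIONS)
    return max(range(NUM_ACTIONS), key=lambda a: q_values[a])

# Example usage with dummy Q-values standing in for a learned network.
q = [0.1, 0.7, -0.2, 0.3]
print(select_action(q, step=0))        # almost certainly random early in training
print(select_action(q, step=100_000))  # almost always greedy (action 1) later on
```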
Now the general idea behind curiosity-driven exploration is that in many cases an agent can quickly learn some very simple behavior that earns a recurring, small amount of reward, but if the environment is hard to explore, a simple agent using epsilon-greedy exploration will never fully explore the environment in search of better policies. The idea is then to create an additional reward signal that incentivizes the agent to explore unseen regions of the state space.

A standard way to do this in reinforcement learning is with what we call a forward model. Your agent sees a specific input frame, uses some kind of feature extractor to encode that input into a latent representation, and then a forward model tries to predict that same latent representation for the next frame in the environment. Basically, it tries to learn the dynamics of its environment: based on what I'm seeing now, what is going to happen next? The assumption is that if your agent is in a place where it has been many times before, these predictions will be very good, but if it encounters a totally new situation it has never seen before, its forward model will probably not be very accurate. You can then use these prediction errors as an additional feedback signal, on top of the sparse rewards, to incentivize your agent to explore unseen regions of the state space.

One particular paper I want to highlight here introduces what the authors call an Intrinsic Curiosity Module, and it uses a very good example to show what this all means. Imagine a scenario where an agent is observing the movement of tree leaves in a breeze. Since it's really hard to exactly model the breeze, predicting the pixel changes for each leaf is going to be virtually impossible; the prediction error in pixel space will always remain high, and the agent will be forever curious about the leaves. The underlying problem here is that the agent is not aware that there are some parts of the environment it simply cannot control or predict.

So here is what the Intrinsic Curiosity Module in the paper looks like. The raw environment states s and s+1 are first encoded into feature space using a single shared encoder network. Next there are two models: a forward model, which tries to predict the features of the next state given the action chosen by the policy, and an inverse model, which tries to predict what action was taken to go from the features of state s to the features of state s+1. Finally, the feature encoding of s+1 is compared with the predicted feature encoding of s+1 given by the forward model, and the difference, which we could call the agent's surprise about what happened, is added to the reward signal for training the agent. Going back to our example of the tree leaves: since the motion of those leaves cannot be controlled by the actions of the agent, there is no incentive for the feature encoder to model the behavior of those leaves, because in the inverse model those features will never be useful for predicting the action the agent took. Therefore the resulting features from our extraction pipeline will be largely unaffected by irrelevant aspects of the environment, and we get a much more robust exploration strategy.
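[Editor's note: to make the architecture a bit more tangible, here is a minimal sketch of an ICM-style module with a shared encoder, a forward model and an inverse model. The layer sizes, the use of simple MLPs on flat observations, and the scaling of the intrinsic reward are illustrative assumptions, not the paper's exact implementation.]

```python
# Minimal sketch of an ICM-style curiosity module (shared encoder, forward
# model, inverse model). Sizes and scales are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ICM(nn.Module):
    def __init__(self, obs_dim, num_actions, feat_dim=32):
        super().__init__()
        self.num_actions = num_actions
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                     nn.Linear(64, feat_dim))
        # Forward model: phi(s) + one-hot action -> predicted phi(s')
        self.forward_model = nn.Sequential(nn.Linear(feat_dim + num_actions, 64),
                                           nn.ReLU(), nn.Linear(64, feat_dim))
        # Inverse model: [phi(s), phi(s')] -> logits over the action taken
        self.inverse_model = nn.Sequential(nn.Linear(2 * feat_dim, 64),
                                           nn.ReLU(), nn.Linear(64, num_actions))

    def forward(self, obs, next_obs, action):
        phi, phi_next = self.encoder(obs), self.encoder(next_obs)
        a_onehot = F.one_hot(action, self.num_actions).float()
        phi_next_pred = self.forward_model(torch.cat([phi, a_onehot], dim=-1))
        action_logits = self.inverse_model(torch.cat([phi, phi_next], dim=-1))
        # "Surprise": how wrong the forward model was; used as intrinsic reward.
        intrinsic_reward = 0.5 * (phi_next_pred - phi_next.detach()).pow(2).sum(-1)
        forward_loss = intrinsic_reward.mean()
        # The encoder is shaped by the inverse model, so it only keeps features
        # that help predict the agent's own actions (ignoring e.g. tree leaves).
        inverse_loss = F.cross_entropy(action_logits, action)
        return intrinsic_reward.detach(), forward_loss + inverse_loss

# Example with dummy transitions standing in for real environment data.
icm = ICM(obs_dim=16, num_actions=4)
obs, next_obs = torch.randn(8, 16), torch.randn(8, 16)
actions = torch.randint(0, 4, (8,))
r_intrinsic, icm_loss = icm(obs, next_obs, actions)  # r_intrinsic is added to the env reward
```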
In the paper they benchmark their method on a maze exploration task: you have a very complicated maze with a target position, and your agent needs to navigate around the maze to find that goal. You can see that as they increase the size and complexity of the maze, all the methods that do not use intrinsic exploration basically fail once the maze gets too big, but if you incentivize your agent to explore unseen regions, it is much more likely to eventually find the reward. I think this is a really clever idea: instead of just thinking "I want to get to this goal position, I want to get this reward", your agent should also be curious about the world and explore things it doesn't know about in order to increase its general knowledge of the environment. Again, this is just one paper working in this area, but I think a lot of people are realizing that getting your agent to explore its environment efficiently and intrinsically is a very crucial part of learning.

All right, another very beautiful extension to the standard reward setting is a recent paper from OpenAI called Hindsight Experience Replay, or HER for short. The idea behind HER is very simple but very effective, and in that sense I'm a really big fan of the paper because of the simplicity of the idea. Imagine you want to train a robotic arm to push an object on a table to a specific target location. The problem, again, is that if you do this by random exploration, it's very unlikely that you're going to get a lot of rewards, so it becomes very difficult to train the policy. The usual way around this would be to shape a dense reward, for example the Euclidean distance of the object to the target location; that way you have a very specific dense reward for every single frame and you can use simple gradient descent to train the policy. But the problem, as we've seen, is that reward shaping is not actually the solution we want to use; we would like to do this with a simple sparse reward: success or no success.

The general idea behind Hindsight Experience Replay is to learn from all episodes, even if an episode was not successful for the actual task we want to learn, and HER applies a very clever and very simple trick to get the agent to learn from unsuccessful episodes. In the beginning of learning, the agent is pushing an object around on the table, trying to get it to position A, but since the policy is not very good yet, the object ends up at position B, which is not what we wanted. Instead of just saying "hey, you did it wrong, you get a reward of 0", what HER does is pretend that moving the object to position B was actually what you wanted it to do, and then tell the agent "yes, very well done, this is how you move the object to position B". You're basically creating a very dense reward setting out of a sparse reward problem.

So let's go over the algorithm step by step. We start off with a normal off-policy reinforcement learning algorithm and a strategy for sampling goal positions, and we initialize all the parameters we need. Then we start playing episodes: given a specific goal position, we just use our current policy, we get a trajectory, and we get a final position where the object ended up. After the episode has finished, we store all those transitions in the replay buffer with the goal that was selected for the policy, but we also sample a set of modified additional goals, swap those into the stored transitions, and store everything in the replay buffer as well.
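[Editor's note: here is a minimal sketch of the goal-relabeling step just described, using the simple "final" strategy in which every transition is stored a second time with the episode's final achieved position as the goal. The data layout, the distance-based success test, and the tolerance value are illustrative assumptions, not the paper's exact code.]

```python
# Minimal sketch of HER-style goal relabeling with the "final" strategy.
import numpy as np

def sparse_reward(achieved, goal, tol=0.05):
    """1.0 if the achieved position is within `tol` of the goal, else 0.0."""
    return float(np.linalg.norm(np.asarray(achieved) - np.asarray(goal)) < tol)

def store_episode_with_her(episode, goal, replay_buffer):
    """episode: list of (state, action, achieved_position) tuples."""
    final_achieved = episode[-1][2]          # where the object actually ended up
    for state, action, achieved in episode:
        # 1) Standard transition with the originally sampled goal.
        replay_buffer.append((state, action, achieved, goal,
                              sparse_reward(achieved, goal)))
        # 2) Hindsight transition: relabel with the final achieved position,
        #    turning an unsuccessful episode into a "successful" one.
        replay_buffer.append((state, action, achieved, final_achieved,
                              sparse_reward(achieved, final_achieved)))

# Example with a dummy two-step episode in 2D.
buffer = []
episode = [((0.0, 0.0), 0, (0.1, 0.0)),
           ((0.1, 0.0), 1, (0.2, 0.1))]
store_episode_with_her(episode, goal=(1.0, 1.0), replay_buffer=buffer)
print(len(buffer))   # 4 transitions: 2 with the original goal + 2 relabeled
```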
The really nice thing about this algorithm is that after training you have a policy network that does something different depending on the goal you give it. If you want to move the object to a new location, you don't have to retrain the whole policy; you can just change the goal vector and your policy will do the right thing. In the graph here, the blue curve represents the results from Hindsight Experience Replay when the additional sampled goal was always the final state of the episode sequence, that is, the actual position the object ended up in after executing a sequence of actions. The red curve, which is even better, represents the results when the additional goals are sampled from future states encountered on the same trajectories. What I like so much about this paper is that the idea is very simple and the implementation of the algorithm is very simple as well, yet it addresses a very fundamental problem in learning: we want to use every single experience we have as maximally as possible.

In summary, we've just seen a few very different approaches to augmenting sparse reward signals with dense feedback, which I think hints at some of the first steps towards truly unsupervised learning. But despite the very impressive results we've seen, there are still a lot of really challenging problems in reinforcement learning: things like generalization, transfer learning, causality, and intuitive physics remain as challenging as ever. I would like to end this video by stating that I think reinforcement learning has tremendous potential. If you think about an autonomous robotic assistant in your house, these things might still seem like science fiction today, but I think the reality is that we are currently in the process of solving some of the most fundamental challenges in autonomous learning, much the same way that in supervised learning people came up with algorithms like backpropagation or techniques like convolutions. Looking at the impressive pace of progress over the past few years and the sheer amount of intellectual capacity working on these problems, I think breakthroughs could come surprisingly fast. Obviously, the darker side of this very exciting research is that many of the jobs we have today will be subject to a large degree of automation, an inevitable transition which I think will create a lot of social pressure and inequality, and which is probably one of the biggest challenges if we want to create a world where everybody can benefit from the advancement of artificial intelligence. So that was it, thank you very much for watching, I hope you learned something, and I'd like to see you again in the next episode of Arxiv Insights.
Info
Channel: Arxiv Insights
Views: 105,962
Keywords: Reinforcement Learning, Unsupervised Learning, Deep Learning, Auxiliary Rewards, Curiosity Driven Exploration, Hindsight Experience Replay, Deep Reinforcement Learning, DeepMind, UC Berkeley, BAIR, OpenAI, Sparse Rewards, Sparse Reward setting
Id: 0Ey02HT_1Ho
Length: 16min 1sec (961 seconds)
Published: Fri Jun 01 2018