Reinforcement Learning in TensorFlow with TF-Agents (TF Dev Summit '19)

Captions
[MUSIC PLAYING]

SERGIO GUADARRAMA: So my name is Sergio Guadarrama. I'm a senior software engineer at Google Brain, and I'm leading the TF-Agents team.

EUGENE BREVDO: And I'm Eugene Brevdo, a software engineer on the Google Brain team. I work in reinforcement learning.

SERGIO GUADARRAMA: So today we're going to talk about reinforcement learning. How many of you remember how you learned to walk? You stumble a little bit. You try one step, it doesn't work. You lose your balance. You try again. When you're trying to learn something that is hard, you need a lot of practice; you need to try many times. So this cute little robot is basically trying to do that: just moving its legs. It doesn't coordinate very well, but it's trying to learn how to walk. After trying multiple times, in this case 1,000 times, it learns a little bit how to take its first steps, moving forward a little before falling over. If we let it train a little longer, it's able to actually walk around, go from one place to another, and find its way around the room. You have probably heard about the applications of reinforcement learning over the last couple of years, including recommender systems, data center [INAUDIBLE], real robots like that one, chemistry, math, this little [? killer ?] robot, and also AlphaGo, which plays Go better than any human.

Now I have a question for you. How many of you have tried to actually implement an RL algorithm? OK, I see quite a few hands. Very good. It's hard. [LAUGHTER] Yeah, we went through that pain too. Many people who try get a first prototype right away. It seems to be working, but then you're missing a lot of different pieces. All the details have to be right, all the bugs fixed, everything, because it's very unstable. [INAUDIBLE] is a feature. There are a lot of pieces, like a replay buffer; there's a lot you need to do. We suffered through the same problem at Google, so we decided to implement a library that many people can use. And today the TF-Agents team is very happy to announce that it's available online. You can go to GitHub, pip install it, and start using it right away. And hopefully you will provide feedback and contributions so we can make it better over time.

So, what is TF-Agents and what does it provide? We tried to make it a robust, scalable, and easy-to-use reinforcement learning library for TensorFlow, one that is easy to debug, easy to try, and easy to get good results with. For people who are new to reinforcement learning, we have colabs, documentation, and samples so you can learn about it. For people who want to solve a real, complex problem, we already provide state-of-the-art algorithms you can apply [? very quickly. ?] For researchers who want to develop new RL algorithms, they don't need to build every single piece; they can build on top of the library. We make it well-tested and easy to configure, so you can start doing your experiments right away. We build on top of all the goodies of TensorFlow 2.0 that you just saw today: eager execution to make development and debugging a lot easier, tf.keras to build the networks and models, and tf.function when you want to make things go faster. And we make it very modular and extensible. It's à la carte: you can cherry-pick the pieces that you need and extend them as needed.
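As a minimal getting-started sketch, not from the talk: the PyPI package name tf-agents and the CartPole-v0 Gym environment are assumptions, used only to illustrate installing the library and poking at the environment spec API that is described in more detail below.

# A minimal sketch, assuming the package is published on PyPI as tf-agents:
#   pip install tf-agents
from tf_agents.environments import suite_gym, tf_py_environment

# Load a standard Gym environment and wrap it for TensorFlow.
py_env = suite_gym.load('CartPole-v0')   # CartPole-v0 is just an example task
tf_env = tf_py_environment.TFPyEnvironment(py_env)
print(tf_env.observation_spec())
print(tf_env.action_spec())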
And for those who are not ready for the change yet, we make it compatible with TensorFlow 1.14.

If we go back to the example of the little robot learning to walk, this is, in a nutshell, what the code looks like. You define some networks, in this case an actor distribution network and a critic network, then an actor and an agent, [INAUDIBLE] agent in this case. And then, assuming we have some experience already collected, we can just train on it.

TF-Agents already provides a lot of RL environments, like [INAUDIBLE], Atari, MuJoCo, PyBullet, DM Control, and maybe yours soon. We also provide state-of-the-art algorithms, including DQN, TD3, PPO, [INAUDIBLE] and many others, with more coming soon, and hopefully more from the community. They are fully tested, with quality regression tests and speed tests to keep things working.

As an overview, the system looks like this. On the left side is the collection part: some policy interacts with the environment and collects experience, which we typically put in a replay buffer to use later. On the right side is the training pipeline, where we read from that experience and our agent learns to improve the policy by training a neural network.

Let's focus for a bit on the environment. How do we define a problem, a new task? Let's take another example, in this case Breakout. The idea is that you play this game by moving the paddle left and right, trying to break the bricks at the top. If you break bricks, you get rewards, so the points go up. If you let the ball drop, the points go down. The agent receives some observation from the environment, in this case multiple frames, decides which action to take, and based on that gets some reward. Then the loop repeats.

In code, this looks something like the following. You define the observation spec: what kind of observations does this environment provide? In this case it's frames, but it could be any tensor, any other information, multiple cameras, multiple things. Then the action spec: what actions can I take in this environment? In this case only left and right, but many other environments offer more options. Then a reset method, because we're going to play this game many times, so we have to reset the environment, and a step method: taking an action produces a new observation and gives us a reward.

Given that, we could define a policy by hand, for example, and start playing this game. You create an environment, define your policy, reset the environment, and loop, playing the game. If your policy is very good, you will get a good score. To make learning even faster, we provide parallel environments, so you can run these games in parallel multiple times and wrap them in TensorFlow so it goes even faster, and then run the same loop again. In general, though, we don't want to define these policies by hand, so let me hand it over to Eugene, who's going to explain how to learn those policies.

EUGENE BREVDO: Thank you, Sergio. So, as Sergio said, we've given an example of how you would interact with an environment via a policy.
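Not the code from the slides, but a rough sketch of the environment API just described (observation and action specs, a reset method, a step method) and of interacting with it through a policy. The toy environment, its observation shapes, and its reward logic here are invented stand-ins for Breakout, and a built-in random policy stands in for a hand-written one.

import numpy as np
from tf_agents.environments import py_environment, tf_py_environment
from tf_agents.policies import random_tf_policy
from tf_agents.specs import array_spec
from tf_agents.trajectories import time_step as ts

class ToyBreakoutEnv(py_environment.PyEnvironment):
  """A toy stand-in environment: just specs, _reset, and _step."""

  def __init__(self):
    # Two discrete actions: 0 = left, 1 = right.
    self._action_spec = array_spec.BoundedArraySpec(
        shape=(), dtype=np.int32, minimum=0, maximum=1, name='action')
    # Observations are small fake image frames (made-up shape).
    self._observation_spec = array_spec.ArraySpec(
        shape=(84, 84, 4), dtype=np.float32, name='observation')
    self._steps = 0

  def action_spec(self):
    return self._action_spec

  def observation_spec(self):
    return self._observation_spec

  def _reset(self):
    self._steps = 0
    return ts.restart(np.zeros((84, 84, 4), dtype=np.float32))

  def _step(self, action):
    self._steps += 1
    obs = np.zeros((84, 84, 4), dtype=np.float32)
    reward = 1.0 if action == 1 else 0.0   # made-up reward logic
    if self._steps >= 100:                 # made-up episode length
      return ts.termination(obs, reward)
    return ts.transition(obs, reward=reward, discount=0.99)

# Wrap the Python environment for TensorFlow and interact with it via a policy.
tf_env = tf_py_environment.TFPyEnvironment(ToyBreakoutEnv())
policy = random_tf_policy.RandomTFPolicy(
    tf_env.time_step_spec(), tf_env.action_spec())

time_step = tf_env.reset()
episode_return = 0.0
while not time_step.is_last():
  action_step = policy.action(time_step)
  time_step = tf_env.step(action_step.action)
  episode_return += float(time_step.reward)
print('Episode return:', episode_return)

For running several copies in parallel, the same Python environment constructor can be passed multiple times to ParallelPyEnvironment before wrapping it in TFPyEnvironment.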
And we're going to go into a little bit more detail and talk about how to make policies and how to train them to maximize rewards.

So, going over it again: policies take observations and emit a distribution over actions. In this case, the observations are an image, or a stack of images. An underlying neural network converts those images to the parameters of the distribution, and then the policy emits that distribution, or you might sample from it to actually take actions.

So let's talk about networks. I think you've seen some variation of this slide over and over again today. A network, in this case a network used for deep Q-learning, is essentially a container for a bunch of keras layers. Your inputs go through a convolution layer and so on and so forth, and the final layer emits logits over the number of actions that you might take. The core method of the network is call: it takes observations and a state, possibly an RNN state, and emits the logits and the new, updated state.

OK, so let's talk about policies. We provide a large number of policies, some of them specifically tailored to particular algorithms and particular agents, but you can also implement your own, so it's useful to go through that. A policy takes one or more networks. The fundamental method on a policy is the distribution method. It takes the time step, which essentially contains the observation, passes it through one or more networks, and computes the parameters of the output distribution, in this case logits. It then returns a tuple of three things. The first is an actual distribution object: Josh Dillon just spoke about TensorFlow Probability, and here this is a TensorFlow Probability Categorical distribution built from those logits. It also emits the next state, again possibly containing some RNN state information, and it emits side information. Side information is useful: perhaps you want to emit something to log in your metrics that's not the action, or information that is necessary for training later on. The agent is going to use that information to actually train.

OK, so now let's actually talk about training. The agent class encapsulates the main RL algorithm, which includes training on batches of data, trajectories, to update the neural network. Here's a simple example. First you create a deep Q-learning agent and give it a network. You can access a policy, specifically a collect policy, from that agent. That policy uses the underlying network you passed in and maybe performs some additional work, for example exploration such as epsilon-greedy, and it also logs the side information that will be necessary to train the agent. The main method on the agent is train. It takes experience in the form of batched trajectories; these come, for example, from a replay buffer.

Now, assuming you have trained your networks and you're performing well during data collection, you might also want a policy that takes greedy actions and doesn't explore at all; it just exploits, taking the actions it thinks are best, and it doesn't emit any side information. That's the deployment policy. You can save it as a SavedModel, for example, and deploy it.
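A hedged sketch of that workflow, assuming the DQN pieces of TF-Agents (QNetwork, DqnAgent, PolicySaver); the CartPole-v0 environment, the layer sizes, and the output path are placeholders rather than anything from the slides.

import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import q_network
from tf_agents.policies import policy_saver
from tf_agents.utils import common

tf_env = tf_py_environment.TFPyEnvironment(suite_gym.load('CartPole-v0'))

# A Q-network is a container of keras layers; its call() maps observations to logits.
q_net = q_network.QNetwork(
    tf_env.observation_spec(),
    tf_env.action_spec(),
    fc_layer_params=(100, 50))          # placeholder layer sizes

agent = dqn_agent.DqnAgent(
    tf_env.time_step_spec(),
    tf_env.action_spec(),
    q_network=q_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    td_errors_loss_fn=common.element_wise_squared_loss,
    epsilon_greedy=0.1)                 # exploration used by the collect policy
agent.initialize()

collect_policy = agent.collect_policy   # explores and emits the side info needed for training
greedy_policy = agent.policy            # deployment policy: exploits, no side info

# Export the deployment policy as a SavedModel (hypothetical output path).
saver = policy_saver.PolicySaver(greedy_policy)
saver.save('/tmp/dqn_policy')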
So here's a more complete example. Again, we have a deep Q-learning network. It accepts the observation and action specs from the environment, plus some other arguments describing what kind of keras layers to combine. You build the agent with that network. Then you get a tf.data dataset, in this case from a replay buffer object, but you can get it from any other dataset that emits trajectories in the correct batched form. And then you iterate over that dataset, calling agent.train to update the underlying neural networks, which are then reflected in the updated policies.

So let's talk a little bit about collection. Given a collect policy, and it doesn't have to be trained, it can have just random parameters, you want to be able to collect data, and we provide a number of tools for that. Again, if your environment is something that lives in Python, you can wrap it. The core tool for this is the driver. Going through it: first you create your batched environments at the top. Then you create a replay buffer, in this case a TF uniform replay buffer, which is backed by TensorFlow variables. And then you create the driver. The driver accepts the environment, the collect policy from the agent, and a number of callbacks, the observers. When you call driver.run, it will iterate, in this case taking 100 steps of interaction between the policy and the environment, create trajectories, and pass them to the observers. So after driver.run has finished, your replay buffer has been populated with a hundred more frames of data.

Here's the complete picture. You create your environment. You interact with that environment through the driver, given a policy. Those interactions get stored in the replay buffer. You read from the replay buffer with a tf.data dataset, and the agent trains on batches from that dataset and updates the network underlying the policy. Here's a set of commands to do that. If you look at the bottom, here's the loop: you call driver.run to collect data, which is stored in the replay buffer; then you read from the dataset generated from that replay buffer and train the agent. You can iterate this over and over again.
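As a sketch of that collect-and-train loop end to end, continuing the same hypothetical DQN setup as above; the environment, layer sizes, buffer capacity, batch size, and iteration counts are arbitrary placeholders.

import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import q_network
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.utils import common

# Environment, network, and agent (placeholder environment and layer sizes).
tf_env = tf_py_environment.TFPyEnvironment(suite_gym.load('CartPole-v0'))
q_net = q_network.QNetwork(
    tf_env.observation_spec(), tf_env.action_spec(), fc_layer_params=(100,))
agent = dqn_agent.DqnAgent(
    tf_env.time_step_spec(), tf_env.action_spec(), q_network=q_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    td_errors_loss_fn=common.element_wise_squared_loss)
agent.initialize()

# Replay buffer backed by TensorFlow variables.
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=tf_env.batch_size,
    max_length=10000)

# The driver runs the collect policy in the environment for num_steps steps
# and passes the resulting trajectories to its observers.
driver = dynamic_step_driver.DynamicStepDriver(
    tf_env, agent.collect_policy,
    observers=[replay_buffer.add_batch],
    num_steps=100)

driver.run()   # initial collection, so the buffer is not empty

# Read training batches back out of the replay buffer as a tf.data dataset.
dataset = replay_buffer.as_dataset(
    sample_batch_size=64, num_steps=2, num_parallel_calls=3).prefetch(3)
iterator = iter(dataset)

for _ in range(1000):              # arbitrary number of training iterations
  driver.run()                     # collect 100 more steps of experience
  experience, _ = next(iterator)   # sample a batch of trajectories
  loss_info = agent.train(experience)

The pieces only share data specs (agent.collect_data_spec and the environment specs), which is what lets the driver, replay buffer, dataset, and agent be swapped out independently.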
OK, so we have a lot of exciting things coming up. We have a number of new agents that we're going to release, C51, D4PG, and so on. We're adding complete support for contextual bandits backed by neural networks to the API. We're going to release a number of baselines, as well as a number of new replay buffers. In particular, we're going to be releasing some distributed replay buffers in the next couple of quarters, and those will be used for distributed collection. Distributed collection allows you to parallelize your data collection across many machines and maximize the throughput of training your RL algorithm. We're also working on distributed training using TensorFlow's new distribution strategy API, allowing you to train at massive scale on many GPUs and TPUs. And we're adding support for more environments.

So please check out TF-Agents on GitHub. We have a number of colabs, I think eight or nine at last count, exploring different parts of the system. And as Sergio said, TF-Agents is built to solve many real-world problems. We're interested in seeing what your problems are, and we welcome contributions: new environments, new RL algorithms, for those of you out there who are RL experts. Please come chat with me or Sergio after the talks, or file an issue on the GitHub issue tracker and let us know what you think. Thank you very much.

[MUSIC PLAYING]
Info
Channel: TensorFlow
Views: 44,540
Rating: 4.8615179 out of 5
Keywords: type: Conference Talk (Full production); pr_pr: TensorFlow; purpose: Educate
Id: -TTziY7EmUA
Length: 15min 58sec (958 seconds)
Published: Wed Mar 06 2019