[MUSIC PLAYING] SERGIO GUADARRAMA: So my
name is Sergio Guadarrama. I'm a senior software
engineer at Google Brain, and I'm leading
the TF-Agents team. EUGENE BREVDO: And I'm Eugene Brevdo, a software engineer on the Google Brain team. I work in
reinforcement learning. SERGIO GUADARRAMA:
So today we're going to talk about
reinforcement learning. So how many of you remember
how you learned how to walk? You stumble a little bit. You try one step,
it doesn't work. You lose your balance. You try again. So when you're trying to
learn something that is hard and you need a lot of practice,
you need to try many times. So this cute little robot is
basically trying to do that-- just moving the legs. It doesn't coordinate
very well, but it's trying to learn how to walk. After trying multiple times-- in this case 1,000 times-- it learns a little bit how to take its first steps, moving forward a little before falling over. If we let it train
a little longer, then it's able to actually
walk around, go from one place to another, and find its way around the room. You have probably heard
about all the applications of reinforcement learning
over the last couple of years, including recommender systems, data center cooling, real robots like that one, chemistry, math, this little [? killer ?] robot, but also AlphaGo, which plays Go better than any human. Now I have a question for you. How many of you have tried
to actually implement an RL algorithm? OK. I see quite a few hands. Very good. It's hard. [LAUGHTER] Yeah. We went through that pain too. Many people who try get a first prototype right away. It seems to be working, but then you realize you're missing a lot of different pieces. All the details have to be right, all the pieces, all the bugs, everything, because it's very unstable. [INAUDIBLE] is a feature. So there's a lot of
pieces-- a replay buffer, for example-- there's a lot of things you need to do. We suffered through the same problem at Google, so we decided to implement a
library that many people can use. And today the TF-Agents
team is very happy to announce that it's
available online. You can go to GitHub. You can pip install it and
start using it right away. And hopefully you will provide feedback and contributions so we can make this
better over time. So now, what is TF-Agents
and what does it provide? We tried to make it a robust, scalable, and easy-to-use reinforcement learning library for TensorFlow. So it's easy to debug, easy to try, and easy to get things going. For people who are new
to reinforcement learning, we have colabs, documentation, and samples so you can learn about it. For people who want to solve a real, complex problem, we already provide state-of-the-art algorithms that you can apply very quickly. For people who are
researchers and want to develop new RL
algorithms, they don't need to build every single piece. They can build on top of it. We make it well-tested
and easy to configure, so you can start doing
your experiments right away. We build on top of all the
goodies of TensorFlow 2.0 that you just saw
today, like TF Eager to make development and debugging a lot easier, tf.keras to build networks and models, and tf.function when you want to make things go faster. And then we make it very modular and extensible, so you can cherry-pick. It's à la carte: you can pick the pieces that you need and extend them as needed. And for those who are not
ready for the change yet, we make it compatible
with TensorFlow 1.14. So if we go back to the example of the little robot trying to walk, this
is, in a nutshell, what the code looks like. You have to define some networks, in this case an actor distribution network and a critic network, and then an actor and an agent-- [INAUDIBLE] agent in this case. And then, assuming we have some experience already collected, we can just train on it.
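A rough sketch of what that might look like with the TF-Agents Python API (this is illustrative, not the exact code from the slide: the PyBullet environment name, the layer sizes, the optimizers, and the choice of a soft actor-critic agent are all assumptions):

```python
# Hypothetical sketch of the walking-robot example: build actor/critic
# networks, an actor-critic agent, and train on collected experience.
import tensorflow as tf

from tf_agents.agents.ddpg import critic_network
from tf_agents.agents.sac import sac_agent
from tf_agents.environments import suite_pybullet, tf_py_environment
from tf_agents.networks import actor_distribution_network

# Load a PyBullet quadruped environment and wrap it for TensorFlow.
env = tf_py_environment.TFPyEnvironment(
    suite_pybullet.load('MinitaurBulletEnv-v0'))

# Actor network: maps observations to a distribution over actions.
actor_net = actor_distribution_network.ActorDistributionNetwork(
    env.observation_spec(), env.action_spec(), fc_layer_params=(256, 256))

# Critic network: maps (observation, action) pairs to value estimates.
critic_net = critic_network.CriticNetwork(
    (env.observation_spec(), env.action_spec()),
    joint_fc_layer_params=(256, 256))

# An actor-critic agent (soft actor-critic here, as an illustration).
agent = sac_agent.SacAgent(
    env.time_step_spec(), env.action_spec(),
    actor_network=actor_net,
    critic_network=critic_net,
    actor_optimizer=tf.keras.optimizers.Adam(3e-4),
    critic_optimizer=tf.keras.optimizers.Adam(3e-4),
    alpha_optimizer=tf.keras.optimizers.Adam(3e-4))
agent.initialize()

# Assuming `experience` is a batch of trajectories collected earlier,
# one training step is simply:
# loss_info = agent.train(experience)
```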
TF-Agents already provides a lot of RL environments, like OpenAI Gym, Atari, MuJoCo, PyBullet, and DM Control, and maybe yours soon. We also provide state-of-the-art algorithms, including DQN, TD3, PPO, [INAUDIBLE], and many others. And there are more coming
soon, and hopefully more from the community. They are fully tested
with quality regression tests and speed tests to keep things working. As an overview, the system looks like this. On the left side you have all
the collection aspects of it. And we're going to
have some policy. It's going to interact
with the environment and collect some experience, which we'll probably put into a replay buffer to use later. And on the right side we have the training pipeline, where we're going to read from this experience, and our agent is going to learn to improve the policy by training a neural network. Let's focus for a little
bit on the environment. How do we define a problem? How do we define a new task? Let's take another example. In this case it's Breakout. The idea is, you
have to play this game, move the paddle left
and right, and try to break the bricks at the top. If you break the bricks, you get rewards, so the points go up. If you let the ball drop, the points go down. So the agent is going to receive some observation from the environment, in this case multiple frames. It's going to decide which action to take, and then, based on that, it's going to get some reward. And then it just loops again.
In code, it looks something like this. You define the observation spec: what kind of observation does this environment provide? In this case it could be frames, but it could be any tensor, any other information-- multiple cameras, multiple things. And then the action spec: what actions can I take in this environment? In this case only
left and right. But in many other environments
we have multiple options. Then there's a reset method, because we're going to play this game a lot of times, so we have to reset the environment, and then a step method: taking an action is going to produce a new observation and give us a reward.
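As an illustration (a hypothetical sketch, not the code from the slide; the frame shape, the two-action game, and the reward logic are assumptions), a custom environment built on TF-Agents' PyEnvironment might look like this:

```python
# A minimal, hypothetical environment: two actions (left/right), image
# observations, and reset/step methods. Values are illustrative only.
import numpy as np

from tf_agents.environments import py_environment
from tf_agents.specs import array_spec
from tf_agents.trajectories import time_step as ts


class MyGameEnv(py_environment.PyEnvironment):

  def __init__(self):
    # Observations: a single 84x84 grayscale frame (could be any tensors).
    self._observation_spec = array_spec.BoundedArraySpec(
        shape=(84, 84, 1), dtype=np.float32, minimum=0.0, maximum=1.0,
        name='observation')
    # Actions: 0 = move left, 1 = move right.
    self._action_spec = array_spec.BoundedArraySpec(
        shape=(), dtype=np.int32, minimum=0, maximum=1, name='action')
    self._frame = np.zeros((84, 84, 1), dtype=np.float32)

  def observation_spec(self):
    return self._observation_spec

  def action_spec(self):
    return self._action_spec

  def _reset(self):
    # Start a new episode.
    self._frame = np.zeros((84, 84, 1), dtype=np.float32)
    return ts.restart(self._frame)

  def _step(self, action):
    # Apply the action, produce a new frame, and hand back a reward.
    # A real game would update its state here and end episodes with
    # ts.termination(observation, reward) when the game is over.
    reward = 1.0 if action == 1 else 0.0
    return ts.transition(self._frame, reward=reward, discount=1.0)
```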
Given that, we could define a policy by hand, for example, and start playing this game. You just create an environment, define your policy, reset the environment, and start looping over it, playing the game. If your policy is very good, you will get a good score.
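For instance, a hand-written policy loop might be sketched like this (CartPole stands in here for the Breakout example, and the "lean right, push right" rule is just an illustrative stand-in):

```python
# Create an environment, define a policy by hand, and play the game.
from tf_agents.environments import suite_gym

env = suite_gym.load('CartPole-v0')  # illustrative; any PyEnvironment works

def my_policy(time_step):
  # A trivial hand-written policy: push right whenever the pole leans right.
  return 1 if time_step.observation[2] > 0 else 0

total_reward = 0.0
time_step = env.reset()
for _ in range(1000):
  action = my_policy(time_step)
  time_step = env.step(action)
  total_reward += time_step.reward
  if time_step.is_last():
    time_step = env.reset()

print('Score:', total_reward)
```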
To make the learning even faster, we provide parallel environments, so you can run these games in parallel, multiple times, and wrap them in TensorFlow so it goes even faster, and then do the same loop again.
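A sketch of that wrapping (the environment name and the number of parallel copies are illustrative):

```python
# Run several copies of the environment in parallel processes, then wrap
# the batched environment in TensorFlow so the collection loop runs fast.
from tf_agents.environments import (parallel_py_environment, suite_gym,
                                    tf_py_environment)

num_parallel = 4
parallel_env = parallel_py_environment.ParallelPyEnvironment(
    [lambda: suite_gym.load('CartPole-v0')] * num_parallel)

# TFPyEnvironment exposes reset()/step() as TensorFlow ops over the batch.
tf_env = tf_py_environment.TFPyEnvironment(parallel_env)

time_step = tf_env.reset()   # a batched TimeStep
print(tf_env.batch_size)     # -> 4
```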
In general, though, we don't want to define these policies by hand, so let me hand it over to Eugene, who's going to explain how to learn those policies. EUGENE BREVDO:
Thank you, Sergio. So, yeah, as Sergio
said, we've given an example of how you would
interact with an environment via a policy. And we're going to go into
a little bit more detail and talk about how
to make policies, how to train policies
to maximize the rewards. So kind of going over
it again, policies take observations and emit a
distribution over the actions. And in this case,
the observations are an image, or
a stack of images. There is an underlying
neural network that converts those
images to the parameters of the distribution. And then the policy
emits that distribution, or you might sample from it
to actually take actions. So let's talk about networks. I think you've seen some
variation of this slide over and over again today. A network, in this
case, a network used for deep Q learning
is essentially a container for a bunch of keras.layers. In this case your inputs go
through a convolution layer and so on and so forth. And then the final
layer emits logits over the number of actions
that you might take. The core method of the
network is the call. So it takes observations in a
state, possibly an RNN state and emits the logits in
the new updated state. OK. So let's talk about policies. First of all, we
provide a large number of policies, some of them
specifically tailored to particular algorithms
and particular agents. But you can also
implement your own. So it's useful to
go through that. So a policy takes
one of more networks. And the fundamental
method on a policy is the distribution method. So this takes the time step. It essentially contains
the observation, passes that through
one or more networks and emits the parameters
of the output distribution, in this case, logits. It then returns a
tuple of three things. So the first thing is a
actual distribution object. So Josh Dylan just spoke
about this with probability, and here's this
little probability category called distribution
built from those logits. It emits the next state,
again, possibly containing some RNN state information, and
it also emits site information. So site information is useful. Perhaps you want to
emit some information that you want to log in your
metrics that's not the action, or you maybe want to log
some information that is necessary for
training later on. So the agent is going
to use that information to actually train. OK. So now let's actually
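A sketch of how a policy built on a Q-network might be used (the environment, layer sizes, and the QPolicy wrapper are illustrative choices; exact constructor arguments may differ):

```python
# Wrap a Q-network in a policy. policy.distribution() returns a PolicyStep
# whose `action` is a categorical distribution built from the logits, plus
# a state and an info ("side information") field; policy.action() samples
# a concrete action instead.
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import q_network
from tf_agents.policies import q_policy

tf_env = tf_py_environment.TFPyEnvironment(suite_gym.load('CartPole-v0'))

q_net = q_network.QNetwork(
    tf_env.observation_spec(), tf_env.action_spec(), fc_layer_params=(100,))

policy = q_policy.QPolicy(
    tf_env.time_step_spec(), tf_env.action_spec(), q_network=q_net)

time_step = tf_env.reset()

dist_step = policy.distribution(time_step)
print(dist_step.action)   # a categorical distribution over actions
print(dist_step.state)    # e.g. RNN state (empty here)
print(dist_step.info)     # side information (empty here)

action_step = policy.action(time_step)        # samples from the distribution
next_time_step = tf_env.step(action_step.action)
```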
OK. So now let's actually talk about training. The agent class encompasses the main RL algorithm, and that includes the training: reading batches of trajectory data to update the neural network. So here's a simple example. First you create a
deep Q learning agent. You give it a network. You can access a policy,
specifically a collection policy from that agent. That policy uses the
underlying network that you passed in, and maybe
performs some additional work. Like maybe it performs
some exploration, like epsilon greedy
exploration, and also logs side information that is going to be necessary to be able to train the agent. The main method on the agent is called train. It takes experience in the form of batched trajectories. These come, for example, from a replay buffer.
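A sketch of that in code (the environment, hyperparameters, and optimizer choice are illustrative assumptions):

```python
# Create a DQN agent from a Q-network, grab its collect policy (which
# explores and logs the side information needed for training), and call
# train() on a batch of trajectories.
import tensorflow as tf

from tf_agents.agents.dqn import dqn_agent
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import q_network
from tf_agents.utils import common

tf_env = tf_py_environment.TFPyEnvironment(suite_gym.load('CartPole-v0'))

q_net = q_network.QNetwork(
    tf_env.observation_spec(), tf_env.action_spec(), fc_layer_params=(100,))

agent = dqn_agent.DqnAgent(
    tf_env.time_step_spec(),
    tf_env.action_spec(),
    q_network=q_net,
    optimizer=tf.keras.optimizers.Adam(1e-3),
    epsilon_greedy=0.1,                          # exploration during collection
    td_errors_loss_fn=common.element_wise_squared_loss)
agent.initialize()

collect_policy = agent.collect_policy   # explores, logs side information
greedy_policy = agent.policy            # exploits only; for deployment

# Given `experience`, a batch of trajectories (e.g. sampled from a replay
# buffer), one gradient update is:
# loss_info = agent.train(experience)
```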
trained your networks and you're performing well
during data collection, you also might want
to take a policy that performs more greedy action
and doesn't explore it all. It just exploits. It takes the best actions
that it thinks are the best and doesn't log any
site information, doesn't admit any
site information. So that's the deployment policy. You can save this
SaveModel, for example, and put it into the frame. So a more complete example-- again, here we have a
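A sketch of that export step, assuming the PolicySaver utility in tf_agents.policies and reusing the `agent` from the sketch above (the export path is hypothetical):

```python
# Export the deployment (greedy) policy as a SavedModel so it can be
# loaded elsewhere without the training code.
from tf_agents.policies import policy_saver

saver = policy_saver.PolicySaver(agent.policy)   # greedy, no exploration
saver.save('/tmp/deployment_policy')             # hypothetical export path

# The SavedModel can later be restored with tf.saved_model.load and used
# to map time steps to actions.
```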
So here's a more complete example. Again, here we have a deep Q-learning network. It accepts the observation
and action specs from the environment,
and some other arguments describing what kind of
keras layers to combine. You build the agent
with that network. And then you get a
tf.data data set. In this case you get it
from a replay buffer object. But you can get it
from any other data set that emits the correct form of trajectory-- batched trajectory information. And then you iterate over that data set, calling agent.train to update the underlying neural networks, which are then reflected in the updated policies.
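A sketch of that loop, assuming `agent` is the DqnAgent from the earlier sketch and `replay_buffer` is a TFUniformReplayBuffer that already holds collected data (its construction is sketched in the collection section below); batch sizes and step counts are illustrative:

```python
# Turn the replay buffer into a tf.data dataset of batched trajectories
# and train the agent on it.
dataset = replay_buffer.as_dataset(
    num_parallel_calls=3,
    sample_batch_size=64,
    num_steps=2).prefetch(3)          # yields (trajectory, sample_info) pairs

iterator = iter(dataset)

for _ in range(1000):
  experience, _ = next(iterator)      # a batch of adjacent-step trajectories
  loss_info = agent.train(experience)
```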
So let's talk a little bit about collection. Now, given a collect policy-- and it doesn't have to be trained; it can have just random parameters-- you want to be able to collect data. And we provide a number of tools for that. Again, if your environment is something that is in Python, you can wrap it. So the core tool for
this is the driver. And going through
that, first you create your batched
environments at the top. Then you create a replay buffer. In this case we have a
TF uniform replay buffer. So this is a replay buffer
backed by TensorFlow variables. And then you create the driver. So the driver accepts
the environment, the collect policy
from the agent, and a number of callbacks. And when you call
driver.run, what it will do is it will iterate,
in this case, it will take 100
steps of interaction between the policy
and the environment, create trajectories, and
pass them to the observers. So after driver.run has finished, your replay buffer has been populated with a hundred more frames of data.
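A sketch of those pieces, reusing the `agent` and batched `tf_env` from the earlier sketches (buffer size and step count are illustrative):

```python
# Create a TF-variable-backed replay buffer and a driver that runs the
# collect policy in the environment for 100 steps, pushing each trajectory
# to the observers (here, the replay buffer's add_batch).
from tf_agents.drivers import dynamic_step_driver
from tf_agents.replay_buffers import tf_uniform_replay_buffer

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,   # spec of one trajectory step
    batch_size=tf_env.batch_size,
    max_length=100000)

driver = dynamic_step_driver.DynamicStepDriver(
    tf_env,
    agent.collect_policy,
    observers=[replay_buffer.add_batch],
    num_steps=100)

# Each call interacts with the environment for 100 steps and fills the
# replay buffer with the resulting trajectories.
final_time_step, final_policy_state = driver.run()
```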
So here's kind of the complete picture. Again, you create
your environment. You interact with that
environment through the driver, given a policy. Those interactions get
stored in the replay buffer. You read from the replay buffer with a tf.data data set. And then the agent
trains with batches from that data set, and
updates the network underlying the policy. Here's kind of a set
of commands to do that. If you look at the
bottom, here's that loop. You call driver.run to collect data. It stores that in the replay buffer. And then you read from the data set generated from that replay buffer and train the agent. You can iterate this over and over again.
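Tying the pieces together, a compact end-to-end sketch might look like this (CartPole, DQN, and all hyperparameters are illustrative assumptions, not the exact commands from the slide):

```python
# Environment -> driver + collect policy -> replay buffer -> dataset ->
# agent.train, iterated over and over.
import tensorflow as tf

from tf_agents.agents.dqn import dqn_agent
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import q_network
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.utils import common

tf_env = tf_py_environment.TFPyEnvironment(suite_gym.load('CartPole-v0'))

q_net = q_network.QNetwork(
    tf_env.observation_spec(), tf_env.action_spec(), fc_layer_params=(100,))
agent = dqn_agent.DqnAgent(
    tf_env.time_step_spec(), tf_env.action_spec(),
    q_network=q_net,
    optimizer=tf.keras.optimizers.Adam(1e-3),
    td_errors_loss_fn=common.element_wise_squared_loss)
agent.initialize()

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    agent.collect_data_spec, batch_size=tf_env.batch_size, max_length=10000)

driver = dynamic_step_driver.DynamicStepDriver(
    tf_env, agent.collect_policy,
    observers=[replay_buffer.add_batch], num_steps=1)

dataset = replay_buffer.as_dataset(
    sample_batch_size=64, num_steps=2, num_parallel_calls=3).prefetch(3)
iterator = iter(dataset)

# Seed the buffer, then alternate: collect a little, train a little.
for _ in range(100):
  driver.run()
for _ in range(1000):
  driver.run()                      # collect
  experience, _ = next(iterator)    # sample
  agent.train(experience)           # train
```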
OK. So we have a lot of exciting things coming up. For example, we have
a number of new agents that we're going to release-- C51, D4PG and so on. We're adding complete support
for contextual bandits that are backed by neural
networks to the API. We're going to release
a number of baselines, as well as a number
of new replay buffers. So in particular we're going to
be releasing some distributed replay buffers in the
next couple of quarters, and those will be used for
distributed collection. So distributed
collection allows you to parallelize your
data collection across many machines,
and be able to maximize the throughput of training your RL algorithm that way. We're also working on
distributed training using TensorFlow's new
distribution strategy API, allowing you to train at a
massive scale on many GPUs and TPUs. And we're adding support
for more environments. So please check out
TF-Agents on GitHub. And we have a
number of colabs, I think eight or nine at this point, exploring different
parts of the system. And as Sergio said, TF-Agents is
built to solve many real-world problems. And in particular, we're interested in seeing what your problems are. For example, we welcome contributions of new environments and new RL algorithms, for those of you out there who are RL experts. Please come chat with me
or Sergio after the talks or file an issue on the
GitHub issue tracker. And let us know. Let us know what you think. Thank you very much. [MUSIC PLAYING]