[MUSIC PLAYING] JEFF DEAN: I'm excited to
be here today to tell you about how I see deep
learning and how it can be used to solve some of
the really challenging problems that the world is facing. And I should point out
that I'm presenting the work of many, many
different people at Google. So this is a broad perspective
of a lot of the research that we're doing. It's not purely my work. So first, as I'm sure you've all noticed, machine learning is
growing in importance. There's a lot more emphasis
on machine learning research. There's a lot more uses
of machine learning. This is a graph showing
how many arXiv papers-- arXiv is a preprint
hosting service for all kinds of
different research. And this is the
subcategories of it that are related to
machine learning. And what you see is that,
since 2009, we've actually been growing the number of
papers posted at a really fast exponential rate, actually
faster than the Moore's Law growth rate of computational
power that we got so nicely used to for 40 years but that has now slowed down. So we've replaced the nice
growth in computing performance with growth in people
generating ideas, which is nice. And deep learning is
this particular form of machine learning. It's actually a
rebranding in some sense of a very old set of
ideas around creating artificial neural networks. These are collections of
simple trainable mathematical units organized in layers where
the higher layers typically build higher levels
of abstraction based on things that the
lower layers are learning. And you can train these
things end to end. And the algorithms that
underlie a lot of the work that we're doing
today actually were developed 35, 40 years ago. In fact, my colleague
Geoff Hinton just won the Turing Award this
year along with Yann LeCun and Yoshua Bengio
for a lot of the work that they did over the
past 30 or 40 years. And really the
ideas are not new. But what's changed is that we got amazing results 30 or 40 years ago on toy-ish problems but didn't have the computational resources to make these approaches work on real, large-scale problems. But starting about
eight or nine years ago, we started to have enough
computation to really make these approaches work well. So what are these things? Think
of a neural net as something that can learn really
complicated functions that map from input to output. Now that sounds
kind of abstract. You think of functions as like
y equals x squared or something. But really these functions
can be very complicated and can learn from
very raw forms of data. So you can take the
pixels of an image and train a neural
net to predict what is in the image as a
categorical label, like "that's a leopard."
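To make that concrete, here is a minimal sketch of that kind of pixels-to-label function in TensorFlow. This is a toy illustration, not the actual production model; the layer sizes and the 1,000-category label set are just assumptions.

```python
import tensorflow as tf

num_classes = 1000  # e.g. an ImageNet-style label set (assumed for illustration)

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(224, 224, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Trained end to end on (image, integer label) pairs:
# model.fit(train_images, train_labels, epochs=5)
```

That's one of my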
vacation photos. From audio wave
forms, you can learn to predict a transcript
of what is being said, like "How cold is it outside?" You can learn to take input
in one language-- hello, how are you-- and predict the output being
that sentence translated into another language. [SPEAKING FRENCH] You can even do more
complicated things like take the pixels of an
image and create a caption that describes the image. It's not just a category. It's like a simple sentence: "A cheetah lying on top of a car," which is kind of unusual anyway. Your prior for that should be pretty low. And in the field
of computer vision, we've made great strides
thanks to neural nets. In 2011, in the Stanford ImageNet contest, which is a contest held every year, the winning entry did not use neural nets. That was the last year that was true. They got 26% error, and that won the contest. And this is not a trivial task. So humans themselves
have about 5% error, because you
have to distinguish among 1,000 different
categories of things. Given a picture of a dog, for example, you have to say which of 40 breeds of dog it is. So it's not a completely
trivial thing. And in 2016, for example, the
winning entry got 3% error. So this is just a
huge fundamental leap in computer vision. You know, computers
went from basically not being able to see in 2011 to now being able to see pretty darn well. And that has huge ramifications for all kinds of things in the world, not just computer science, but the application of
machine learning and computing to perceiving the
world around us. OK. So here's how I'm going to frame the rest of this talk: in 2008, the US National Academy of Engineering
published this list of 14 grand engineering challenges
for the 21st century. And they got together
a bunch of experts across lots of
different domains. And they all
collectively came up with this list of
14 things, which I think you'll agree are actually pretty challenging problems. And if we made progress
on all of them, the world would be
a healthier place. It would be a safer place. We'd have more
scientific discovery. All these things are
important problems. And so given the limited
time, what I'm going to do is talk about the
ones in boldface. And we have projects in Google
Research that are focused on all the ones listed in red. But I'm not going to talk
about the other ones. And so that's kind of the
tour of the rest of the talk. We're just going to
dive in and off we go. I think we'll start with restoring and improving urban infrastructure. Right. The basic structure of cities was designed quite some time ago. But there are some changes that we're on the cusp of that are going to
really dramatically change how we might want to design cities. And, in particular,
autonomous vehicles are on the verge of
commercial practicality. This is from our Waymo
colleagues, part of Alphabet. They've been doing work in
this space for almost a decade. And the basic problem
of an autonomous vehicle is you have to perceive
the world around you from raw sensory inputs,
things like lidar, and cameras, and radar,
and other kinds of things. And you want to build a model
of the world and the objects around you and understand
what those objects are. Is that a pedestrian
or a light pole? Is it a car that's moving? What is it? And then also be able to predict what will happen a short time from now, like where is that car going to be in one second, and then make a set of decisions about what actions you want to take to accomplish your goals: get from A to B without
having any trouble. And it's really thanks
to deep learning vision based algorithms and fusing
of all the sensor data that we can actually
build maps of the world like this that
are understandings of the environment
around us and actually have these things operate
in the real world. This is not some
distant far off dream. Waymo is actually
operating about 100 cars with passengers in the
back seat and no safety drivers in the front seat in
the Phoenix, Arizona area. And so that gives you a pretty strong sense that this is pretty
close to reality. Now Arizona is one of the easier
self-driving car environments. It's like it never rains. It's too hot so there aren't
that many pedestrians. The streets are very wide. The other drivers are very slow. Downtown San
Francisco is harder, but this is a sign that
it's not that far off. Obviously, once vision works, it's easier to build robots that can
do things in the world. If you can't see, it's
really hard to do things. But if you can start to
see, you can actually have practical
robotics things that use computer vision to then
make decisions about how they should act in the world. So this is a video of a
bunch of robots practicing picking things up, and then
dropping them and picking more things up, and essentially
trying to grasp things. And it turns out that one
nice thing about robots is you can actually
collect the sensor data and pool the experience
of many robots, and then collectively train on
their collective experience, and then get a
better model of how to actually grasp things,
and then push that out to the robots. And then the next
day they can all practice with a slightly
better grasping model. That's unlike humans: a baby you plop on the carpet in your living room doesn't get to pool its experience with other babies. OK. So in 2015, the success rate
on a particular grasping task of grasping objects that a
robot has never seen before was about 65%. When we use this
kind of arm farm-- that's what that
thing is called. I wanted to call it the
armpit, but I was overruled. Basically, by collecting
a lot of experience, we were actually able to get
a pretty significant boost in grasp success
rate, up to 78%. And then with further work on
algorithms and more refinement of the approach, we're now
able to get a 96% grasp success rate. So this is pretty good
progress in three years. We've gone from failing to pick something up a third of the time, which makes it very hard to string together a whole sequence of actions and actually have robots do things in the real world, to grasping working quite reliably. So that's exciting. We've also been
doing a lot of work on how do we get robots
to do things more easily. Rather than having them
practice themselves, maybe we can demonstrate
things to them. So this is one of our
AI residents. They do fantastic
machine learning research, but they also film demonstration
videos for these robots. And what you see here
is a simulated robot trying to emulate from the
raw pixels of the video what it's seeing. And on the right, you see a
few demonstrations of pouring and the robot using
those video clips, five or 10 seconds of
someone pouring something, and some reinforcement learning
based trials to attempt to learn to pour on its own. After 15 trials and about
15 minutes of training, it's able to pour
that well, I would say like at the level
of a four-year-old, not an eight-year-old. But still, in just 15 minutes of
effort, it's able to get to that level of success,
which is a pretty big deal. OK. One of the other areas that
was in the grand challenges was advanced health informatics. I think you saw in
the keynote yesterday the work on lung cancer. We've also been
doing a lot of work on an eye disease called
diabetic retinopathy, which is the fastest growing cause
of blindness in the world. There's 115 million people
in the world with diabetes. And each of them ideally
would be screened every year to see if they have
diabetic retinopathy, which is a degenerative eye disease
that if you catch in time it's very treatable. But if you don't
catch it in time, you can suffer full or
partial vision loss. And so it's really
important that we be able to screen everyone
that is at risk for this with regular screening. And this is the
image that you get to see as an ophthalmologist. And in India, for
example, there's a shortage of more than
100,000 eye doctors to do the necessary amount
of screening of this disease. And so 45% of patients
suffer vision loss before they're diagnosed,
which is tragic, because it's a completely
preventable thing if you catch it in time. And basically, the way an
ophthalmologist looks at this is they look at these
images and they grade it on a five point scale, one,
two, three, four, or five, looking for things like
these little hemorrhages that you see on the
right hand side. And it's a little subjective. So if you ask two
ophthalmologists to grade the same image, they
agree on the score, one, two, three, four, or five,
60% of the time. And if you ask the
same ophthalmologist to grade the same image
a few hours later, they agree with themselves
65% of the time. And this is why second opinions
are useful in medicine, because some of these things
are actually quite subjective. And it's actually a big
deal because the difference between a two and a three
is "go away and come back in a year" versus "we'd better get you into the clinic next week." Nonetheless, this is actually
a computer vision problem. And so instead of having a
classification of a thousand general categories
of dogs and leopards, you can actually just have
five categories of the five levels of diabetic
retinopathy and train the model on eye images
and an assessment of what the score should be. And if you do that,
you can actually get the images labeled by
several ophthalmologists, six or seven, so that you reduce
the variance that you already see between ophthalmologists
assessing the same image. If five of them say it's a two and two of them say it's a three, it's probably more like a two than a three.
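Here is a sketch of roughly what that kind of training setup could look like, with the averaged ophthalmologist grades treated as a soft label distribution over the five severity levels. This is an illustration under assumptions (model choice, image size), not the published system.

```python
import tensorflow as tf

num_grades = 5  # the five diabetic retinopathy severity levels

base = tf.keras.applications.ResNet50(weights=None, include_top=False,
                                      pooling="avg", input_shape=(512, 512, 3))
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(num_grades, activation="softmax"),
])

# Soft labels: if five of seven graders pick one grade and two pick the next,
# the target becomes e.g. [0, 0, 5/7, 2/7, 0] instead of a single hard class.
model.compile(optimizer="adam",
              loss=tf.keras.losses.CategoricalCrossentropy(),  # accepts soft targets
              metrics=["accuracy"])
# model.fit(retina_images, averaged_grade_distributions, epochs=10)
```

And if you do that,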
then you can essentially get a model that is on
par or slightly better than the average board
certified ophthalmologist that's set at doing this
task, which is great. This is work published
at the end of 2016 by my colleagues in "JAMA,"
which is a top medical journal. We wanted to do
even better though. So it turns out you can
actually, instead of-- you can get the images labeled
by retinal specialists who have more training in
retinal eye disease. And instead of getting
independent assessments, you get three
retinal specialists in a room for each image. And you essentially
say, OK, you all have to come up with
an adjudicated number. What number do you
agree on for each image? And if you do that,
then you can train on the output of this consensus
of three retinal specialists. And you actually
now have a model that is on par with
retinal specialists, which is the gold standard
of care in this area, rather than the
not as good model trained on an
ophthalmologist's opinion. And so this is
something that we've seen born out where you have
really good high quality training data and
you can actually then train a model
on that and get the effects of retinal
specialists into the model. But the other neat thing
is you can actually have completely new discoveries. So someone new joined the
ophthalmology research team as a warm up exercise
to understand how our tools worked. Lily Peng, who is on
the stage yesterday, said, oh, why don't
you go see if you can predict age and gender
from the retinal image just to see if the machine
learning pipeline-- a person could get that machine
learning pipeline going? And ophthalmologists
can't predict gender from an eye image. They don't know how to do that. And so Lilly thought the
average that you see on this should be no better
than flipping a coin. You see a 0.5. And the person
went away and they said, OK, I've got it done. My AUC is 0.7. And Lilly is like,
hmm, that's weird. Go check everything
and come back. And so they came
back and they said, OK, I've made a
few improvements. It's now 0.8. That got people excited
because all of a sudden we realized you can
actually predict a whole bunch of interesting
things from a retinal image. In particular, you
can actually detect someone's self-reported sex. And you can predict a
whole bunch of other things like their age, things about
their systolic and diastolic blood pressure, their
hemoglobin level. And it turns out you combine
those things together and you can get a prediction of
someone's cardiovascular risk at the same level of accuracy
that normally a much more invasive blood test where you
have to draw blood, send it off to the lab, wait 24 hours,
get the lab test back. Now you can just do that
with a retinal image. So there's real hope that
this could be a new thing that if you go to
the doctor you'll get a picture of your eye taken. And we'll have a longitudinal
history of your eye and be able to learn
new things from it. So we're pretty
excited about that. A lot of the grand challenges
were around understanding molecules and chemistry better. One is engineer
better medicines. But this work that
I'm going to show you might apply to some
of these other things. So one of the things quantum
chemists want to be able to do is predict properties
of molecules. You know, will this thing
bind to this other thing? Is it toxic? What are its quantum properties? And the normal way
they do this is they have a really computationally
expensive simulator. And you plug in this
molecule configuration. You wait about an hour. And at the end of that you get
the output, which says, OK, here are the things
the simulator told you. So it turns out-- and
it's a slow process. You can't consider that
many different molecules like you might like to. It turns out you can
use the simulator as a teacher for a neural net. So you can do that. And then all of a sudden
you have a neural net that can basically learn to
do what the simulator can do, but way faster.
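Here is a minimal sketch of that "simulator as teacher" setup; the featurization, layer sizes, and number of predicted properties are assumptions for illustration. You run the slow simulator offline once to build a training set, then fit a neural net to its outputs.

```python
import tensorflow as tf

feature_dim = 256     # hypothetical featurization of a molecule
num_properties = 12   # hypothetical number of predicted quantum properties

surrogate = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation="relu", input_shape=(feature_dim,)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(num_properties),  # regression targets from the simulator
])

surrogate.compile(optimizer="adam", loss="mse")
# molecule_features, simulator_outputs come from running the expensive simulator
# offline to generate training pairs:
# surrogate.fit(molecule_features, simulator_outputs, epochs=20)
# Afterwards, surrogate(molecule_features) is orders of magnitude cheaper to evaluate.
```

And so now you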
have something that is about 300,000 times faster. And you can't
distinguish the accuracy of the output of the neural
net versus the simulator. And so that's a completely
game changing thing if you're a quantum chemist. All of a sudden your tool
is sped up by 300,000 times. And all of a sudden
that means you can do a very different
kind of science. You can say, oh, while
I'm going to lunch I should probably screen
100 million molecules. And when I come
back, I'll have 1,000 that might be interesting. So that's a pretty
interesting trend. And I think it's one
that will play out in lots and lots of
different scientific fields or engineering fields
where you have this really expensive simulator
but you can actually learn to approximate it with
a much cheaper neural net or machine learning
based model and get a simulator that's much faster. OK. Engineer the tools of
scientific discovery. I have a feeling this
14th one was just kind of a vague catch all thing
that the panel of experts that was convened decided should do. But it's pretty clear that
if machine learning is going to be a big part of scientific
discovery and engineering, we want good tools to express
machine learning algorithms. And so that's the
motivation for why we created TensorFlow is we
wanted to have tools that we could use to express
our own machine learning ideas and share them with
the rest of the world, and have other researchers
exchange machine learning ideas and put machine learning models
into practice in products and other environments. And so we released
this at the end of 2015 with this Apache 2.0 license. And basically it has this
graph based computational model that you can then optimize with
a bunch of traditional compiler optimizations and it
then can be mapped onto a variety of
different devices. So you can run the
same computation on CPUs or GPUs or our TPUs
that I'll tell you about in a minute. Eager Mode makes this graph
implicit rather than explicit, which is coming
in TensorFlow 2.0.
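As a small illustration of that graph-versus-eager distinction in TensorFlow 2.x (an example added for concreteness, not something from the talk): the same computation runs eagerly by default, and wrapping it in tf.function traces it into a graph that can be optimized and placed on CPUs, GPUs, or TPUs.

```python
import tensorflow as tf

def dense_relu(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal([8, 128])
w = tf.random.normal([128, 64])
b = tf.zeros([64])

# Eager: executes immediately, line by line.
print(dense_relu(x, w, b).shape)

# Graph: tf.function traces the Python function into a graph once,
# then reuses the optimized graph on subsequent calls.
dense_relu_graph = tf.function(dense_relu)
print(dense_relu_graph(x, w, b).shape)
```

And the community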
seems to have adopted TensorFlow reasonably well. And we've been excited by
all the different things that we've seen other
people do, both in terms of contributing to the
core TensorFlow system but also making use of it
to do interesting things. And so it's got some pretty
good engagement kinds of stats. 50 million downloads for a
fairly obscure programming packages is a fair
number that seems like a good mark of traction. And we've seen people do things. So I mentioned this in
the keynote yesterday. I like this one. It's basically a company
building fitness center for cows so you can tell
which of your 100 dairy cows is behaving a little
strangely today. There is a research team at
Penn State and the International Institute of Tropical
Agriculture in Tanzania that is building a machine
learning model that can run on device on a phone in
the middle of a cassava field without any network
connection to actually detect does this cassava
plant have disease and how should I treat it. I think this is a
good example of how we want machine
learning to run in lots and lots of environments. Lots of places in
the world sometimes you have connectivity. Sometimes you don't. A lot of cases you want
it to run on device. And it's really going
to be the future. You're going to have machine
learning models running on tiny microcontrollers, all
kinds of things like this. OK. I'm going to use the remaining
time to take you on a tour through some researchy projects
and then sketch how they might fit together in the future. So I believe what we want is
we want bigger machine learning models than we have today. But in order to
make that practical, we want models that
are sparsely activated. So think of a giant model, maybe
with 1,000 different pieces. But you activate 20 or 30 of
those pieces for any given example, rather than the
entire set of 1,000 pieces. We know this is a property
that real organisms have in their neural systems is
most of their neural capacity is not active at
any given point. That's partly how they're
so power efficient. Right. So some work we did a couple
of years ago at this point is what we call a sparsely
gated mixture of experts layer. And the essential idea is
these pink rectangles here are normal neural net layers. But between a couple
of neural net layers, we're going to insert
another collection of tiny little neural
nets that we call experts. And we're going to have
a gating network that's going to learn to activate
just a few of those. It's going to learn
which of those experts is most effective for a
particular kind of example. And the expert might
have a lot of parameters. It might be pretty large
matrix of parameters. And we're going to
have a lot of them. So we have in total eight
billion-ish parameters. But we're going to activate
just a couple of the experts on any given example.
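Here is a simplified sketch of the idea, purely for illustration (not the published implementation): a small gating network scores the experts for each example, only the top-k experts are kept, and their outputs are combined weighted by the gate probabilities.

```python
import tensorflow as tf

class SimpleMoE(tf.keras.layers.Layer):
    def __init__(self, num_experts=16, expert_units=512, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.experts = [tf.keras.layers.Dense(expert_units, activation="relu")
                        for _ in range(num_experts)]
        self.gate = tf.keras.layers.Dense(num_experts)

    def call(self, x):
        gate_logits = self.gate(x)                           # [batch, num_experts]
        top_vals, top_idx = tf.math.top_k(gate_logits, k=self.top_k)
        weights = tf.nn.softmax(top_vals, axis=-1)           # weights over the chosen experts
        # For clarity this toy version runs every expert and masks the rest;
        # a real implementation only evaluates the selected experts.
        all_outputs = tf.stack([e(x) for e in self.experts], axis=1)  # [batch, E, units]
        mask = tf.one_hot(top_idx, depth=len(self.experts))           # [batch, k, E]
        gathered = tf.einsum("bke,beu->bku", mask, all_outputs)       # [batch, k, units]
        return tf.einsum("bk,bku->bu", weights, gathered)             # [batch, units]

layer = SimpleMoE()
print(layer(tf.random.normal([4, 256])).shape)  # (4, 512)
```

And you can see that when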
you learn to route things, you try to learn to
use the expert that is most effective at
this particular example. And when you send it
to multiple experts, that gives you a signal to
train the routing network, the gating network so that it
can learn that this expert is really good when you're
talking about language that is about innovation
and researchy things like you see on
the left hand side. And this center expert
is really good at talking about playing a leading
role and central role. And the one on the right
is really good at kind of quicky adverby things. And so they actually do
develop very different kinds of expertise. And the nice thing
about this is if you compare this in a translation
task with the bottom row, you can essentially get
a significant improvement in translation accuracy. That's the blue score there. So one blue point improvement
is a pretty significant thing. We really look like one
blue point improvements. And because it has all
this extra capacity, we can actually make the
sizes of the pink layers smaller than they were
in the original model. And so we can actually
shrink the amount of computation used per word
by about a factor of two, so 50% cheaper inference. And the training time goes
way down because we just have all this extra capacity. And it's easier to train a
model with a lot of parameters. And so we have about
1/10 the training cost in terms of GPU days. OK. We've also been
doing a lot of work on AutoML, which is this
idea behind automating some of the machine learning
tasks that a machine learning researcher or engineer does. And the idea behind
AutoML is currently you think about solving a
machine learning problem where you have some data. You have some computation. And you have an ML
expert sit down. And they do a bunch
of experiments. And they kind of
stir it all together and run lots of GPU
days worth of effort. And you hopefully
get a solution. So what if we could
turn this into using more computation to replace
some of the experimentation that a machine learning-- someone with a lot of
machine learning experience would actually do? And one of the decisions that
a machine learning expert makes is what architecture, what
neural network structure makes sense for this problem. You know, should I use a 13
layer model or a nine layer model? Should it have three by three
or five by five filters? Should it have skip
connections or not? And so if you're willing to
say let's try to take this up a level and do
some meta learning, then we can basically have a
model that generates models and then try those models on the
problem we actually care about. So the basic iteration
of meta learning here is we're going to have a
model generating model. We're going to
generate 10 models. We're going to train
each of those models. And we're going to see
how well they each work on the problem we care about. And we're going to use the loss
or the accuracy of those models as a reinforcement learning
signal for the model generating model so that we can steer away
from models that didn't seem to work very well
and towards models that seem to work better. And then we just repeat a lot. And when we repeat a
lot, we essentially get more and more
accurate models over time. And it works. And it produces models that
are a little strange looking. Like they're a little
more unstructured than you might think
of a model that a human might have designed. So here we have all these
crazy skip connections. But they're analogous
to some of the ideas that machine learning
researchers themselves have come up with in. For example, the
ResNet architecture has a more structured
style of skip connection. But the basic idea is
you want information to be able to flow more directly
from the input to the output without going through as many
intermediate computational layers. And the system seems
to have developed that intuition itself. And the nice thing is
these models actually work pretty well. So if you look at
this graph, accuracy is on the y-axis for
the ImageNet problem. And computational
cost of the models, which are represented by
dots here, is on the x-axis. So generally, you
see this trend where if you have a more
computationally expensive model, you generally
get higher accuracy. And each of these
black dots here is something that was a
significant amount of effort by a bunch of top computer
vision researchers or machine learning researchers
that then they published and advanced the
state of the art at the time. And so if you apply AutoML
to this problem, what you see is that you actually exceed
the frontier of the hand generated models that the
community has come up with. And you do this both
at the high end, where you care
most about accuracy and don't care as much
about computational costs so you can get a model
that's slightly more accurate with less computational cost. And at the low end,
you can get a model that's significantly more
accurate for a very small amount of computational cost. And that, I think, is a
pretty interesting result. It says that we should really
let computers and machine learning researchers
work together to develop the best models
for these kinds of problems. And we've turned
this into a product. So we have Cloud AutoML
as a Cloud product. And you can try that
on your own problem. So if you were
maybe a company that doesn't have a lot of
machine learning researchers, or machine learning
engineers yourselves, you can actually just
take a bunch of images in and categories of things
you want to do-- maybe you have pictures from
your assembly line. You want to predict what
part is this image of. You can actually get a high
quality model for that. And we've extended this to
things more than just vision. So you can do videos, and
language, and translation. And more recently we've
introduced something that allows you to
predict relational data from other relational data. You want to predict will this
customer buy something given their past orders or something. We've also obviously continued
research in the AutoML field. So we've got some work looking
at the use of evolution rather than reinforcement
learning for the search, learning the
optimization update rule, learning the nonlinearity
function rather than just assuming we should
use ReLU or some other kind of
activation function. We've actually got some
work on incorporating both inference latency
and the accuracy. Let's say you want a
really good model that has to run in seven milliseconds. We can find the
most accurate model that will run in your time
budget allowed by using a more complicated reward function.
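One way such a combined reward might look (a sketch in the spirit of latency-aware architecture search, not the exact function used): scale accuracy by a soft penalty whenever measured latency exceeds the target budget.

```python
def search_reward(accuracy, latency_ms, target_ms=7.0, beta=-0.07):
    """Reward = accuracy * (latency / target) ** beta; beta < 0 penalizes slow models."""
    return accuracy * (latency_ms / target_ms) ** beta

print(search_reward(0.80, 7.0))   # exactly on budget: reward equals accuracy
print(search_reward(0.82, 14.0))  # a bit more accurate but 2x too slow, so penalized
```

We can learn how to augment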
data so that you can stretch the amount of label data
you have in interesting ways more effectively than
handwritten data augmentation. And we can explore
lots of architectures to make this whole search
process a bit more efficient. OK. But it's clear if we're going
to try these approaches, we're going to need more
computational power. And I think one of the
truisms of machine learning over the last decade or
so is more computational power tends to
get better results when you have enough data. And so it's really
nice that deep learning is this really
broadly useful tool across so many different
problem domains, because that means you can start
to think about specializing hardware for deep
learning but have it apply to many, many things. And so there are two properties
that deep learning algorithms tend to have. One is they're very tolerant
of reduced precision. So if you do calculations to
one decimal digit of precision, that's perfectly fine with
most of these algorithms. You don't need six or
seven digits of precision. And the other thing
is that they are all-- all these algorithms I've
shown you are made up of a handful of specific
operations, things like matrix multiplies, vector dot
products, essentially dense linear algebra.
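For example, in today's TensorFlow (an illustration of how you would opt in, not part of the talk), the Keras mixed-precision API lets a model do its math in reduced precision like bfloat16 while keeping its variables in float32:

```python
import tensorflow as tf

tf.keras.mixed_precision.set_global_policy("mixed_bfloat16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1024, activation="relu", input_shape=(256,)),
    tf.keras.layers.Dense(10),
])
print(model.layers[0].compute_dtype)   # bfloat16: the matrix math runs in reduced precision
print(model.layers[0].variable_dtype)  # float32: the parameters stay in higher precision
```

So if you can build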
machines, computers, that are really good at reduced
precision dense linear algebra, then you can accelerate lots
of these machine learning algorithms quite a lot compared
to more general purpose computers that have
general purpose CPUs that can run all kinds
of things or even GPUs which tend to be somewhat
good at this but tend to have, for example, higher precision
than you might want. So we started to
think about building specialized hardware when
I did this kind of thought exercise in 2012. We were starting to
see the initial success of deep neural nets
for speech recognition and for image
recognition and starting to think about how
would we deploy these in some of our products. And so there was
this scary moment where we realized that if speech
started to work really well, and at that time
we couldn't run it on device because
the devices didn't have enough
computational power, what if 100 million users started
talking to their phones for three minutes a day, which
is not implausible if speech starts to work a lot better.
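The back-of-the-envelope arithmetic looks something like this; the per-second compute cost here is a made-up assumption, just to show the shape of the calculation.

```python
users = 100_000_000
seconds_per_user_per_day = 3 * 60
audio_seconds_per_day = users * seconds_per_user_per_day   # 1.8e10 seconds of audio per day

cpu_seconds_per_audio_second = 1.0   # hypothetical: 1 CPU-second to recognize 1 second of speech
cpu_seconds_per_day = audio_seconds_per_day * cpu_seconds_per_audio_second
cpus_needed = cpu_seconds_per_day / 86_400                 # CPUs that would have to run 24/7

print(f"{audio_seconds_per_day:.1e} seconds of audio per day")
print(f"~{cpus_needed:,.0f} CPUs running continuously under that assumption")
```

And if we were running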
the speech models on CPUs, we would need to double the number
of computers in Google data centers, which is slightly
terrifying to launch one feature in one product. And so we started to think
about building these specialized processors for the deep learning
algorithms we wanted to run and TPU V1 has been
in production use since 2015 was really the
outcome of that thought exercise. And it's in production use
based on every query you do, on every translation you
do, speech processing, image processing. AlphaGo used
a collection of these. This is the actual racks
of machines that were competed in the AlphaGo match. You can see the
little Go board we've commemorated with on the side. And then we started to tackle
the bigger problem of not just inference, which is we
already have a trained model and you just want to apply
it, but how do you actually do training in an accelerated way. And so the second
version of TPUs are for training and inference. And that's one of
the TPU devices, which has four chips on it. This is TPU V3, which
also has four chips on it. It's got water cooling. So it's slightly scary to
have water in your computers, but we do. And then we designed
these systems to be configured together
into larger configurations we call pods. So this is a TPU V2 pod. This is a bigger TPU V3
pod with water cooling. You can actually see one of the
racks of this in the machine learning dome. And really these things
actually do provide a lot of computational power. Individual devices
with the four chips are up to 420 teraflops and have
a fair amount of memory. And then the actual
pods themselves are up to 100 petaflops of compute. This is a pretty substantial
amount of compute and really lets you
very quickly try machine learning research experiments,
train very large production models on large data
sets, and these are also now available through
our cloud products. As of yesterday, I think we
announced them to be in beta. One of the keys to
performance here is the network interconnect
between the chips in the pods is actually a
super high speed 2D mesh with wrap around links. That's why it's toroidal. And that means you can
essentially program this thing as if it's a single computer. And the software
underneath the covers takes care of distributing
the computation appropriately and can do very fast all-reduce and broadcast operations.
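In today's TensorFlow, that "program it as one computer" experience looks roughly like this (a minimal sketch; the TPU name and the model are placeholders):

```python
import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="my-tpu")  # hypothetical TPU name
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    # Variables created here are replicated across the TPU cores for you.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(512, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10),
    ])
    model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

# model.fit(train_dataset, epochs=3)  # gradient all-reduces across the chips are handled for you
```

And so, for example, you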
can use a full TPU V2 pod to train ImageNet in 7.9 minutes
versus the same problem using eight GPUs. You get 27 times faster
training at lower cost. The V3 pod is actually
even substantially larger. You can train an
ImageNet model from scratch in less than two minutes,
more than a million images per second in training, which is
essentially the entire ImageNet data set every second. And you can train very
large BERT language models, for example, as I was
discussing on stage in the keynote yesterday
in about 76 minutes on a fairly large corpus of data
which normally would take days. And so that really helps
make our researchers and ML production
systems more productive by being able to
experiment more quickly. If you can run an experiment
in two minutes, that's a very different kind of
science and engineering you do than if that
same experiment would take you a day and a half. Right. You just think about
running more experiments, trying more things. And we have lots of
models already available. OK. So let's take some of
the ideas we talked about and think about how
they might fit together. So I said we want these
really large models but have them be
sparsely activated. I think one of the things we're
doing wrong in machine learning is we tend to train a
machine learning model to do a single thing. And then we have a
different problem. We tend to train a different
model to do that other thing. And I think really we should
be thinking about how can we train models that
do many, many things and leverage the
expertise that they have in doing many things to then
be able to take on a new task and learn to do that new task
more quickly and with less data. This is, essentially,
multi task learning. But often multi task
learning in practice today means three or four
or five tasks, not thousands or millions. I think we really want to be
thinking bigger and bolder about really doing in the limit
one model for all of the things we care about. And obviously,
we're going to try to train this large model
using fancy ML hardware. OK. So how might this look? So I imagine we've
trained a model on a bunch of different tasks. And it's learned these
different components, which can be sometimes shared
across different tasks, sometimes independent,
specialized for a particular task. And now a new task comes along. So with the AutoML style
reinforcement learning, we should be able to use an
RL algorithm to find pathways
that actually get us into a pretty good
state for that new task, because it hopefully has some
commonalities with other things we've already learned. And then we might have some way
to add capacity to the system so that for a task where we
really care about accuracy, we can add a bit of capacity and
start to use that for this task and have that pathway be more
specialized for that task and therefore hopefully
more accurate. And I think that's an
interesting direction to go in. How can we think more about
building a system like that than the current kind of
models we have today where we tend to fully activate the
entire model for every example and tend to have them
just for a single task? OK. I want to close on how we should
be thinking about using machine learning and all
the different places that we might consider using it. And I think one of
the things that I'm really proud of as a company
is that last year we published a set of principles
by which we think about how we're going
to use machine learning for different things. And I think these
seven things when we look at using machine
learning in any of our products or settings we think carefully
about how are we actually fulfilling these
principles by using machine learning in this way. And I think there's more on
the actual principles website that you can go find, but I
think this is really, really important. And I'll point out that
some of these things are evolving research
areas as well as principles that we want to apply. So for example, number two,
avoid creating or reinforcing unfair bias. And bias in machine
learning models is a very real problem that you
get from a variety of sources. Could be you have
biased training data. Could be you're training
on real world data and the world
itself is biased in ways that we don't want. And so there is research that
we can apply and extend in how do we reduce
bias or eliminate it from machine learning models. And so this is an example
of some of the work we've been doing on
bias and fairness. But what we try to do
in our use of ML models is apply the best
known practices for our actual
production use but also advance the state of the art in
understanding bias and fairness and making it better. And so with that, in conclusion,
deep neural nets and machine learning are really
tackling some of the world's great challenges I think. I think we're really making
progress in a number of areas. There's a lot of
interesting problems to tackle and to still work on. And they're going to affect
not just computer science. Right. We're affecting many, many
aspects of human endeavor like medicine, science,
other kinds of things. And so I think it's a
great responsibility that we have to make sure
that we do these things right and to continue to push
for the state of the art and apply it to great things. So thank you very much. [MUSIC PLAYING]