[MUSIC PLAYING] MARTIN GORNER: Please settle in. I see people are still coming. I see empty spaces. Make yourself comfortable. While you settle
in, can you tell me has anyone built a
neural network in this room? Raise your hand. OK, quite a few. And among the others
if I say things like, let's say, convolutional
layer or cross entropy loss, if it rings a bell, if you
know what I'm talking about, raise your hand. OK, a few as well. All the others, you will
not be able to follow. No, I'm joking. No this talk is
specifically designed to take all of you developers
through the learning curve, and today we will build a neural
network together from scratch to something that I believe is
a good quality neural network. And if some of you have seen
other talks from the TensorFlow Without a PhD
series, today we will be focusing on the latest
advances in visual processing neural networks and what are
the architecture ideas that go into a good
neural network today. So let's start, and for
that we need a dataset. So what can we do? Let's head to Kaggle, the
data science community. I went there, and I found
this dataset of little 20 by 20 tiles with airplanes
and non-airplanes. Actually plane spotting
was quite a nice activity. Why don't we build something
that can recognize airplanes from not airplanes and then
continue building an airplane detection neural network. So how do we start? Well, let's take the
neural network 101 manual, and on the first page, it says this
is a fully connected or dense neural network. So the image comes in
as a set of pixels, and we flatten all the
pixels into one big vector. And we will be processing
that big vector. The white circles
you see are neurons. So a neuron in a neural network
will always do the same thing. A neuron does a weighted
sum of all of its inputs and then feeds the sum through
what is called an activation function. That's just a function,
number in, number out. But in neural networks,
this activation function is always non-linear. This is what-- this is the key
thanks to which neural networks can solve non-linear problems. So you can layer those
layers of neurons. The second layer, instead of
doing weighted sums of pixels, does weighted sums of the
outputs of the previous layer. You could have as many
layers as you want. And then at the very
end, I added a layer with just two neurons. And here, since I'm building
a classifier, something that will classify those
images into airplane, non-airplane, my hope is that
with the correct weights, these two neurons, one of them
will have a very strong output when this is a plane,
and the other one will have a very strong
output if this is not a plane. So we will need to choose the
activation functions correctly. And again looking at the
neural network 101 manual, most of the time the
activation function you use is called a ReLU. And the manual says if
you're building a classifier, there is one exception
on the very last layer. You will use a different
activation function, which is called softmax. So I will dive into those
two on the next slide, but first let's write
the code for this. So this is what it looks
like in TensorFlow. The first line just
flattens all the pixels as one big vector of pixels. And then in TensorFlow, you
have this high level API, which can instantiate an
entire layer in one line. So here I have instantiated
my three layers. The first one with 200 neurons
then 20 and the last one with two neurons. As you see, the intermediate layers are activated with this ReLU activation function.
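As a rough sketch of what that slide might contain-- illustrative, not the exact code-- the three layers could be written like this with the TensorFlow 1.x layers API (the 20 by 20 by 3 tile size is an assumption):

    import tensorflow as tf

    # X: a batch of 20x20 RGB tiles -- shape [batch, 20, 20, 3] (illustrative)
    X = tf.placeholder(tf.float32, [None, 20, 20, 3])

    Y = tf.layers.flatten(X)                             # flatten all the pixels into one big vector
    Y = tf.layers.dense(Y, 200, activation=tf.nn.relu)   # first layer, 200 neurons, ReLU
    Y = tf.layers.dense(Y, 20, activation=tf.nn.relu)    # second layer, 20 neurons, ReLU
    logits = tf.layers.dense(Y, 2)                       # last layer, 2 neurons, no activation yet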
And at the end, I need to do something different. So if you are
building a classifier, let's follow the recipe. The recipe says if you're
building a classifier, the last layer must have the
softmax activation function, and you will apply-- you will compute
now some distance between what your
network is predicting and the correct answer. You-- we are doing here
supervised training. So we don't know initially
what those weights are. We are doing weighted
sums, but we don't know what those weights are. And we will be feeding
examples into this model where we know the answers and
comparing what the model is predicting against our answers. So we need a distance function. And in a typical classifier,
the distance you will use is called cross entropy. I don't even
necessarily need to know at this point what that is. TensorFlow actually
has a function-- a combined function-- that applies this last softmax activation and computes the
cross entropy distance between what the
network predicts and the correct answer. This is called a
loss or error function, and that's what we
want to minimize. So as soon as you
have built your layers and computed your loss or error function, TensorFlow can take over. You pick any of
the optimizers that are available-- here I
chose Adam optimizer-- and you kindly ask it
to minimize this loss. This will give you a
training operation, which TensorFlow will
then repeat in a loop, and what happens in this
training operation is all the training magic. TensorFlow will look at your
loss, will differentiate it relative to all the losses-- sorry, to all the weights in the system, obtain something that is mathematically called a gradient, and apply an algorithm called gradient descent that figures out how to change
the weights in your network so as to minimize this loss--
so as to make this loss smaller. And so you will be
feeding in images and correct answers in batches. And at each batch,
TensorFlow slightly adjusts the weights
in your network so as to make the loss,
the distance between what your network is predicting and
the correct answer, smaller and smaller and smaller. That is how you train a neural network.
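Continuing the same illustrative sketch-- assuming the X placeholder and logits tensor from above, plus one-hot labels-- the combined softmax-plus-cross-entropy loss, the Adam optimizer, and the training loop might look like this:

    # labels: one-hot ground truth, shape [batch, 2] -- plane / not plane
    labels = tf.placeholder(tf.float32, [None, 2])

    # applies the softmax on the last layer and computes the cross entropy distance in one call
    loss = tf.losses.softmax_cross_entropy(onehot_labels=labels, logits=logits)

    optimizer = tf.train.AdamOptimizer(learning_rate=0.001)
    train_op = optimizer.minimize(loss)        # the training operation that gets repeated in a loop

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        # my_training_batches is a hypothetical generator yielding (images, one-hot answers)
        for batch_x, batch_y in my_training_batches():
            sess.run(train_op, feed_dict={X: batch_x, labels: batch_y})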
So just a little look at those two activation functions. So here is again just one neuron, doing a weighted sum of all of its inputs. Usually you add
something called a bias. That's another degree of
freedom, something that will be determined by training. And you feed this through
an activation function. So this ReLU
function that we have used on the intermediate layers,
this is what it looks like. It's very, very
simple function, 0 for all negative
values, identity for all positive values. The softmax activation function
that you use on the last layer, here I have represented the last layer with 10 neurons, so that's if you're
doing a classification into 10 categories. So that softmax function
is slightly more complex. Well, actually it's
just an exponential, so you compute a weighted sum and take its exponential. And then you normalize
across those 10 neurons. Well, the output-- I put in a little animation. Here is what it does. Boom. It pulls the winner apart. But without destroying--
it's not max. It doesn't completely
nullify all the bad answers. So you still have
signal if your network is mis-recognizing something. That's why it's called softmax. It's max but in a soft way. It pulls the winner apart,
but it doesn't discard all the bad-- all the negative information, which is still useful for training.
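Just to make those two functions concrete, here is a minimal numpy sketch of ReLU and softmax (my own illustration, not code from the talk):

    import numpy as np

    def relu(x):
        # 0 for all negative values, identity for all positive values
        return np.maximum(x, 0.0)

    def softmax(x):
        # exponentiate the weighted sums, then normalize across the neurons of the layer
        e = np.exp(x - np.max(x))   # subtracting the max is a standard numerical-stability trick
        return e / np.sum(e)

    scores = np.array([2.0, 1.0, 0.1])
    print(softmax(scores))   # the winner is pulled apart, but the others are not zeroed out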
So we have all that. We have built our network. Now we can train it. We have our dataset. During training-- well, so here is what goes in the witch's brew. So ReLU and softmax,
the two activation functions, the cross entropy was
the distance function we used. And it's ready. Let's test. Actually before we go there,
a word about the tooling we're using here. So this is-- the code
is in TensorFlow. The tool I'm using to
actually run the training is ML Engine on Google's Cloud. It has one super useful feature that I love, which is that whatever
infrastructure I run my training on, when the
job is done, it shuts it down. That's not rocket
science, but I just can't be bothered shutting
down my machines and figuring out if my jobs are done. ML Engine gives me this job-based view. I can show you here that I see all my jobs, running or finished. And then when they are finished,
the machine or the cluster goes down. And finally TensorBoard,
that's a visualization tool that you have in TensorFlow
where you see all your curves and you can see what's going on. So I will-- oh, no. I can't skip this. When handling images--
so I will have to introduce another
piece of technology here. Those dense neural networks
that you have seen, they work well for
a lot of things, but for images you
need something else. And those are called
convolutional neural networks. So bear with me here. In a convolutional neural
network, what is called a neuron behaves a little bit differently. A neuron-- this little cube here, that's the output-- sees only a little fraction of the image right above it. It doesn't do weighted sums of all the pixels of the image, just a little portion. And then the next neuron actually does a weighted sum of a little portion just above it, but using the same weights. So it's actually a
filtering operation. You pick a set of
weights, as many weights as I have highlighted cubes
in the image over there. My image has the red,
green, and blue channels because it's a color image. So you have red, green, and blue channels. And I'm doing weighted
sums of all these pixels but using the same
weights at each position. It's a filtering operation. And once I have moved this
filter across my entire image with correct padding
on the sides, I have as many outputs as
I had pixels initially. So how many weights did I use? Well, as many as the
highlighted pixels. So that's 4 by 4 by 3,
which is 16 times 3-- 48 weights. Typical neural
networks have something in the tens or hundreds
of thousands of weights. So we need a way of giving
this more freedom, more weights to play with. And a good way is to pick
another set of weights and repeat the operation. So you pick just another set of
weights, repeat the operation, and you obtain a new channel
of data in the output. And you can do that
again and again, which means that a convolutional
layer in a neural network will transform a data cube
into another data cube. And this is the shape
of this convolutional-- the matrix of weights. So the first two digits-- the first two numbers 4 by 4-- that's the size of
the filter in pixels. The third number is how many channels of information you are reading in the
input image so three channels red, green, blue. That's this number here. And then you repeat this
operation four times with four different sets of
weights, and as a result, you obtain four different
channels of outputs. And the four is this
last number here. So this is a convolutional layer.
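As an illustrative sketch, a convolutional layer with that 4 by 4 by 3 by 4 weight shape can be written in one line (the image size and names are my assumptions, not from the slide):

    import tensorflow as tf

    # a batch of color images: [batch, height, width, 3 channels]
    images = tf.placeholder(tf.float32, [None, 256, 256, 3])

    # 4x4 filters, reading 3 input channels, repeated for 4 output channels:
    # the weight matrix has shape [4, 4, 3, 4]
    conv1 = tf.layers.conv2d(images, filters=4, kernel_size=4,
                             padding="same", activation=tf.nn.relu)
    print(conv1.shape)   # (?, 256, 256, 4): same spatial size, 4 output channels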
And since it has a number of input channels and a number of output
channels, you can chain them. And a convolutional
network will be a sequence of those layers
transforming the data cube into another cube
then filtering that data cube again, transforming it
into a new data cube and so on. So this data cube can
grow in both directions, the vertical direction or
the horizontal direction. So we have seen here in the
vertical direction, that's just the number of times
you repeat the filtering operation with a
different set of weights. But how do you adjust the size
in the horizontal direction? Usually the idea is to
go from an image and boil the information
down into something smaller like recognizing
what is in the image. So there are two main options for that. The first one is to play with
the step of the convolution. Instead of doing those
weighted sums pixel by pixel, you jump every second pixel. Well, mechanically you obtain half as many results in the output. That's the stride
parameter that you see here, stride 2 or stride 1. And there is a second option,
which is actually more used, and that's called max pooling. And here the idea
is interesting. These are filtering operations. So as the network
trains, you will-- those filters will
train to pattern match or recognize certain
features in the image. Let's say there
is one that trains to recognize little horizontal
lines, another one that specializes in vertical
lines, and so on. So the output of the
filter is basically something like here I have
seen a little horizontal line. Here I have seen nothing. Here I have seen
nothing and so on. The max pooling operation
takes four of those in a square and just keeps the
max, which makes sense because you are interested in
the value of the filter that was maximum at that
point, because that is where the filter has
actually seen something. The other ones, which
again-- which you discard, are I've seen nothing and
that's not really interesting. So this is a basic
subsampling operation. You take your image-- well, not your image but your data cube-- and you take the data points four by four, in squares of 2 by 2, and just keep the maximum. And again, that reduces the size of the data cube horizontally.
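Both ways of shrinking the data cube are one-liners; here is a sketch reusing the images placeholder from the earlier snippet (filter counts are illustrative):

    # option 1: a stride of 2 -- the filter jumps every second pixel,
    # so you get half as many outputs in each direction
    conv2 = tf.layers.conv2d(images, filters=8, kernel_size=4, strides=2,
                             padding="same", activation=tf.nn.relu)

    # option 2: max pooling -- keep only the maximum of each 2x2 square of data points
    pooled = tf.layers.max_pooling2d(conv2, pool_size=2, strides=2)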
And the little guy, he points something out. That is something called
a 1 by 1 convolution. Well, if you're a
mathematician, that doesn't make much sense,
like 1 by 1 filter, that's just multiplying by a constant. That's not very useful. But again we are doing this
filtering multiple times with different constants. So 1 by 1 convolution
actually makes sense. It's the weighted sum of this
little column of data points. And that weighted sum
might be interesting. So we will see later that 1 by 1 convolutions actually make sense in convolutional networks. So this is what we will build. There are three
convolutional layers. We have our image here, three
convolutional layers, and then at the end, we need
to connect this to our softmax layer,
which will do the airplane, non-airplane classification. So the last data
cube, we reshape it. We flatten it out as
one big vector here. And this-- we apply normal
dense layers to this vector and end up with our softmax-- softmax activation and softmax cross entropy loss-- because this is a classifier.
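Assembled, the model sketched on that slide might look roughly like this (the filter counts and sizes are my guesses, not the ones actually used in the talk):

    import tensorflow as tf

    def plane_classifier(tiles):
        # tiles: [batch, 20, 20, 3] color tiles
        y = tf.layers.conv2d(tiles, 16, kernel_size=4, padding="same", activation=tf.nn.relu)
        y = tf.layers.conv2d(y, 32, kernel_size=4, strides=2, padding="same", activation=tf.nn.relu)
        y = tf.layers.conv2d(y, 64, kernel_size=4, strides=2, padding="same", activation=tf.nn.relu)
        y = tf.layers.flatten(y)                     # reshape the last data cube into one big vector
        y = tf.layers.dense(y, 80, activation=tf.nn.relu)
        logits = tf.layers.dense(y, 2)               # 2 logits: plane / not plane
        return logits

    # then the same softmax cross entropy loss and Adam optimizer as before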
After that, a lot of experimentation and a lot of regularization. So I will not go into this. You have many other
talks in the TensorFlow Without a PhD series that focus
on what regularization is. For now, all you need to know is that these are standard techniques that can improve convergence. And I am quite good
at those techniques. So plus another one, which
is hyperparameter tuning. As you have seen, there
are many parameters here. The number-- the
size of the filters, the number of
layers, the strides, and so on and so forth. You can-- well, if you
know what you're doing, you know the acceptable
ranges for those parameters, but still it's a lot of work to
explore this parameter space. So that's why ML Engine gives
you this hyperparameter tuning module where you just define
your space and say go ahead. Try all the combinations. But there are a couple
of ways of trying out all the combinations. The basic one is
just grid search, and that's where
you would start, just map out all the
possible combinations of parameters and search
through the whole grid. It's actually slightly
faster to do a random search than a grid search. That's a bit counter-intuitive. We are rational people. I like the grid, but it turns
out random is slightly better. But ML Engine does
a third algorithm that is called
Bayesian optimization, and I won't go into this one. It's something where,
from one set of runs, it can mathematically determine
which part of the parameter space has been mapped and where
it's still missing information and focus on that other
part of the parameter space in an optimal way. So that is the best way of
doing hyperparameter tuning, and ML Engine does that. Now, this is how you package your files to go to ML Engine: just your Python code in a folder and a config file, which usually has just this in it-- scale tier here, BASIC_GPU-- which means one machine with a GPU, and then you go run. To run, you use the gcloud command line-- gcloud ml-engine jobs submit training-- and you train. If you add these lines here to
the config file, you start-- instead of starting a
normal training job, you start a
hyperparameter tuning job. So what do you--
what did I do here? I said I want to
maximize some metric. There is a way of specifying what your metrics are in your TensorFlow code-- so here, my accuracy. I want 50 trials, 10
trials in parallel, again Bayesian optimization. So it derives useful information
from trials for the next ones. So it's better not to run all
the trials at the same time, even if you have the
necessary hardware. And then you say which
parameters you want. So I have one
parameter called LR2, which is an integer, min
max values, the scale. This other one is a
categorical parameter, and it will try all those
parameters in some optimal way. So using all this, the
network you have seen here, this one plus my best knowledge
of regularization techniques plus hyperparameter
tuning, I was able to bring this network
to an accuracy of 99.6%, and I was very,
very proud of myself until I tried this
network in real life. Let me show you a demo. Well, first of all
just a little trick, how do you transform a
classifier into a detector? It's actually fairly easy. If you have a big image,
just cut it up into 20 by 20 tiles, slightly overlapping and maybe at different resolutions, which you then resize to 20 by 20, and run the classifier. Wherever you see a plane,
you put a box there, and you have a detector. So here we have-- that's San Francisco-- and
this was my very first model. This one, not yet very
good, just the data I had, not much regularization,
not much hyperparameter tuning. I guess it looks like
it's doing something. It's not completely horrible. So it's encouraging. Then I use my best skills
to hyperparameter tune and regularize the
hell out of this. And I obtained my 99.6%
accuracy model, which I will now run for you and wait for it. It's absolutely horrible. You see here-- let me
show you-- lots and lots of false positives everywhere. It's noisy. It's not clean. So here is the first
lesson of neural networks. There is your training data. There is your
evaluation data on which you compute your accuracy. Of course, computing your
accuracy on your training data would be cheating. And then there is real life. And real life has nothing to
do with either your training or your evaluation data. Real life is hard. So with a lot of effort
actually augmenting the dataset, adding tiles of
non-planes, hoping that this would
make things better, I was able to
increase the accuracy of this model a little bit. But as you will see,
it's still not great. And that was the end of my
neural network 101 handbook. So from now on, you've
got to read papers. That's scary, so
before we go there, let me give you a couple
of just TensorFlow tips. To use ML Engine really to
the best of its capabilities, I advise you to wrap
your model into what is called an estimator API. That's just because
in an estimator, we have written for you a ton of boilerplate code that is not interesting to write, things like checkpoints-- regularly outputting checkpoints so that if your training
crashes after 24 hours, you can restart
from where you were, exporting the model at the end
so that you have something that is ready to deploy to a serving
infrastructure, or distributed training. The distribution algorithms for distributed training are also baked into the estimator. And to wrap your model in an
estimator, you need those-- sorry-- those four things here. And then you can run
train and evaluate, which will alternate training and evaluation phases so that in your output, you get
nice curves with your training data, training metrics,
evaluation metrics. You can compare
the two and so on. Mostly what you need to provide
are those four functions here. So let's go quickly
through them. It's really nothing fancy. The model function,
it's your layers. That's your model. And then it returns
whatever a model is supposed to return: the predictions, the loss. You put the loss
into an optimizer, and you've got this
training operation, which the estimator
will run in a loop, and whatever evaluation metrics you care about. So that's your model.
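A minimal model function sketch-- assuming the plane_classifier function from earlier and one-hot labels; the real code contains more than this:

    import tensorflow as tf

    def model_fn(features, labels, mode):
        logits = plane_classifier(features["image"])              # your layers
        predictions = {"classes": tf.argmax(logits, axis=-1),
                       "probabilities": tf.nn.softmax(logits)}
        if mode == tf.estimator.ModeKeys.PREDICT:
            return tf.estimator.EstimatorSpec(mode, predictions=predictions)
        loss = tf.losses.softmax_cross_entropy(onehot_labels=labels, logits=logits)
        train_op = tf.train.AdamOptimizer(0.001).minimize(
            loss, global_step=tf.train.get_global_step())
        eval_metrics = {"accuracy": tf.metrics.accuracy(
            labels=tf.argmax(labels, axis=-1), predictions=predictions["classes"])}
        return tf.estimator.EstimatorSpec(mode, predictions=predictions, loss=loss,
                                          train_op=train_op, eval_metric_ops=eval_metrics)

    estimator = tf.estimator.Estimator(model_fn=model_fn, model_dir="output_dir")  # model_dir is illustrative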
Then the training input function-- and I'm putting code
on the slides here. You don't try to
read all the code. I will give you the highlights
of what is there in the code. You will not be-- you
will not have the time to see all the syntax. So the training input
function, that's the function that
will define how your data goes into the model. And I use this dataset API. And that's really
good because this data set API is designed for
out of memory datasets. You define what your dataset
is, and then as you-- as your model is training,
the data is loaded, and the loading triggers-- the loading of additional
files from disk if the dataset does
not fit in memory. And by the way, no
dataset ever fits in memory, no real dataset. So here, for example, I'm
reading images and focus on this here. The dataset is
initialized from files, get matching files
in a directory, all the files in a directory. And then I like this syntax. It's really the
workflow I'm used to. I apply some loading
operations, which will load those files and
decompress them and so on. I usually need to shuffle my data and batch it, because the training always proceeds by batches, and usually repeat it indefinitely. And my dataset is done.
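A sketch of such a training input function with the dataset API-- the file pattern, the parse_record helper, and the batch size are all illustrative:

    import tensorflow as tf

    def train_input_fn():
        files = tf.data.Dataset.list_files("gs://my-bucket/train/*.tfrecord")  # all files in a directory
        dataset = tf.data.TFRecordDataset(files)    # load the records from those files
        dataset = dataset.map(parse_record)         # parse_record: your own decoding function
        dataset = dataset.shuffle(10000)            # shuffle the data
        dataset = dataset.batch(64)                 # training always proceeds by batches
        dataset = dataset.repeat()                  # repeat indefinitely
        return dataset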
And finally, the serving input function. So estimator also
saves periodically a snapshot of your model, which
is ready to be deployed, again, on ML Engine. ML Engine has those two parts. One is for training. The second one allows
you with one click to put your model
behind REST API. But for that, when
your model will be listening behind
this REST API, it will be receiving
data in a certain format, and usually you want to do stuff with that before feeding it into your model. So if you don't, this is the do-nothing, pass-through serving input function. But if you do, this is the
serving input function, which I used first to decompress the incoming images from JPEG to pixels.
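A do-something serving input function might look roughly like this-- a sketch that only does the JPEG decompression; the tiling takes a few more lines:

    import tensorflow as tf

    def serving_input_fn():
        # the REST API receives a batch of JPEG bytes; turn them into pixels first
        jpeg_bytes = tf.placeholder(tf.string, [None])
        def decode(jpeg):
            image = tf.image.decode_jpeg(jpeg, channels=3)
            image = tf.image.convert_image_dtype(image, tf.float32)
            return tf.image.resize_images(image, [256, 256])   # illustrative fixed size
        images = tf.map_fn(decode, jpeg_bytes, dtype=tf.float32)
        return tf.estimator.export.ServingInputReceiver(
            features={"image": images},
            receiver_tensors={"jpeg_bytes": jpeg_bytes})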
And then I also implemented, right in this, the scanning operation. You remember, to transform a classifier into a detector, you need to cut your
image into little 20 by 20 tiles overlapping
at different resolutions. Actually I was able
to do this on the fly in the deployed model. This is the code, so
it's about 10 lines. Don't read it. But I find it
interesting that I was able to do this on the
fly in the deployed model. And the demo I was
showing here, this is a JavaScript UI, which
is calling into ML Engine, sending a part of the
image to ML Engine and then getting
the results back. So this is actually live. So we need to read papers. For image work and detection, there are many papers. I want to talk about these two, and a little bit about the others, but mostly focus
on the big ideas. And the first-- oh, yes. And I need a new
dataset unfortunately because now I will be
doing real detection. Those 20 by 20 tiles of planes and non-planes, that's not good any longer. Now I have to handle-- I want to handle big images and directly output a square box around each airplane. So I had to build
my own dataset. In this case, it was
actually possible. I built myself a
little JavaScript UI, and then I went
clicking on airplanes. And, well, I-- and it
was about a day of work, so not so bad here. So let's start with Inception. This is the paper that
really brought to the table some of the big
ideas in this space. On this side, you see what the
inception model looks like. So all those little squares are convolutional layers. And what you see
is that it's weird. I told you before that
convolutional layers should be sequenced just piled
up one layer, next layer, next layer, next layer. And here we see branches
and then things coming back. What is that? So the big idea here is
that you are somewhere in your convolutional
neural network. And then you ask
yourself the question. What is the best here? Should I now add a 1 by
1 convolutional layer or maybe a 1 by 1 followed by a
3 by 3 or maybe something else? What is best? They had the idea that you could actually do all of these things in parallel and simply concatenate the results in the vertical direction. And they call this a module. Basically, during
training, the network will decide which
is the best path to use for a specific
image on a specific task. So this is this
module-based approach. That was one of the big ideas. The second big idea is called filter factorization. So what you see here is a sequence of
two 3 by 3 filters. You see-- let's look
at this bottom one. So this piece of data is used
by some weighted sum of this 3 by 3 square here. And if you look at where the data points in this 3 by 3 square are coming from, they actually come from some combinations of these white data points in this 5 by 5
piece of the previous data. So it looks like
two consecutive 3 by 3 filters do some
combinations of a 5 by 5 zone in the same
way as a 5 by 5 filter will be doing combinations of
data points in 5 by 5 zone. It's not the same combinations. But let's count the weights. A 5 by 5 filter has
5 by 5 by 1 weights-- the 1 being the depth, the number of channels. That's 25 weights. And now if we count the
weights for three-- two consecutive 3 by 3 filters, it's
3 by 3 plus 3 by 3 which is 18. Two 3 by 3 filters
are 30% cheaper in terms of number of weights
than one big 5 by 5 filter. They're not doing
the same thing, but it's worth checking
if for our purpose it wouldn't be enough. So that's the second big idea. And the third big
idea are those 1 by 1 convolutions,
which again could sound funny to a mathematician. But once you realize that
you are applying many of them and that this 1 by 1
convolution is actually a weighted sum of all
of the pieces of data in this little column,
it's like saying, well, I have many different
filtering results. I have filtered my image for
horizontal lines and then vertical lines and so on. And maybe the
feature I'm looking for is some combination
of those filters. Sometimes I'm looking for
only the horizontal line, a little bit of
the vertical ones. The 1 by 1 convolution can
give me the right combination for that purpose. Of course, the weights
defining the combination will be trained. And one more. This one is a bit of cheating. So at the very end, you do
your convolutional layers, and at the very end, you
want to classify if you're building a classifier. So let's say classify
in five classes, typically you would take the
last data cube here, reshape it into a vector,
apply a dense layer, and then softmax activation, and
you've got your five classes. But if all you're trying to
do is obtain five numbers, there is an easier way. Those dense layers, they connect
everything with everything. They tend to be heavy
in terms of weights. There is an easier way. You need five numbers. Let's take this cube of data and slice it up horizontally, like a piece of bread, into five slices, average those five slices, and you've got five numbers. You can apply a softmax activation on it if you want. And you've got the same thing with zero weights. So it's a lot cheaper.
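That trick is essentially one line: average each channel of the final data cube over all its positions. A sketch, assuming a final cube with five channels, one per class:

    import tensorflow as tf

    # final_cube: [batch, height, width, 5] -- five channels, one per class (illustrative)
    final_cube = tf.placeholder(tf.float32, [None, 8, 8, 5])
    class_scores = tf.reduce_mean(final_cube, axis=[1, 2])   # average over the whole image: [batch, 5]
    probabilities = tf.nn.softmax(class_scores)              # five numbers, zero extra weights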
But warning, it only works if you care about global information
and only global information. If you want to classify
an image as dog versus cat versus sunset,
yeah, this will work. If you actually care about
the localization of stuff in your image, like we do, you
are averaging your filtering output across the entire image. So you will lose
that information. So not good for our purpose here
but still an interesting idea. To put this all together
into a simple architecture, the inception architecture is
a bit complicated for my taste. But this Squeezenet
paper actually put all of this in a very
elegant architecture. They said, let's build everything out of these kinds of modules. So we start with a 1 by 1 convolution, with which we usually reduce the depth-- the number of channels-- of our data cube. And then we have
a parallel module, which is 1 by 1
convolution and a 3 by 3 convolution in parallel. We stack the results,
and the stacking usually increases the number of
channels in the data cube again. So they call it a
squeeze and expand layer. And when you put
those two together, they call them fire modules. And your full model
becomes a nice succession of fire modules and max pooling operations, which decrease the size of
your data cube horizontally. Then again, a sequence
of fire modules then you decrease the size
again horizontally and so on. It's simple, and I find it elegant. So let's build it.
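A fire module is only a few lines; here is a sketch of the squeeze-and-expand idea (the filter counts are illustrative):

    import tensorflow as tf

    def fire_module(x, squeeze_filters=16, expand_filters=64):
        # squeeze: a 1x1 convolution that reduces the number of channels
        s = tf.layers.conv2d(x, squeeze_filters, kernel_size=1, activation=tf.nn.relu)
        # expand: a 1x1 and a 3x3 convolution in parallel on the squeezed cube...
        e1 = tf.layers.conv2d(s, expand_filters, kernel_size=1, padding="same", activation=tf.nn.relu)
        e3 = tf.layers.conv2d(s, expand_filters, kernel_size=3, padding="same", activation=tf.nn.relu)
        # ...stacked along the channel axis, which grows the data cube again
        return tf.concat([e1, e3], axis=-1)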
But I wanted to test this Squeezenet idea against a more traditional layer, layer, layer, layer, layer architecture. And we'll go into the YOLO paper, You Only Look Once. That's the detection
paper, which we'll use to actually produce
our squares around airplanes. In that paper, they had this
very simple architecture that they called Darknet. And I thought, well, maybe
we can compare the two. So you see on the Squeezenet
side, we have this-- these squeeze modules
here-- sorry, no, here-- a 1 by 1 followed by, in parallel, a 3 by 3 and a 1 by 1 convolution, and then the max pooling operation, and then it continues. So you see the data cube
is a small one squeezed, then it's expanded then it's
squeezed then it's expanded, and so on. On the other side,
it's just a sequence. And actually the
depth of the cube is roughly constant
until the end. And we just have
these max pooling operations, which restrict the
size of the image horizontally. So we'll see which one fares better. Now how do we actually
detect airplanes? This is what the
YOLO algorithm does, and it's You Only Look Once. It's a one-shot detector. It's not You Live Only Once. It divides the
image into a grid. And it will say, well, each
grid cell will now produce a certain number of boxes. So each grid cell will
be designed to output four values, which is xy,
the center of the box-- the yellow box here. And the center can be
anywhere in the grid cell, so x and y are
between minus 1 and 1 relative to the center of the grid cell. And it will also compute
the size of this box. The size can be anything. So the box can grow to the-- to
be as big as the image itself. It's not constrained
in the grid cell. And also some confidence
factor, which tells us whether-- if there is a plane here or not. And you can adjust how
many of those boxes each grid cell is
able to generate-- 1, 2, 3. We will see what's best. This is the loss they
have in their paper. So let me-- please allow me
to simplify it a little bit. The first line
actually is kind of OK. It's the error on the
position of the box. So you compare the box you
have against your ground truth, and you see this is a square of differences of the centers, so that's the error
on the position. I like that. The second one is the error
on the size of the box. Well, they had rectangles. I have only squares, so let's drop the height and keep only the width. And for some reason
they thought it was-- in the paper, they
argue why they put in the square root of the
size and not the size itself. I didn't understand
that, so I removed it. The next thing is the error-- so this is on detected airplanes-- the error on the confidence factor where you had an airplane. Here I just replaced
the ideal value by 1. When you have an airplane,
the ideal confidence is 1. The next line is the
same thing for boxes where you do not
have an airplane, so there the ideal is 0. And the last line, they were
detecting many categories. They were actually
detecting cars and dogs and blah, blah, blah. I have only airplanes. The last line for them was
the misclassification error, detecting a car as a dog. I have only airplanes
or not airplanes. I don't care about this. They also-- so this is a composite, multi-part loss. So they had this idea that
maybe the different parts of this loss should be
weighted differently. Again, I didn't understand why, so I removed them. And that's about it.
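Written out, my simplified version of that loss is roughly this sketch-- the tensors are assumed to already be paired up with their ground truth, which is exactly the hard part discussed next:

    # pred_x, pred_y: predicted box centers; true_x, true_y: ground truth centers
    # pred_w, true_w: box sizes (squares only); pred_conf: confidence; target_plane: 1 if a plane, else 0
    position_error   = tf.reduce_sum(tf.square(pred_x - true_x) + tf.square(pred_y - true_y))
    size_error       = tf.reduce_sum(tf.square(pred_w - true_w))
    confidence_error = tf.reduce_sum(tf.square(pred_conf - target_plane))  # ideal confidence: 1 on planes, 0 elsewhere
    loss = position_error + size_error + confidence_error                  # no per-part weighting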
So the only thing that is still left-- and that is a hard problem-- is this operator they called one, which is actually the assignment between the boxes you generate and the ground truth. And that's not an easy problem. So, of course,
per grid cell, you know that if you generate--
if you have a ground truth box that is centered in
one of those grid cells, you will want to pair it with
some box generated by that grid cell. But those grid cells are allowed
to generate more than one box. And you're allowed to have more
than one airplane in a grid cell. So you might have more
than one ground truth box. How do you do the pairings? Well, there is a lot of choice. And when I looked in
the paper, they said, well, we did something simple,
and actually our algorithm is not very good for
swarm type things when you have a flock
of birds or an airport full of airplanes. It's not very good for that. Thank you, guys. So, well, we'll see. We'll try to play
with these parameters. How do we build this? So very simply
you've got the end of your convolutional network. You've got a data cube. Split it horizontally
in your end-by-end grid, and then split it vertically
in four or maybe eight or 12, depending on how many boxes you
want to generate per grid cell. And then the red boxes will
become x, the yellow boxes y, the green boxes-- the size of the box
and the blue ones will become the
confidence factor. You just average all
the values in them. And for all of
those that you need to put between
minus 1 and one, you feed them through a hyperbolic
tangent that puts them between minus 1 and 1. And all those that you need
to put between 0 and 1, you feed them through
a sigmoid, and that puts them between 0 and 1. And you have generated from an image 1, 2, 3, or 4 boxes per grid cell, which detect airplanes. That's it. We're done.
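The head that turns the last data cube into boxes is also only a few lines. A sketch for one box per grid cell-- the shapes and the choice of a sigmoid for the size are my assumptions:

    # last_cube: [batch, n, n, depth] -- the end of the convolutional network (depth divisible by 4)
    x_raw, y_raw, w_raw, c_raw = tf.split(last_cube, 4, axis=-1)   # slice the cube vertically into 4 groups
    box_x    = tf.tanh(tf.reduce_mean(x_raw, axis=-1))      # center x in [-1, 1] within the grid cell
    box_y    = tf.tanh(tf.reduce_mean(y_raw, axis=-1))      # center y in [-1, 1] within the grid cell
    box_w    = tf.sigmoid(tf.reduce_mean(w_raw, axis=-1))   # box size in [0, 1], as a fraction of the image
    box_conf = tf.sigmoid(tf.reduce_mean(c_raw, axis=-1))   # confidence in [0, 1]
    # each output is [batch, n, n]: one value per grid cell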
The only thing now is the work of the data scientist. Now starts a slow
but steady process of grinding through all the
hyperparameters of this model and trying to
improve the accuracy. Initially, this is what my
first training run looked like. Actually let me show
you my real numbers because I have them here. So this is initially what I had. I wasn't sure it was actually
doing something at all. But why not. Initially I tried a 4
by 4 grid and generating just one box per grid cell. So not very good. I quickly realized that
shuffling data is actually super important. Big progress in-- so IOU, that's
the intersection over union. It's a measure of how accurate those boxes are relative to the ground truth. So I tried with more
boxes per cell like 4 by 4 but generating four
boxes per grid cell. That doesn't work so well. Again, it's a hard problem to assign them to their ground truth counterparts. You've got four boxes per
cell, four ground truths. Which one is which? It's hard. So this was not very good. So then I went to 8 by
8 and then generating only one box per cell. Yeah, that's better because
there is no assignment problem there. It's just one box per cell. I bumped it up to 16
by 16 by 2, the grid-- the YOLO grid. That is even better,
and this time I tried to think very hard
about this assignment problem and devise an algorithm
that would rationally do the pairings. I think I went by the
distance from the center, so the first box is more likely to go with airplanes that are closer to the center of the grid cell, and the second one
more on the periphery. Then it turned out
that data augmentation, varying at random the hue and
the orientation of the images, that actually helps a lot. And finally, that loss weighting
idea, the multi-part loss, figuring out that some
parts should weigh more, it was actually a good idea. And that gave me a little
bit more accuracy as well. And on here you can see in
the reverse order, the loss-- so the error function
going down as we do this. And finally the best one
was with more layers. So these were with 12 layers. The last one is with 17 layers. And then we got something. So you want to see
this in a demo right? Let's go. First model, so
this is what we had, the best model obtained
from the classifier. Now let's try our first 16 by 16 by 2-- that's the YOLO grid: you cut images into a 16 by 16 grid, and you generate two
boxes per grid cell. Let's go. Analyze. It's much cleaner, but it's
missing a lot of planes here. This is something you can solve
with the data augmentation. Actually I did not have
any data augmentation here. By varying the
orientation of my cells and the hue of the cells randomly, you got a big boost in accuracy. Now we're talking. Now it looks like a detector. We still have some
false positives. There is one here. I saw a couple of other ones,
but it is not so bad actually. And if we jump to
a bigger model, I think that from
12 to 17 layers-- I think we are pretty good. Let's see if we are perfect. We'll see. Almost perfect. Well, it missed this
blue airplane here, but that's because I need more data with blue airplanes. But it's not so bad. And I was able to
piece this together with LEGO bricks, which appear
in literature without being a super specialist
in machine learning. And my message to you here
is that it's kind of hard, but it's not like
fluid dynamics hard. If you are a good
developer, this is a learning curve
through which you can go. I went through this
learning curve and look. My airplane detection
model now works. It will always have an accuracy
of 99-point-something percent. You can never go to 100. That's something you have
to build into your use case, your product. But you-- all of you here
are capable of building machine-learning models. Then just to finish,
I want to show you the tooling I was using. So I'm using ML Engine. ML Engine has basically
two great features. One is My Jobs. I see all my jobs running. And the other one is My Models. So with one click, I can
deploy a model to production, and it's served behind a
REST API with auto scaling. And I don't have
to manage anything. I really love that. I love my job
tuning those models. I do not like the
job of chasing VMs and which one is still running
and why blah, blah, blah. ML Engine, once you wrap your
model in this estimator API, it gives you a
one-line config access to many different
hardware architectures, including TPUs actually. But let's start
with something else. So this is what I had initially
training on just one GPU, and this is the config
file that you use for that. Scale tier here, BASIC_GPU,
that's one machine with a GPU. This model, the 17-layer
model, the bigger one, trains in 23 hours
for a cost of $28. We have faster GPUs. You can bring this
down to 10 hours like that for the same price. But with a very-- just a config
change, no change in your code, you can actually deploy this to
full cluster of five machines with five GPUs each. And it's a cluster, so you can
go to 100 if your model scales. With this one I'm down
to 4 and 1/2 hours. That means I'm much more productive. And with the better GPUs, that's actually below two hours. I wish I had that in
the very beginning. But then I thought, well,
let's try those TPU things. First of all, a little warning. Those TPUs, it's
new architecture. There is a little bit
of a porting effort to adapt your model to TPUs. We are working on that,
but right now expect to spend some time
tweaking the code. It will still work on
GPUs once you've tweaked it, but there are some
things that need to be done for this new architecture. It's a completely new chip. Oh sorry. Before that, new in TensorFlow
are those distribution strategies. So very soon-- as soon as
they ship-- it's in beta now-- you will be able to reserve one
box with multiple GPUs in it, and that's a
different trade off. So you don't have the network
communication between the GPUs, but on the other hand, you can
put 100 of them in one box. So it's a different trade off. And again, it will be available
with just a config change. So now on to TPUs. This is available. I just ported it,
so I don't want yet to show the numbers
because it's not really benchmark with a
proper benchmarking set up. But those numbers
are not secret. Cloud TPUs are all available
on ML Engine today, and I published a code
for this yesterday. So all you have to do is run it,
and you will see the numbers. And, of course, if they are--
if I am putting them here, it's because the are
more-- faster and cheaper than the second best option
you have here on this slide. And very soon, you will
be able to access not just one of those TPU boards but 64
boards so a full rack of them connected with a
high-speed interconnect and use all of that as
one big supercomputer straight from ML Engine
with just a config change. If you want to know
more about this, there is a session exactly on
TPUs right after this session, but you'll have to
cross the street. It's a very good session. So that's it. Thank you for your attention. We have seen this-- the cloud ML Engine and
cloud TPU as products. My take away is not
on the products, but what I want
you to remember is that if you are
a good developer, you can build a
machine-learning model. There are best practices. There are Lego
blocks that exists. You have to follow where
this is state of the art, but then you can piece
those pieces together. And if you want-- if you don't
want to build your own models, well, we have a set
of pre-trained models. And now we also have this Cloud AutoML product-- actually, this morning it was announced that it does not do just Vision but also more things, so I need
to update my slide. And with that, you can
just throw your data at it and let the system figure out
the architecture of your model. That's quite advanced. I'm amazed that we can build
this, and I find this really, really, really,
really marvelous. And finally if you want to learn
more about how to build models, you can check out the other
chapters of the TensorFlow Without a PhD series. And all the code and
code labs and all that is available on GitHub. You've got the URL over there. Thank you very much. [MUSIC PLAYING]