JOSH GORDON: So I'm
Josh from Google. And today we're
going to do a getting started with TensorFlow 2.0. Just while I'm setting up,
could you do me a favor and raise your hand
if you're taking a machine learning class? Can be academic or online. Awesome. How about a deep learning class? Half. OK, so I know a
couple people haven't taken a machine learning class, but most have. So I'll assume that you're
familiar with machine learning but you're relatively
new to deep learning. We'll come back to that. All right. So let's see. Today we're going to
talk about TensorFlow 2. Here's what we're going to
try to cover; we'll get through as much as we can. So I will talk for
like five minutes. And then I'll stop talking. And you can do a quick exercise. And as always, we're going
to start with MNIST because, why not? After that, we'll
look at convolution. I have a lot more to say
about convolution because I find it much more interesting. When you're learning
deep learning, oftentimes you'll
start by looking at these ridiculous
pictures of these fully connected deep neural networks
with a stack of dense layers. And personally, when I
see something like that, I have approximately
zero intuition for how and why they work. So that sucks. And it's horrible. But if you look,
surprisingly, at a convolutional neural network,
which sounds much fancier, I find it much more intuitive. Deep learning is,
unfortunately, concept soup. And this state of deep
learning right now is that you can write a deep
neural network in five minutes. I'll show you how to do that. Actually, less
than five minutes. But it can take six plus
months to really get familiar with all the concepts. But this is a good thing. And it means you can spend more
of your time thinking and less of your time messing
around with the frameworks. So that's great. And then, assuming
we get through that, I will talk through some
more advanced stuff. Some of my favorite examples-- Deep Dream, Style Transfer,
time series, stuff like that. All right. So I will come back to-- actually, let me just talk
about what deep learning is. So here's a picture that
I pulled from Wikipedia, of where we are. And I ran it through our latest
tutorial for a Deep Dream. And this is that photo
Deep Dream-ified. So if you take a look
at this for a second, what do you see occurring
in the Deep Dream-ified version? And then I'll explain
how this relates to the fundamental ideas
behind deep learning. I know this is a bit
of a random aside. But I wanted to start by
talking about something a little bit more
interesting than MNIST. So what do you see
in this photograph that wasn't here before? Yeah. There's a beaver. Yeah, there's lots of
cute little beavers. And there's quadra-eyed beavers. There's sheepdogs, eyes
everywhere, peacocks, snakes, stuff like that. So deep learning is
representation learning. And let me explain
what that means. When I started studying
machine learning, the models that I learned
about were decision trees. And I absolutely
love decision trees because if you train a tree
and you ask the question, "How is it that this tree is
classifying a piece of data?" you can print out the
tree and read the rules. It's awesome. Really, really important. So for those of you who've
taken a machine learning class, think about what
would happen if you tried to classify a photograph
like this using a decision tree. The features that
the tree can look at are going to be the pixels. And so that means if you're
the root node in the tree, you'll find whatever
pixel in your training set happens to be the most
informative to split the data. And you'll ask a
really silly question. You'll say like, "If pixel
intensity is greater than 128, then ask about the
next pixel intensity." And on 1,000 by 1,000
by three image-- three because there's three
color channels, red, green, and blue-- you have three million
features, none of which are really informative at all. And so if you think
about how wide and how deep the
decision tree that you might train to classify
an image is, it's useless. It doesn't mean anything. What deep learning
does is you basically-- I'm going to fast forward
like 80 slides just to show the idea that I
want to talk about here. What deep learning
in a nutshell is and why I like talking
about convolution. This is what deep learning does. So basically, the
way I like to think of it is, what we're
looking at here is this is a deep
convolutional neural network. And we'll come back to this. But the way I like to
think about deep learning is there's two parts. The first part is what you
see on the bottom here. And this is what you would cover
in a machine learning class. And here, this is a schematic
for multi-class logistic regression. What each of these little
cubes represents is a feature. And if we were working
with raw image data, these features would be pixels. And they're fully connected
to an output layer. And you can imagine maybe
this output node or neuron is collecting evidence that
it's a cat, it's a dog, it's a sheep, whatever. So this is just a multi-class
logistic regression unit. What the "deep" in deep learning adds is the convolutional base that we're looking at above this. What the base does is, it's a
series of convolutional layers, in this case, or a series of
dense layers, which I hate, in other cases. And what they're
doing is they're looking at the raw pixels
from the input image. And they're extracting
features as they go. So the purpose of
the first layer is to transform
pixels to edges. The second layer, from
edges to textures, textures to more
complex textures, and so on, and so forth. What this means is
that by the time you're training a
logistic regression model, it's no longer taking
the pixels as input. Instead, these are
high level features. And they're high level features
that were automatically learned from the data. So deep learning learns a
representation of the data that you can classify with a
linear layer, in a nutshell. So other words for deep
learning are automatic feature engineering or
representation learning. And unlike in traditional
machine learning, where 10 years ago
you might have come up with features like
shapes and textures using a library
like OpenCV, or you would've written a whole
bunch of giant Python pre-processing scripts, you
can learn all these features automatically. And the reason I-- I'm going to flip back like 50
slides, which you should never do in a presentation. So flipping back 50 slides,
the reason Deep Dream is interesting, the reason
we can modify this image to make all these
psychedelic shapes appear, is because we begin
with an image classifier. And this is an experiment where
we're asking the classifier to show us the type of features
that it's learned from data. But we'll come back to that. Anyway, TensorFlow is
an open source machine learning library. For the purposes
of your research, there are many
awesome open source machine learning libraries. The truth is
learning one is hard. After you've learned one,
learning multiple gets easier. What's nice about TensorFlow
2, which is what I work on and I'll talk about today,
is you can very, very roughly think of TensorFlow 2 as
Keras, plus PyTorch, plus a lot of other awesome stuff. And the reason I teach
TensorFlow in my classes at night is not because
I work for Google. It's because when students
learn how to use TensorFlow, it's easier for them
to branch to wherever. So it's a good place to start. Here are some resources for you. We're a few weeks away from
releasing TensorFlow 2. It's in beta right now. Our website is
tensorflow.org/beta. And you should skip everything
else on the website that's not there. For news and updates, we
have a blog and a Twitter. I'll share these slides
afterwards, by the way. So you don't have to
write everything down. And then what I
wanted to mention, too, we're going to
use Python today. But machine learning
is very rapidly branching out beyond Python. And I was totally wrong
about this when I was first introduced to this idea. I'm going to point it at you. And I'm probably going to
accidentally unplug things. But if I manage not
to screw this up, this should be really cool. You are now immortalized
on the video as well. Future generations of TensorFlow
students will-- anyway, so all this is running
client side in the browser. So nothing is being sent to a
server, which is a big deal. And you'll notice,
A, that was private. B, that was fast. And that's running
in JavaScript. And the basic idea
there is that a model was trained in Python
using TensorFlow 2, converted to a
JavaScript format using something called TensorFlow.js,
and then deployed in a web page. And there's other
things you can do, too. This, I believe, only
works for one person. But here we're
getting a body-part map. So there's a lot of value in
running things in the browser. I just wanted to
mention that as an FYI. It's not just Python anymore. I'm going to blaze through this. Also, FYI, another
thing you can do is you can train
models in Python and deploy them on
iOS, Android, Raspberry Pi, embedded devices, whatever. I don't have a slide for this. But let me just tell you. Briefly, TensorFlow
2 is a C++ engine. Any time you write
your code in Python, what happens is behind the
scenes that code is accelerated by C++. The reason this is
important is it's easier to write your code in Python. But also, a lot of the time,
when you run your model, you don't want to run it
using a Python interpreter. You might want to run it
on a phone or in a browser. And so TensorFlow
gives you a way to save your models in a machine
independent format that lets you deploy it where you want. So that's valuable. Anyway, this is the
picture that I hate. But what we're
looking at here is this is a fully connected
deep neural network. And we're looking at a
series of three dense layers. I'll break this down in a bit. Each dense layer is taking a
linear combination of the input features and some
weights, and applying a non-linearity, and forwards
that result to the next layer. Instead of looking at this,
because no one has intuition for that, I want to give
you intuition for what a single dense layer does. So the data set that I want
to talk about is MNIST. And if we Google
MNIST really quickly. MNIST, it's an old
computer vision data set. It's the hello world
of deep learning. There's 60,000 images of digits. The important thing about these
is they're 28 by 28 by one. So they're black
and white digits. What I want to show you
is what a dense layer does, one dense
layer, if you train it to classify MNIST digits. And this is fast. But what we're looking at here,
that's our cartoon dense layer. At the top, those would be
the pixels from a single image that we're feeding
through the network. Each of the gray lines
represents a weight. And each of the green
nodes represents an output. Let's imagine that we've
trained this dense layer for a long time on
all 60,000 digits. And now we ask
the question, what is it that the weights
are doing that lets us classify the images? What we're looking at
here is a visualization of the learned weights. So this is a fully
connected layer, which means there's one
weight for every input pixel. So every pixel in
the image would be connected to one weight. This guy here would be the
upper left input pixel. And the reason
that's an array is, you can imagine, that dense
layers can only take arrays as input. So we've unstacked
the rows of the image and lined it up into an array. So this might be the weight for Pixel 1, Pixel 2, Pixel 3. What we're doing here is
we've colored the weights. So if a weight is very high,
we've colored it in red. And if a weight is very low,
we've colored it in blue. And if you visualize
the weights, you see this red band
around the output for the 0. And that's because there's many
different ways to draw zeros. But most people don't
cross the center of the image, which is blue. And so what a single dense
layer is doing for every input feature, it's basically
assigning one weight that says if this
feature is present, how much evidence does that
give me that it corresponds to the output class? So a single dense layer is simple. But a single dense layer is
something we can interpret. In terms of writing the
code, let me stop talking. And I'll give you an exercise. And let me show you one more
thing before we do that. I just want to show
you how to write a deep neural network
in TensorFlow 2 in like two seconds. Just so you know. This code right here. We're defining a model. We're adding a
single dense layer. That would correspond to
that diagram right there. If we wanted to go from a dense
layer to a neural network, we would add one line
of code like this. And now we have
a neural network. We've added a
second dense layer. And that's a hidden layer. If we wanted a deep
neural network, we would copy and paste this. And now we have a
deep neural network. And now you will have a
deeper neural network, and so on, and so forth.
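To make that concrete, here's a sketch of the kind of stack I'm describing (the layer sizes are illustrative, not special):

```python
import tensorflow as tf

# A stack of layers: Flatten unrolls each 28x28 image into an array,
# the Dense layers are hidden layers, and the last layer outputs
# a probability distribution over the 10 digit classes.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
```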
So what I'm trying to communicate here is this part. There's like six months
of learning here, like I said earlier. It may be faster for you. But it took me a while to go
through all of these concepts. When I look at this, I see not a lot of code, but a lot of concepts. First of all, let me give
you some terminology. There's different ways to
define deep neural networks in TensorFlow 2. This is the simplest one. And here, we're saying our
network is a stack of layers. That's what it
means by sequential. This is great 95
percent of the time. As it happens, dense layers
can only take arrays as input. Flatten is a special layer. It's a pre-processing layer. It basically says,
"Give me an image. And I will unroll
it into an array." So the output of this
layer is an array. Great. Terminology. This is the depth of
the network, the number of the layers that we've added. The rough intuition. You'll see this
with convolution. But, roughly, the
more layers you have, the more combinations of
features you can detect. So maybe pixels come
in, edges, textures. And then you have
logistic regression and classify it based on the
textures it's discovered. You also have the
width of the network. And that's the number of
units or neurons per layer. Here, we've pulled
128 out of a hat. The more units per
layer, the more patterns you can
detect at that layer. Great. Lots of hyperparameters here. Designing neural networks for
a problem is a bit of an art. And over time, you learn
basically good starting points. So I've looked at
MNIST for a long time. So I know from experience
what designs might work. Unfortunately, we often have
to search around a little bit. And to be honest, it's pretty
hacky how people do it. If you're working on a more
substantial problem, usually the way you get
a starting design is you find a
paper that's close, use that as a starting
point, modify from there. There's more hyperparameters, too. There's things like the type
of activation function, which we'll talk about later. Good news. ReLU is almost always
the one you want. Here, what we're saying
is, well, whatever. Softmax is a fancy
way of saying give me a probability distribution over the output of the network. The way to read this, you start
by looking at only two things. The first is the input. So this network is going to
take some data that's 28 by 28. So a square is going in,
in this case, an image. And the thing
that's coming out-- you can ignore all this
junk in the middle-- the thing that's
coming out is going to be 10 numbers, all of
which range between 0 and 1. And they sum to 1. So basically, this means give
me a probability distribution based on some input data. All right. Let me give you an exercise
and explain how to run it. Just so you know,
TensorFlow is open source. You can definitely
install it on your laptop. It's great. Today, we're going to run it
in the cloud just to save time. And we're going to use one
of my favorite tools, which is called Colab. Has anyone seen Colab? Half? OK. We're just going to use Colab. If you have any Colab
questions, I'm happy to help. Let's start this right now. So on your laptop,
what you should do is-- I have two exercises for you. And the first one is what
I recommend you start with. If you've been working with
TensorFlow for a long time, I have an advanced
exercise which I'll show you right after this. But if you're new to
it, you should do this. Please go to this link. And what this will do
is this will connect you to our hello world
tutorial for MNIST. It's close to the
minimum amount of code you need to write
an image classifier. And I'll put this slide back. Actually, no. You're going to need this
because I have to have a second slide up in a sec. So definitely write
this one down. Or go to bit.ly/mnist-seq. S-E-Q for sequential. Let me show you. This has a long link. You're going to need
this as a reference. I'm going to bring
it up on my screen and show you how to get to it. So you don't have to
write down this long link. If you go to
tensorflow.org/beta, then you go to Machine Learning
Basics, Classify Images. So this is tensorflow.org/beta,
Machine Learning Basics, Classify Images. This is a good reference if you
want to learn more about MNIST. And I haven't described
it to save some time. But if you need reading,
this gives much more detail on what you're about to do. So, Classify Images. And let me show you what
I want you to get to. What our beginner tutorial
is missing is this diagram. And you're going to add this. So a blessing of
deep neural networks is that if you create
a deep enough network, and the layers are wide
enough, and you train it for long enough, it will
memorize pretty much any data set. And this is great. They're very powerful. We don't want to memorize
the training data, though. What you usually need to
do is get high accuracy on the validation set. So the way-- there's a key
parameter that you need to set. When you are training
your networks, of all the parameters-- meaning how many layers,
what's the width of a layer-- the one that you really
need to get right is this one at the very end. It's epochs. And, roughly, this
corresponds to how long you're training the model for. An epoch means you've used every
example from the training set once to update your weights. So here we are using
them all five times. If this number is too large, you
will overfit the training set. If it's too small,
you'll underfit. So to set it properly,
it's not rocket science. Usually, what we do is
we make plots like this. And here, we're
plotting our accuracy on the training set and the
validation set over time. So epochs is on the x-axis. Accuracy is on the y-axis. If
we set epochs to like 20, probably the accuracy
on the training set is going to hit one. But what will
happen, you'll notice that the accuracy on
the validation set, it's going to start to drop. And the goal is, basically,
to find the correct value for the number of epochs. You'd want to stop
training this thing when the accuracy on the validation
set begins to decrease, because that means you're
beginning to overfit. So what you're going to
do, go to bit.ly/mnist-seq. Add plots for the training and
validation accuracy and loss. And then find the
right number of epochs to train that model for. And here is the code. I'm giving you the
code that you can use, so you can start modifying that
example with code like this. And try and get those plots. And then find the
right number of epochs. And why don't we
work on that for-- I'll start talking
again at 4:05. So, 15 minutes. If you've been working with
TensorFlow 2 for a long time, or TensorFlow 1
for a long time-- I'll put that back in a sec. Here's a more advanced exercise. Whoops. Nice. Well, you can see the answer. But try not to see the answer. And look at this later. We read Ichkai a couple
days ago in Macau. And so it's bit.ly, slash,
and my friend had a typo. Here's a bizarre go link or a
bit.ly link-- bit.ly/ijcav_adv. And that's a more
advanced exercise where you write some pieces of
a neural network from scratch. So if you want the advanced one, it's i-j-c-a-v, underscore, A-D-V. And let me put back
the beginner one, which you should probably
start with bit.ly/mnist-seq. And here's some code
that you can use. And then if anyone has
any questions, please raise your hand. And I'll come around. There are one or two
people new to Colab. It might be distracting for a
lot of you, but I'm happy to-- I can give a quick
intro to Colab while you're working on this. Quick intro to Colab, anybody? Awesome, OK. And if you have any questions,
please raise your hand. Another thing you can try. If you add the plots quickly,
the next task that a lot of you are going to care about
is how accurate of a model can you train on MNIST without
overfitting on the validation set. And the way to train
a more accurate model is to add more dense layers
or to increase the width of the dense layers you have. That will give
you more capacity. But the larger your model, the
more likely you are to overfit. And you'll see that there's
layers you can play with, like Dropout and things like
that, that you can read about or I'll talk about
in a little bit. OK I'll keep talking. And we can keep working
on this in a little bit. So, TensorFlow 2. First of all, here's how
you install the thing, if you're not working in Colab. Basically, what I want
to mention right now is, while it's in beta, it's
important to get a named release. So if you want to install
the latest beta, here it is. Just FYI, although in Colab,
if you used it last time, you can enable a GPU with
the Edit Notebook settings. If you want to enable
a GPU, you also need to install the GPU
version of TensorFlow 2. Good practice. While we're working
on upgrading, at the top of your
scripts, just print out what version of
TensorFlow you have, just to make sure things are working.
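As a sketch, the install and version check look like this in Colab (the exact version pin will change, so check tensorflow.org/beta for the current one):

```python
!pip install tensorflow==2.0.0-beta1      # CPU-only named release
!pip install tensorflow-gpu==2.0.0-beta1  # GPU version, if you enabled a GPU

import tensorflow as tf
print(tf.__version__)  # confirm you got the version you expect
```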
So here's the first difference between TensorFlow 2 and TensorFlow 1. And let me explain
what the name means. So a tensor is a fancy
word for an array. So a scalar is a tensor. An array is a tensor. A matrix is a tensor. A cube is a tensor. So a tensor is an array. Flow refers to a
data flow graph. Under the hood, in C++, a
data flow graph is built for your program,
compiled, and executed. In TensorFlow 2, you don't
need to be aware of that or see it unless, very
rarely, you care about it. So with TensorFlow 2
installed, what we're doing is we're creating two
constants, 1 and 2. And we're going to
add them together. And if we print
this out, you'll see 1 plus 2 is 3, as
you would expect. The shape is saying
it's a tuple. And it has a data type.
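In code, that looks like this:

```python
import tensorflow as tf

a = tf.constant(1)
b = tf.constant(2)
z = a + b   # runs eagerly; no graph or session to think about
print(z)    # tf.Tensor(3, shape=(), dtype=int32)
```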
Roughly, TensorFlow 2 works like NumPy. Instead of NumPy ndarrays, we have TensorFlow tensors. The main difference
is a TensorFlow tensor can be accelerated on a GPU. And we can back prop through it. Let me show you how this is
different from TensorFlow 1. Actually, before I
get to TensorFlow 1, here's how this starts being
useful in TensorFlow 2. In the last exercise,
you poked around briefly with dense layers. Here, what I've done is
I've just written some code. And I've imported a dense layer. And I've set it up in such a way that the behavior
is very simple. And I know what
it's going to do. And then, I'm
creating some data. And I'm forwarding the data
through the dense layer. And you can see, just by
running this in Python, exactly what the result is. So this is a great way to
poke around and exactly understand the behavior of your layers very easily. So this is really useful. Also, TensorFlow tensors: if you get tired of TensorFlow, they have a .numpy() method. So you can switch back
from tensors to NumPy. And TensorFlow operations will
work with NumPy ndarrays. And NumPy operations will
work with TensorFlow tensors. So they're close friends. And that should work most of the time.
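Here's a sketch of that kind of poking around. The settings are chosen purely so the output is predictable, with every weight initialized to one:

```python
import tensorflow as tf

# A Dense layer with trivially predictable behavior: every weight
# is 1, no bias, no activation.
layer = tf.keras.layers.Dense(
    units=1, kernel_initializer='ones', use_bias=False)

data = tf.constant([[1.0, 2.0, 3.0]])
result = layer(data)
print(result)          # each output is just the sum of the inputs: [[6.]]
print(result.numpy())  # and .numpy() takes you back to NumPy land
```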
TensorFlow 1.0 was, sadly, different. So here, we're going to try
and add some numbers again. And this won't work as expected. So in TensorFlow 1-- this was a long
time ago, in 2015. Basically, TensorFlow
1 is what you would have wanted if you
were an engineer at a very large software company
and your problem was, how can I do massively
distributed deep learning? In TensorFlow 1, you
build a data flow graph. And you need to be aware
of what that graph is. And then, you run the graph. And here, if we make
those constants again, and we print Z, we
don't get three. Instead, what we get is Z prints
out to be this add operation. And that's an operation
on some data flow graph. To actually do the addition,
you had to make a session. And then, in the
session, you would execute-- this should say
Z, not X. You would run Z. And this added a lot
of mental overhead. So this is gone. It works eagerly by
default, which is great. If you want to make-- so I'm going to skip forward
like 50 million slides again. And I'm just going
to cut right to it and show you the one line of
Python you need in TensorFlow 2 to make your code run
fast and in graph mode. The only piece of code you need
to know in TensorFlow 2 that doesn't look like
regular Python is going to be a single
Python decorator. So here's some code. I've created some
random LSTM cell. This is just Python
and TensorFlow 2. I'm making some data. I'm calling the cell. And I have some crappy
benchmark to see how long that takes to run. To accelerate that, I
can add a single line, which is @tf.function. And let me explain
what this does. In this example, which
is old, by the way, it made it like
nine times faster. The speedup would probably be different now. But it makes it much faster. Here's how this works.
C++ back end is Python is slow at multiplying matrices. This is why NumPy is so popular. You write your code in Python. In NumPy, the matrices
are multiplied in C. The results go back to Python. You get a 10x or 100x speedup,
depending on what you're doing. Awesome. One problem with the TensorFlow
program or NumPy program is you're going from
Python to NumPy-- I'm sorry. Python to C, Python
to C, Python to C. So you're ping
ponging back and forth between these environments. So you get latency. If you're a compilers
engineer, which I'm not, there's other things you can
do to accelerate programs if you can look at the
whole program at once. You can compile it. You can prune pieces
that aren't used. Anyway, what @tf.function
basically says-- and it's applied
recursively-- is take any code that appears in this block. Send it to the back
end all at once. The back end compiles it,
does its magic, does the math, delivers the result once. So you run the whole
code in C. And then you get the result once. So it saves you
from the ping pong. And it can do some tricks. So that's it. So TensorFlow 2 is Python
plus @tf.function. And anything you can stick in a @tf.function, you can stick in
a saved model that will run on devices without
a Python interpreter. So that's good news.
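Here's a sketch of that one line in context (the cell and shapes are just an example):

```python
import tensorflow as tf

cell = tf.keras.layers.LSTMCell(10)

@tf.function  # trace this Python into a graph and run it in the C++ backend
def fn(inputs, state):
    return cell(inputs, state)

inputs = tf.zeros([10, 10])
state = [tf.zeros([10, 10]), tf.zeros([10, 10])]
outputs, new_state = fn(inputs, state)
```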
While we're here, in case you do distributed training down the road, distributed training in TensorFlow 2.0 is also much, much easier. So basically, ignoring
the indentation mistakes, here's something that looks very
similar to that little MNIST model we looked at a second ago. To run this on one machine with
multiple GPUs, it's just this. So we have different
distribution strategies. The way it works is, you create-- every strategy has its scope. An annotation-- oh no, it's not. I'm just tired from jet lag. Create a model inside the scope. Compile it. And when you do fit, this
will do data parallelism, which is the easiest way
to do distributed training.
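Here's roughly what that looks like with a MirroredStrategy, which does synchronous data parallelism across the GPUs on one machine (x_train and y_train stand in for your data):

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():  # create and compile the model inside the scope
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

model.fit(x_train, y_train, epochs=5)  # fit does the data parallelism
```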
What I should tell you is that this is the easy part. So, wrapping your
model in a scope. And there are scopes
for different contexts. This is not hard. You just have to
read about the scopes and figure out what they are. What's hard is your
input pipeline. So the bottleneck
is basically going to be reading data off disk
and getting it onto the GPUs fast enough that
they're not starving, which means sitting
around waiting for data. And that's getting easier. Let me just show you what
that involves right now. When you imported the data
set in this little hello world example, we used
these Keras data sets. And Keras is a wonderful library
that I'll talk about in a sec. It's built into TensorFlow 2. It has a lot of small data sets
that you can import in memory.
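For example, loading MNIST into memory is one line:

```python
import tensorflow as tf

# Small built-in datasets load straight into memory as NumPy arrays.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]
```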
Almost always, your data sets are not going to be in memory. They're going to
be sitting on disk. In TensorFlow 2, the way
you get a data set off disk and onto the GPUs
quickly is you use something called tf.data. And briefly, the best
tutorial that we have that you can check
out right now-- and don't do this now,
but for the future just so you have a reference. We're working on
cleaning these up. Load and pre-process data and,
strangely, images is the one. In my experience, that's the
one you want even if you're not working with images. But let me explain
how this works. That is not images. tf.data is a tool to
build input pipelines. The way it basically
works is this. So the first thing I want to
show you is that TensorFlow 2-- the way to think
about it is NumPy. So if you have a NumPy
operation like np.sum, you can usually find an equivalent in TensorFlow 2. There's some other stuff, too.
modules with things for like loading
images off disk, and resizing them, and
decoding them, and doing stuff like that. So there's different
utility modules. But here we have some code
that takes some image, and decodes it from JPEG,
and does some math on it, and whatever. So there's some
TensorFlow 2 code. tf.data
is a tool that you can use to build data
pipelines out of these. So basically, you can start
constructing a data set. And here, you can say
things like a data set is a stream of data. TF data has different
operations that you can apply to that stream that are useful. So here, we're saying
shuffle the data. And maybe later we'll
say batch the data and repeat the data
set for a long time. And then, there's
these interesting tools like Pre-fetch. And this is something
you can do with TF data that you can't do
easily with NumPy. So what Pre-fetch
is trying to say is get the next batch
of data onto the GPU. So it's there when it finishes
processing the current batch. So you don't have latency.
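Here's a sketch of a typical pipeline along those lines. The load_and_preprocess function and all_image_paths are placeholders for your own reading and decoding code:

```python
import tensorflow as tf

def load_and_preprocess(path):
    image = tf.io.read_file(path)                    # read the file off disk
    image = tf.image.decode_jpeg(image, channels=3)  # decode it
    return tf.image.resize(image, [192, 192]) / 255.0

dataset = tf.data.Dataset.from_tensor_slices(all_image_paths)
dataset = dataset.map(load_and_preprocess)
dataset = dataset.shuffle(buffer_size=1000)
dataset = dataset.batch(32)
# Overlap preprocessing with training, so the GPU isn't starving:
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
```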
And there's all sorts of fancy tricks like this. TL;DR: tf.data is useful. It's a bit of a hassle. And it's complex. But if you're doing larger experiments, it's worth using and worth learning.
into TensorFlow 2. And it's a huge part
of TensorFlow 2. Keras is a separate library. Let me explain what this is. If you go to Keras.io, this
is one of my favorite all-time libraries, next to scikit-learn. It's a deep learning library. It's wonderful. And what Keras does,
it's basically an API without an implementation. So Keras defines
different ways of defining deep neural networks. And everything at Keras.io
works in TensorFlow 2. Keras defines two model-building APIs: a
sequential and a functional. Sequential is for building
a stack of layers. Functional is for
building a graph. What Keras doesn't say. And you've seen these
things, dense and sequential. Keras doesn't say anything
about how you actually run this code on a GPU. If you do pip install Keras, you
get what you get at Keras.io. And automatically,
behind the scenes, Keras will install what it calls a backend, a tensor-processing library. So it will install TensorFlow, or MXNet, or CNTK, and call that library
to do the math. You never see it. In TensorFlow 2,
Keras is built in. And TensorFlow 2 is a superset
of what you get at Keras.io. So if TensorFlow 2
is installed, you can say from TensorFlow.keras,
import whatever you want. And any code you
find at Keras.io will work identically
in TensorFlow 2 just by changing an import. So instead of "import keras", it's "from tensorflow import keras". And that's it. So if you're new to
this stuff, Keras is famous for being
one of the easiest to use libraries and
the best documented. It's a perfectly good
place to start learning. Nothing you learn in Keras.io
is a waste of your time because it all works
identically in TensorFlow 2. Just so you know, Colab
has Keras installed also by default. So you have to
be careful with your imports. If you're importing
things from Keras and you see "Using TensorFlow backend" printed, that's a mistake. You don't want that. You just want to get your
imports from TensorFlow.keras. I put some notes on the slides
for you when I upload them. All right. So you've seen
sequential models. So that's a stack of layers. That, by the way,
existed in TensorFlow 1. It's the same in TensorFlow 2. It didn't change at all. Functional models are
what you would use to build a model that's a DAG. And so if you start learning about things like residual networks, where you have skip connections between layers, you can define them
using the functional API. There's a third method
that I'll talk more about, which is the subclassing API. And this feels a little bit
like object-oriented NumPy. This is very, very
similar to a library called Chainer and similar
to a library called PyTorch. And the way this
works is, here, we're defining our model
by extending a class. And this class
happens to be Model. It's provided by the library. You can write your own if
you don't like this one. And what we're doing
is, in the constructor, we're defining a
couple of layers. And in the call method, we're
defining the forward pass of our model or our layer. So if you call this
model on some data, you can see that the data will
pass through the dense layer. If you're curious exactly
what the output is, you can just print that out
because it's just Python.
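Here's a minimal sketch of that style:

```python
import tensorflow as tf

class MyModel(tf.keras.Model):
    def __init__(self):
        super(MyModel, self).__init__()
        # define your layers in the constructor
        self.dense = tf.keras.layers.Dense(10, activation='softmax')

    def call(self, inputs):
        # define the forward pass here
        return self.dense(inputs)

model = MyModel()
output = model(tf.zeros([1, 784]))
print(output)  # it's just Python, so you can inspect the result directly
```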
This is really, really great from the research side if you're defining new layers and stuff like that. You can interactively see how they work. All three of these model styles
can be trained in two ways. One way is model.fit,
which you've seen, which you should
always use unless you need to write custom code. The way to write a
custom training loop in TensorFlow 2-- and
we'll do this in a second with linear regression-- is
called the gradient tape. So here, what we're doing
is we're creating our model. And then, all these models are
trained by gradient descent. The way we get the gradients
is back propagation. The way TensorFlow will
give you the gradients for the weights in your model is
using something called a tape. What we start doing is we record
all the operations under this with block on a tape. It builds a computational
graph, and plays the graph backwards to get the gradients. But basically,
what's happening is we're calling the
model on some images. We're getting the
output of the model. We're computing our loss. And then, we're getting
the gradients of the loss with respect to all the variables in the model. And if you print these out,
these are your gradients.
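Here's a sketch of one training step written that way. It assumes you already have a model and a batch of images and labels:

```python
import tensorflow as tf

loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

# Record the forward pass on the tape, then play it backwards to get
# the gradient of the loss with respect to every variable.
with tf.GradientTape() as tape:
    predictions = model(images)
    loss = loss_fn(labels, predictions)

gradients = tape.gradient(loss, model.trainable_variables)
print(gradients)  # you can inspect these directly
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
```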
If you were doing research in optimization, and you're working on the
Rachel optimizer or the Josh optimizer, you can rewrite
this however you want. If you were doing just SGD, you
would multiply the gradients by a learning rate and
update your model variables. Or if you're doing
gradient clipping, it's really easy to write that. So this is a really, really
nice way to do auto-- basically, to do back
prop in TensorFlow. The best way to get
started with poking around with the gradients,
in my experience, is linear regression. So we're going to
skip this stuff. Let's take a look at-- the next
exercise is linear regression. But it's written the slow way. And so we're going
to pretend like we don't have dense layers. We don't have model.fit. Let's do linear regression
with a gradient tape. And this is good. So you can actually see what
the gradients are that you get. And let me see what
this notebook gives you. You might have to
clear the output. So it's bit.ly/tf-ws1. And I think the real power of
these deep learning libraries, it's-- regardless of which
library you're using, it's that they can do auto diff. Once you have an easy way
to get gradients-- here we're going to get them
for linear regression. Great. But almost with
exactly the same code, we get the gradients
for Deep Dream. So this scales up in a
really surprising way. So, tf-ws1. And probably, you're going
to need to clear the output. I think I forgot to clear it. But what we're doing here is
we're going to fit a model, y equals mx plus
b, to some data. So we created some random data. We're going to create a
model, y equals mx plus b. And we're going to use
two TensorFlow variables. Normally, you don't have to
write code at this low level unless you're doing
some sort of research. But here, we're
creating variables. This is the forward pass of
our model, y equals mx plus b. M is the slope. B is the intercept.
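Here's a sketch of the whole thing, close in spirit to what the notebook does (xs and ys stand in for the random data):

```python
import tensorflow as tf

m = tf.Variable(0.0)  # slope
b = tf.Variable(0.0)  # intercept

def predict(x):
    return m * x + b  # the forward pass: y = mx + b

learning_rate = 0.05
for _ in range(100):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(predict(xs) - ys))  # squared error
    dm, db = tape.gradient(loss, [m, b])
    m.assign_sub(learning_rate * dm)  # step down the negative gradient
    b.assign_sub(learning_rate * db)
```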
Our loss is going to be squared error. When you see names
in TensorFlow that aren't quite identical
to NumPy, that usually means there's a subtle
difference in how these work. And I think the
reason that it says reduce_mean, instead of just mean, is you can imagine
if you have a GPU, and you have a long list
of floating point numbers, and you're taking the average,
if the GPU doesn't guarantee the order in which
it takes the average, it's possible you'll have very
slightly different results every run based on floating
point arithmetic errors. So that's just trivia, why
that's called reduce_mean. Anyway, what I wanted
to show you is, at the end of this notebook, if
you're new to gradient descent, it makes this nice little
plot that you can look at. And what we're
seeing here, we're visualizing the
loss of our model as a function of the
slope and the intercept. And that's our starting loss. We get the gradient. The notebook will
give you the code to take a step in the negative
direction of the gradient. And down we go. And here's the
gradient tape loop that I showed you from
the slide in action. And what's cool is
if you run this, you can literally print out
the gradients for m and b and see exactly what
they are, which is cool. So why don't we take-- let's take like eight minutes
and poke around with this. So if you want to
run it from scratch, edit, clear all output. I'll start talking again
shortly, at like 4:25. So it's bit.ly/tf-ws1. Also, in case you're
new to back prop, let me just point you to
a really nice article. We don't have time to
cover it right now. But if you Google
for this, if you want to learn how
auto diff works, wonderful article by
Chris Olah, Calculus on Computational Graphs. And the reason I
like this article as a teaching tool for back
prop is it actually does it. So it has an example. It's not just like,
here's some equations. So it's really nice
in Chris's article. What he does is he
builds up a very simple computational graph. And this is what TensorFlow
does, too, behind the scenes. This is a computational
graph for-- we're doing like a plus b times
c, or something like that. And he'll build the graph. Show you the forward
pass and the backward pass to get the gradients. It's really, really nice. So, Calculus on
Computational Graphs. By the way, if people are
getting an error message with length, Colab has
TensorFlow 1 installed on it by default. And
we'll get rid of that as soon as TensorFlow 2 is out. So if you're getting-- tensor has no property length. The very first cell will
install TensorFlow 2. And you'll have to run that one. And then, that error
message should go away. So let me briefly explain
gradient descent and gradients. And so I'll do this in two ways. So one is the numeric gradient. And the other is the
analytic gradient. Basically, deep neural
networks work the same way as linear regression
in this sense. You always start--
if you're doing a deep neural network
or linear regression, there are two things you need. The first thing you
need is a model. Here, our model is
y equals mx plus b. With a neural
network, our model is going to be Keras, sequential,
dense, dense, dense. It's a much bigger model. Same thing. When you call the model,
that's called the forward pass. You take some data, pass
it through the model, get a result. The next thing you
need, which is very important, is called a loss function,
which is synonymous with error. And all that is, is
a way to quantify how bad of a
prediction you've made. In linear regression, the loss
function is our squared error. So for the entire training
set or whatever data we forwarded through the model,
we take the point we predicted, which is the blue line,
subtract the point we wanted, which is the blue
dot, square it, and we sum that up over
the whole training set. The point is loss
is just a number. In classification, we'll use
something called cross entropy. But it still gives
us just a number. And this is gradient descent. As soon as you
can plot your loss as a function of your
variables-- linear regression, there's two-- the
slope and intercept. Deep neural networks,
there might be a million. But the concept is identical. We're almost done. Because our loss
quantifies how bad of a job we're doing, if we
minimize the loss, that means we have a good model. So we want to go down the hill. Deep neural networks don't have
a global minimum like this. Or they're not convex like this. This is the special case. But we'll get to some minimum. There's a concept in
calculus called the gradient. And the gradient is a vector
of partial derivatives that points uphill, which is
why the negative gradient is the direction that
points downhill. The good news is if you haven't
taken a calculus class in 20 years, and you don't
remember what that means, you can sort of
understand it intuitively. So loss is a function
of our variables. The gradient looks at each
variable independently. So let's just look at b. Our variables are just numbers. There's only two things
we can do to a number. We can make it bigger. Or we can make it
smaller by some amount. If you forget calculus, you can
calculate the numeric gradient like this. For each variable in your
model, make it slightly bigger. Recompute your loss. Then make it slightly smaller. Recompute your loss. Figure out which way
makes your loss go down. Well, actually in this case, the
way that makes your loss go up is the gradient. And the negative
gradient is the direction that makes it go down. So you wiggle each one a
little bit and recompute it. That gives you the direction.
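Here's a sketch of that wiggling in plain Python. loss_fn stands in for whatever computes your loss from one variable:

```python
def numeric_gradient(loss_fn, value, eps=1e-4):
    up = loss_fn(value + eps)       # wiggle the variable a little bigger
    down = loss_fn(value - eps)     # wiggle it a little smaller
    return (up - down) / (2 * eps)  # which way, and how fast, the loss grows

# For example: squared error of the prediction 2*3 + b against a target
# of 10, as a function of the intercept b, evaluated at b = 0.
print(numeric_gradient(lambda b: (2 * 3 + b - 10) ** 2, 0.0))  # about -8.0
```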
The problem with just doing this numerically is if you have a
million variables, you have to do a million
forward passes of your data. So this is really slow. If you remember
tricks from calculus, you can get it in
time that's linear in the size of the
number of nodes on the computational graph. So basically, calculus is a much
faster way to get the gradient. But the point is, regardless
of how you compute it, the gradient descent
step is easy. That just means
apply the gradient. Literally, take a step. So, wiggle your
parameters a little bit. Get the gradient
again, and again, and again, and again, and again. And so neural networks
are trained identically. All right. Really, really
quickly, I just want to look at some of the
building blocks of these DNNs. So basically, you'll see
this like a billion times. There's cartoon
diagrams of a neuron, which I like to think of as
a little logistic regression unit. So here, what we have
is some input data. These could be
pixels on an image. Each pixel is being
multiplied by a weight. We sum it up. We apply non-linearity. And that gives us the
output of one neuron. I don't like this diagram. I don't like the math, either. But you can look at the math. It's a sum of the inputs
multiplied by the weights, and then a non-linearity. But let me show
you a diagram that makes a little bit more sense. So here's the way I like
to start thinking about it. So here's a diagram that
corresponds to that. Here's the diagram
of our little neuron. And here's what's
happening when it actually computes on some data. So we have an image. Let's pretend this
just has four pixels. And we'll pretend
it's black and white. Ignore the colors. The Flatten layer that
we've been working with unrolls that image
into an array. So after we flatten it,
here's the pixel values from that image. Here, we have four pixels. So we have four weights. I ran out of room. So there should be
four inputs up there. But I just drew three. What we do is we
do a dot product of the weights and the inputs. We add a bias. And we get a result. So what a single neuron-- we haven't done
non-linearities yet. What a single neuron
is doing is giving you a score for something. And you can think of this
neuron as telling us how plane-like is that image maybe. What's nice is we
can start adding-- see how this is already
starting to look like a little neural network? Now instead of one neuron,
we have a dense layer. All we had to do to
get a dense layer is we added one more output. Actually, we could
have had a dense layer. In Keras, you could have written
exactly what you see here. It's model.add(Dense(1)) for one neuron. This would be Dense(2). And what we have here is
now we have two outputs. Adding a second output
because it's fully connected, it means we've added a
second set of weights. And what's really nice
about this is instead of a dot product, we're
doing a matrix multiply. So the forward pass
of one dense layer is one matrix multiply,
which is really, really nice. And other cool things
you can do, too. Here, we're multiplying. We're classifying
one image at a time. We can also, still with
one matrix multiply, classify multiple
images at a time. And what we've done here is
we've added a batch of images. And here we have two. And what I'm trying
to show you here is still a matrix multiply. But now we get scores
of multiple images at the same time. And so we're classifying
two images at once. And this is what a
dense layer is doing.
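To make that concrete, here's a tiny sketch in NumPy (the shapes are illustrative):

```python
import numpy as np

# A batch of two images, each flattened to four pixels.
x = np.array([[0.1, 0.2, 0.3, 0.4],
              [0.5, 0.6, 0.7, 0.8]])
W = np.random.randn(4, 2)  # one weight per pixel, per output
b = np.zeros(2)

scores = x @ W + b  # shape (2, 2): a score per image, per output
```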
If you look at model.fit-- let's see if this works. I'm not connected. That's why. You can look at the documentation for all these little different
methods we're calling. And you'll see that one
of the parameters you can set inside model.fit
is the batch size. And the batch size
in TensorFlow, if you're using
these Keras APIs, defaults to 32, which is fine.
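As a sketch, setting it explicitly looks like this (x_train and y_train assumed from earlier):

```python
model.fit(x_train, y_train, epochs=5, batch_size=32)  # 32 is the default
```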
When you're doing gradient descent, the larger your batch size, the more accurate of an update
the more accurate of an update you're going to make. But the slower it is to compute. A batch size of one would
be one example at a time. That's stochastic
gradient descent. A batch size equal to the
length of your training set would be batch gradient descent. And what everyone
does in practice is a mini batch, which is
a number greater than 1 and less than the
size of your data set. 32 is usually what you want. But the point is
matrix multiply. To get a neural
network from that, you just need one
more dense layer. So you need a non-linearity. And you need a dense layer. The intuition for
the non-linearity, I guarantee you some of you
have a much better sense of this than I do. I don't like this. But I'll show you a
demo of how it works. So to get to a neural network,
we just need two more things. We have our matrix multiply. We have a non-linearity. And we have another dense layer. There are a bunch
of non-linearities. A lot of you have an
awesome math background. If you're multiplying
a series of matrices without the non-linearities,
that reduces to multiplying by just one matrix. So there are a stack of
different activation layers you can add. Some of the ones originally
used were things like sigmoids. Now a good default
would be ReLU. I know these are tiny diagrams. Sigmoid looks really nice. It takes a number
and squashes it to be between 0 and 1, which
makes a lot of intuitive sense but has really bad properties
for gradient descent. And the bad
properties when you're using sigmoid
activations-- and these weren't understood for a while. If you have a very large
value or very small value going into a sigmoid,
and you think about the derivative
of a sigmoid, it flattens out
towards the extremes. So using sigmoids can cause your gradient descent to run very, very slowly. Later, it was found
that ReLU, which looks a little silly-- it's basically an on/off switch-- will make your models train much faster. So the good news is applying the non-linearities is simple. They're applied elementwise. So here, maybe we've
done our dense layer. We've done the matrix multiply. And if these are the scores we
got, we can apply ReLU to them like this. It's just going to be, if
it's less than 0, it's 0. If it's greater than 0, it
just passes through unchanged.
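In code, that's just an elementwise maximum:

```python
import numpy as np

scores = np.array([-2.0, 0.5, 3.0])
print(np.maximum(0, scores))  # [0.  0.5 3. ] -- negatives become 0
```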
So that's how you would apply ReLU to the output of your matrix multiply. And in Keras, this would be model.add(layers.Dense(3, activation='relu')). And then to get a neural network, you just need one more dense layer on top of that. Basically, on these
slides, if you want to poke around with why
you need the non-linearities, I linked some code that
trains the deep neural network without non-linearities
and tries to classify this data set. If you delete the
non-linearities, it gives you a linear
decision boundary. If you add the
non-linearities, it gives you a nonlinear decision boundary. And I'll show you a demo
of this in a second. But basically, the idea is
if you forget your ReLUs-- here's some DNN. If you take these ReLUs out, if you write None instead of ReLU, this has the same power as this network right here. The intermediate layers do nothing. And let me show you
a quick demo of this. There's a cool website. It's playground.tensorflow.org. And this is a little neural
network running in the browser. And this was before
TensorFlow.js, just FYI. And it's sort of a funny thing. It's awesome, and
really powerful, and horribly documented,
and can be a little bit hard to understand. But basically, if I
delete the hidden layers, we're looking at a single
dense layer or one neuron. And if we pick a linear
data set, and I hit Play, we can classify the data
set with our neuron. If I have a nonlinear data set--
here we have these two circles, blue dots in the center,
orange dots outside-- we can't split the thing. We can't draw a
line to split them. If you add a hidden layer,
now we have a neural network. And the hidden layer will
do feature engineering. And it will-- I don't have a slide for this. But we'll just skip it. There's a trick you can use to
classify a nonlinear data set with a linear layer. And that's if you do
feature engineering. But it doesn't matter. The neural network is
doing feature engineering to let us classify the data. If you delete the
activations, though, so if we switch the activation
to linear, which is none, our neural network can't do it. And so you have to have the
activation functions to have the hidden layers do something. All right. Really quickly, and then
we'll do some more code. There's just two concepts that
I wanted to briefly mention, because we have alphabet soup. So the output of a dense
layer is just some scores. And after we apply the
activation function, we still just have scores. Usually, when you're doing
classification, what you want are probabilities. So there's a
function that you'll see at the end of your
networks called softmax. And softmax takes scores. And it returns a
probability distribution. So that's what softmax is doing.
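For example:

```python
import tensorflow as tf

scores = tf.constant([2.0, 1.0, 0.1])
probs = tf.nn.softmax(scores)  # exponentiate and normalize
print(probs)  # roughly [0.66, 0.24, 0.10] -- sums to 1
```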
The other thing you'll see is, in linear regression, the loss function
is squared error. When you're doing
classification, the loss function is
usually cross entropy. And all I wanted to say right
now, when you see the term cross entropy,
what you're saying is compared to
probability distributions. So softmax gives us scores. And we need to compare
those scores to the thing that we wanted. So this is called
a one hot encoding. And let's say we
were classifying this image of a bird. And maybe there's 10 possible
outputs for the image. Our label, or the value that
we want for the bird, is-- let's say 2 corresponds to bird. So we have a 1 here and
zeros everywhere else. That's the probability
distribution we wanted. This is the probability
distribution we got from making a
prediction with our model. Cross entropy will compare
these and return a number. So it's another loss
function, just FYI.
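Here's a sketch with made-up numbers, using the bird example (class 2):

```python
import tensorflow as tf

y_true = tf.constant([[0., 0., 1., 0., 0., 0., 0., 0., 0., 0.]])  # one-hot: bird
y_pred = tf.constant([[.05, .05, .70, .02, .03, .05, .02, .03, .03, .02]])
loss = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
print(loss)  # a single number: -log(0.7), about 0.357
```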
All right. So here's another notebook. And then after this, we'll do convolution, which is much more interesting
than these dense layers. So this is another
notebook where you're going to write a neural
network for Fashion MNIST. And this is dense layers still. The link is ijcai, this
time, underscore one dash a. Let's take 10 minutes. And you can hack on that. By the way, the
goal there was not to give you all the details
of softmax or cross entropy. It's just so you know,
OK, that's ballpark what those terms are trying to do. And you can go from there. Oh, I just wanted to mention
so you don't get stuck on this. The goal of this
notebook was just to briefly introduce tf.data. So you're seeing, instead of-- when you're
pre-processing images, Keras has awesome, really
thoughtful, easy to use pre-processing utilities. Things like flow from
directory, data augmentation. They're wonderful and awesome. They work in TensorFlow 2 also. The goal of this notebook
is to show you a lower level way to do it, which is why we're
using things like data set map and writing your own
pre-processing functions from scratch. Just in case you're
stuck, the first step is to batch the data. And the way you batch
the data is just nice. If you're seeing you can't-- ah. So my friend wrote
this in Google Drive. If you can't edit
the notebook, it's because it's not on GitHub. You have to click on
Open in Playground. And that will give
you a copy of it. But let me just show you
how to do the batching step. Just for step one, you can
just call .batch() with the batch size. So that's all you need. All right. So continuing our warp-speed
intro to deep learning. So, convolution. Basically, you'll
hear a lot about CNNs. And convolutional neural
networks are way more-- they're much better suited
to image classification than dense networks. And I'll briefly explain why. So first of all, convolution. Not a deep learning concept. And you'll see this a
lot in deep learning. I know some of you have
an electrical engineering background. You'll know way more about
convolution than I ever will. In deep learning, we take
concepts from other fields. And we use [INAUDIBLE]
kind of remedial way. So first of all, convolution. Not a deep learning concept. And I have some code
that I wrote with SciPy. And we're going to convolve
over a picture of an astronaut to detect the
edges on the photo. And quickly, does
anyone know who the picture of the
astronaut is in SciPy? Who got built into SciPy? What do you have to do to become part of SciPy? Anyway, that's Eileen Collins. And she was the first woman
to command the space shuttle Columbia, which is where
I stole the slide from. So anyway, the way we're going
to detect edges on Eileen is we're going to use
a filter or a kernel. Things in deep
learning often have like five names for no reason. So we're going to use
a filter or a kernel. The brief idea is there are
nine numbers, eight of which are negative one, one
of which is eight. Same number of negative
ones as the eight. If we put the kernel on
top of the image, and we do the dot product of the values
in the kernel with the pixels. And there's just code in SciPy you can look at later. Let me show you
what I mean by that. So here's our image. And here's our
kernel or our filter. And the way we can convolve
or we slide over this image is we stick the filter
on top of the image. We take the dot product of the
filter and the image values. And we write it in
the output image. And then, convolve
literally means slide. Slide dot product output,
slide dot product output, slide dot product output. And so we get an output
image by convolving.
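Here's a sketch of that edge detector. One note: the astronaut photo actually ships with scikit-image, so this uses that for the image and SciPy for the convolution:

```python
import numpy as np
from scipy import signal
from skimage import color, data

kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])  # eight -1s around a single 8

image = color.rgb2gray(data.astronaut())  # Eileen Collins, grayscale
edges = signal.convolve2d(image, kernel, mode='same')  # slide, dot, write
```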
And in CNNs, the filter values are learned exactly like parameters inside
dense layers are learned. So they're learned
by gradient descent. They start life as
small random numbers. And what's interesting
about convolution is if you have the right
numbers for the kernel, you get really powerful things. So this is an edge detector. And this is the way
Photoshop works as well. It's convolution to detect
edges, to blur images, to sharpen images. The difference is
in Photoshop, they have these really nice kernels
that are very carefully hand designed. This is like the crappiest
one you can write. But it works. And here's the difference between convolution and dense layers: this is already much more
powerful than a dense layer. So with just nine
numbers, we can find edges anywhere on the image. To do that with a dense
layer, a dense layer would have to separately
learn to detect edges at every location in the image. So this little thing has the
same power as something like Dense(1000). So, much more efficient. It's slower because we have
to convolve it and slide it around the image to do the math. But it's much more
efficient in terms of the number of parameters. So this is a big deal. What's great is in
deep learning-- well, first of all, here's how you use
convolution inside TensorFlow. You can write a little
convolutional layer. And I'll explain
what this means. Here, we have some layer
that's going to take an input image as input. The input image is going
to be 10 by 10 by 3, meaning it has three color
channels-- red, green, and blue. I pulled this out of a hat. We're going to learn a
filter that's 4 by 4. The larger your filter is, the
more sophisticated the features it can detect, but the slower it is. Common filter sizes
are not 4 by 4. They're usually
3 by 3 or 5 by 5. I stole these slides
from a friend. I had to change the kernel
size to match the slides. And we're going to learn four filters. And I'll show you what that means in a second.
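In code, that layer looks roughly like this sketch (the shapes here match the slides, not what you'd usually pick):

```python
import tensorflow as tf

# A 10x10 RGB input; we learn four 4x4 filters.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(filters=4, kernel_size=(4, 4),
                           activation='relu',
                           input_shape=(10, 10, 3)),
])
model.summary()  # shows the output shape and parameter count
```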
So here's convolving in 3D. And this becomes very
powerful very quickly. So, convolution in 3D. Instead of having a 2D filter,
we now have a 3D filter. And already, this gets
a little bit harder to wrap our heads around exactly
what this filter is doing. But it's basically looking at
every color channel separately. The good news is we can
convolve in exactly the same way that we can convolve in 2D. So we stick the
filter over the image. We take a dot product here. And we write that down
as the output value. And I'm skipping
things like padding, and stride, and stuff like that. But basically, if you
do a lot of sliding, take a lot of dot products, you
end up with an output image. What this is called, this
is an activation map. So this is showing
you the regions where the filter was
most strongly activated. And that just means the
dot product was high. So if it was an edge
detection filter, this would be the
locations of the edges. What's nice, again, these
filters are learned. This is also an image. There's no reason that we can't
just stick this in matplotlib and display it as an image, like we did with the results of convolving over Eileen. And so you can visualize very
easily exactly what the output is of all these filters. So that's a nice property. And then it gets powerful. If we add another filter,
we get another output image. And all the filters are learned,
starting from random weight initialization. So hopefully, they'll be
detecting different things. And here's what's cool. I guess I deleted the
slide accidentally. But you can imagine if we
had 4 output filters, or 10. Let's say we had
4 output filters. That would mean
we've gone from a 10 by 10 by 3 image to a 10
by 10 by 4 output image. So we've left color space. And we've entered
activation space. And the hope is that
this will learn edges in some orientation,
edges in other colors, different kind of colors. So we're getting
maps describing where features are in the image. And it starts getting
powerful very quickly. It's when you add a second
convolutional layer. And the important thing is the contrast with convolutional layer one: if you're a filter
in this layer, you have to look at pixels
and compute features. But if you're a
filter in this layer, you get to look at the features
this guy already computed and compute features of them. So if these are
edges, maybe you'll learn to detect shapes,
which is really cool. And here's how this works. So let's say in our first
convolutional layer, we learned four filters
pulled out of a hat. The next convolutional
layer looks through all four activation
maps of the previous one. If we had learned 32 activation
maps in the last layer, these filters would look
through all 32 of them. So these filters are
really, really powerful. Basically, they're taking
dot products of features, different types of features. The things they can
compute are very powerful. And they get powerful very fast. But again, convolution works
in exactly the same way. Every filter produces a
single activation map. And if we had eight of them,
we'd get eight activation maps. So what happens is, basically,
the image gets deeper. And then, as I've
drawn it here-- I didn't have time to talk
about things like max pooling. But there's ways you can
make the image-- basically, this is a big chunk of image. And it's slow to convolve over. As a way to speed this
up, you might see things like max pooling layers. And what max pooling
layers are, they reduce the width and
the height of the image. But they leave the
depth unchanged. So basically, what
max pooling will do-- one funny thing you'll
learn, by the way, if you start
teaching this stuff. There's like two or three-- more than two or
three-- but there's a small number of people that
make really excellent diagrams. And every other class in
the world steals them. The best example of this: I'd say like 95 percent of the classes I've seen have borrowed this diagram-- you've all seen it, yeah? It's written by Chris Olah. And it's the same thing with
max pooling from Stanford. But anyway, what max pooling
does is just taking-- this is max pooling of two. So what we're saying
is this is too hard to process computationally. We want to hack to
make it smaller. What we're going to do is
for every 2 by 2 region, we're just going to
copy out the strongest activation to the output. So this reduces the
image size by 75 percent. It's lossy, but whatever. That's max pooling.
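Here's a tiny sketch of max pooling of two on a made-up 4 by 4 image:

```python
import tensorflow as tf

# Every 2x2 region is replaced by its strongest activation,
# so width and height are halved (75 percent fewer values).
x = tf.constant([[1., 2., 5., 6.],
                 [3., 4., 7., 8.],
                 [9., 2., 1., 0.],
                 [5., 6., 3., 4.]])
x = tf.reshape(x, (1, 4, 4, 1))  # (batch, height, width, channels)
pool = tf.keras.layers.MaxPooling2D(pool_size=2)
print(tf.reshape(pool(x), (2, 2)))
# [[4. 8.]
#  [9. 4.]]
```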
And then, yeah, we talked about that earlier. What I want to do is talk
about Deep Dream really quick. Anyway, one question
you might have is, by the time you get to layer 17,
what are these filters actually responding to? Before we write a CNN, let
me just talk about a couple of things you can do with this. So here are three things
you can do with CNNs. The first is you can
write one from scratch. And that's the next exercise. And that's writing a model
in Keras: Sequential, convolution, max pooling, convolution, max pooling, dense. And that's great.
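A minimal sketch of that model, assuming CIFAR 10's 32 by 32 color images and 10 classes (the layer sizes here are just reasonable defaults):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu',
                           input_shape=(32, 32, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation='relu'),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
```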
The other thing you can do is transfer learning. And this is a really,
really powerful concept. So, the idea of
transfer learning. Usually, in machine learning, you
have a small amount of data. Let's say your friend in
the past had a lot of data. And maybe she trained
a convolutional network on ImageNet from Stanford. So ImageNet: roughly a million pictures in 1,000 different classes; it takes about a day to train a model on it. So let's say she trained it. And then you wanted to reuse
her model to train your own. Instead of starting from
scratch, what you could do is, let's say this is her model. And this dense layer at the
end is classifying things from ImageNet. So, cats, dogs, snakes,
peacocks, whatever. Let's say you have
Hondas and Toyotas. To do transfer learning,
you delete this dense layer. You keep the rest of the CNN
that she previously trained, unchanged. You add your own dense layer
with outputs just for the classes that you care about. And then you relearn
just this dense layer. But you leave the
convolutional base unchanged. And the idea here is you use
her CNN as a preprocessor. So it takes an image. It gives you good
features for the image. And then you learn a dense
layer using those features. That's called transfer learning.
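In code, a sketch of that looks something like this; MobileNetV2 and the input size are just example choices:

```python
import tensorflow as tf

# Her trained CNN, minus the dense layer on top.
base = tf.keras.applications.MobileNetV2(input_shape=(160, 160, 3),
                                         include_top=False,
                                         weights='imagenet')
base.trainable = False  # leave the convolutional base unchanged

# Add your own dense layer for your classes (say, Hondas and Toyotas).
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation='softmax'),
])
```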
It's a really, really interesting idea. The idea is using
knowledge that you've learned on a previous
task on another task. You might know of
other examples of this. The only one I know that works--
well, that's no longer true. This works really
well for images. And it's starting to
work really well for NLP. But that's very, very recent
with models like BERT. But I bet there's more
potential here, too. The third thing you
can do with convolution is trying to understand
what these filters do. So basically, let me see what we
have here because of the time. I'm just going to talk about
Deep Dream for a minute. And then I'll give
you some exercises you can do to write a CNN and
then to do transfer learning. So here's the idea
with Deep Dream. All right. Has anyone seen Deep
Dream before? Does anyone know why Deep Dream exists? Was the goal of Deep Dream
like, let's smoke too much and generate psychedelic
images from neural networks? So Deep Dream is-- so far, in a really hand-wavy way, I've been like, trust me. We get this magical
feature hierarchy. It's going to be great. And Deep Dream is a way to
actually show that this exists. So it's a way to investigate
the representations learned by a neural network. So basically, let me
show you the results and then explain what they are. Just in terms of terminology. By the way, when you add these
layers, you can name them. So if we wanted to, I could
put a parameter: comma, name equals 'layer_1', or whatever.
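A quick sketch of naming a layer and finding it again by name:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(4, (4, 4), name='layer_1',
                           input_shape=(10, 10, 3)),
])
print(model.get_layer('layer_1').name)  # layer_1
```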
This layer has four filters, each of which is 4 by 4. And let me show you what the
authors of the Deep Dream paper-- before I
go into Deep Dream, let me show you what the
filters are learning to detect. So each of these is one filter
in the first convolutional layer in a neural network. And these images were produced
by starting with random noise and modifying the noise
until the filter is maximally excited. So this is an image. If you were the first filter in
the first convolutional layer, and you saw this, you
would produce your highest possible activation. And in layer 1, we're
seeing that filters are responding to different colors. Right? This layer probably, by the
way, looks like Conv2D(32, (3, 3)) or something like that. So there might be 32 of
these individual filters. They respond to different
colors, and edges, and different orientations. So that's the first
layer of a CNN. The names here are a little
funny in this network. But as we go deeper,
these are the images that the filters in the next
layer get really excited by. And already, they're
getting a little complex. These are like texture-y, right? As you go deeper
into the network-- I'm not going to go
through all of them. It'd take forever. They get more and more
complex as you go. And what's interesting
is if you start poking around really
deep, they start to look like things
we recognize. So we see like peacocks, and
feathers, and cool textures. And I don't know what
all this stuff is. The reason that we're seeing
these particular images is this model was
trained on ImageNet. And these are the
features that it found to be useful to
classify ImageNet images. Presumably, if the image had
features that looked like this, it might be a bee. And the reason that you see
these features tessellating along the image is probably
because convolution does this slide-y
operation where you apply the filter
in different regions. And then if you go
really deep, you see things that start
making sense to us, like saxophones, broccoli,
and who knows what. But anyway, let me
explain how we get these. And this is what
Deep Dream is doing. First of all, there's
two things you can ask. One is you can say let's
find the image that excites one filter from some layer. And that's what
we're doing here. Two, you can say let's
find an image that excites the entire layer. So if this is layer 5, let's
make this layer as excited as it can be. And here's how that works. It's a really,
really powerful idea. And the code is
surprisingly short. So in Deep Dream, you
start with the picture. Great. The next thing you
need is a model that was trained on a
large data set of images. And it doesn't matter
what model you use. Here, we're importing. This is transfer
learning, almost. We're importing a
model called Inception. There is a-- if you
learn more about CNNs, there's a box of famous models. There's things like VGG, Inception, ResNet, whatever. Inception is one of them. And these are all
different architectures. And what I mean
by an architecture is, basically, when you do
model equals sequential, add dense, add dense, add
dense, that's an architecture. This would just be a
fancier architecture built with a functional API,
or the subclassing API, or whatever. But that's Inception. It's some CNN with
some fancy bits added. If you were doing
transfer learning, this line here says give me the
CNN but not the dense layer. And later, you could add
your own dense layer to this. Here, we're saying
give me the weights that we previously
learned on ImageNet. So this is a trained model.
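Roughly, that line is this sketch:

```python
import tensorflow as tf

# A trained Inception model: no dense layer on top,
# weights previously learned on ImageNet.
base_model = tf.keras.applications.InceptionV3(include_top=False,
                                               weights='imagenet')
```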
The next thing we're going to do is take an image. And we're going to pass
it through the model. And what we want are the
activations at a certain layer. So our goal is going to
be to modify the image to get those activations
as high as possible. And as we modify the
image, presumably we'll add more things that-- whatever that layer is detecting
we want to appear in the image. So because the layers
in this model are named, we can look at the summary
and find the names we want. And then here, we're
using the functional API just to write a new model
where we pass in an image. And we get the activation
maps out of the layers. And if you pass an image
through this and you run it, you'll see a bunch of
matrices, which are literally the output of the convolutional
filters at those layers. And there's going to be a lot of numbers.
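A sketch of that step; the layer names here are just examples, so pick real ones out of base_model.summary():

```python
import tensorflow as tf

names = ['mixed3', 'mixed5']  # example layer names from the summary
layers = [base_model.get_layer(name).output for name in names]

# A new model mapping an input image to the activation maps
# at the layers we picked.
dream_model = tf.keras.Model(inputs=base_model.input, outputs=layers)
```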
But here's how Deep Dream works. And this is kind of magical. Some of this code
is boilerplate. But you always need
a loss function. And the loss function
here is just this: we're summing up
all the activations. So if we want to find an
image that excites this layer, we want to maximize this
list of activations. So we literally sum them. Or here, we're taking the mean. But same thing. And that's almost it. Just so you know, we
updated this like yesterday. So I haven't actually seen
this brand new version yet. But I can give you the idea. So what we do. When we call this, we're
going to pass some image through a model. Inside here, we get the
list of activations. And this is the sum or the
average of the activations. We need to maximize this. And here's the insight
behind Deep Dream. So normally, when you have
a deep learning model, you adjust the weights on
your models to fit the data. Here, we're going to
leave the model alone. And we're going to adjust
the image to fit the model. So we get the gradient
of the loss with respect to the pixels of the image. And if you print this
out, these gradients will have exactly the same
shape as the image, which means you can directly
add them to the image. And we want to do
gradient ascent because we want to make the loss go up. And so there's some
normalizing code here. But the important part,
they changed this slightly. But it's right here. We're doing gradient ascent. So we're adding the gradients
to the image multiplied by a learning rate. And at every step of
this, it's amazing. That's all the code you need
to Deep Dream-ify an image.
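In sketch form, one step looks like this, assuming the dream_model from above and a float image tensor:

```python
import tensorflow as tf

def dream_step(image, learning_rate=0.01):
    with tf.GradientTape() as tape:
        tape.watch(image)
        activations = dream_model(image)
        # The loss is just the mean activation of the chosen layers.
        loss = tf.reduce_sum([tf.reduce_mean(a) for a in activations])
    # Gradients of the loss with respect to the pixels, not the weights.
    gradients = tape.gradient(loss, image)
    gradients /= tf.math.reduce_std(gradients) + 1e-8  # normalize
    # Gradient ascent: add the gradients so the loss goes up.
    return image + gradients * learning_rate
```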
So if you picked a layer that responds to sheep, those gradients will make your
image slightly more sheep-like, which is nuts. And there's two versions
of the Deep Dream tutorial. So the first half
of it, we tried to write the minimum amount
of code to make it work. And that produces these
slightly staticky images. The second half
of the tutorial-- that's the research-y side. The second half, which is a
little bit more complicated, has different tricks to make
them really high resolution and stuff like that. But the point is I just wanted
to talk about Deep Dream after convolution because
it really proves the point. I really like it. It means this isn't BS. And these layers are actually
learning this feature hierarchy. And we can see it. And we can reuse it. And it's cool. Anyway, let's do
this for 10 minutes. I know it's fast. But the goal here is to start
writing a CNN for a data set called CIFAR 10. And CIFAR 10 is in
the MNIST family. It's a small data set. But it's color. So it's a little bit
more interesting. And there's a
reference tutorial you can look at which has background
on how convolutional layers work in TensorFlow 2.
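Loading the data, by the way, is basically a one-liner:

```python
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]
print(x_train.shape)  # (50000, 32, 32, 3): small color images
```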
OK, so let me point you to one or two more things. I'm going to take like two
minutes and just point you to-- we've spent a lot of the summer
working on the tutorials. So let me just point you
to some of the latest ones, just to save some time. So basically, for
transfer learning, we have two different
tutorials you can check out. And let me explain why we
have two different ones from an industry perspective. So basically, there
are two repositories of pre-trained models
in TensorFlow 2. One list of pre-trained
models, which are awesome, are the Keras applications. And if you Google around
for Keras applications, there are these one
liners where you can import a lot of famous CNNs
with usually weights trained on ImageNet. So a lot of our tutorials
are using MobileNetV2. By the way, here's a really
simple line of research that's interesting. Previously, with CNNs, the goal
was how accurate of a model can we train. The new goals today are often
how small of a model can we train that's accurate enough. And the goal is to get it
fast enough to run on a phone or in a web browser. It's not rocket science. Basically, what people
do is they do experiments with different
numbers of layers. They look at like the
accuracy-speed trade-off. Anyway, so one tutorial has
these applications from Keras. They're great. The other has a
larger repository of pre-trained models
from TensorFlow Hub. And TensorFlow Hub is a more
recent collection of models. And we're working on expanding
this for TensorFlow 2. The truth is it doesn't really
matter which one you use. Sometimes, big companies
build two of everything and see which one works better. Keras applications are older. And they existed
before TensorFlow 2. They're great. Anyway, you can try either. And whichever one you
feel is easier is the one that you should use.
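A sketch of the Hub version; the handle below is just an example, so check tfhub.dev for a current feature-vector model:

```python
import tensorflow as tf
import tensorflow_hub as hub

feature_extractor = hub.KerasLayer(
    "https://tfhub.dev/google/tf2-preview/mobilenet_v2/feature_vector/4",
    input_shape=(224, 224, 3), trainable=False)

model = tf.keras.Sequential([
    feature_extractor,
    tf.keras.layers.Dense(2, activation='softmax'),  # your classes
])
```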
So that's transfer learning. A really cool thing today is GANs. And TensorFlow 2 has really,
really awesome tutorials for GANs. We've got three of them, plus
a VAE, plus an adversarial example tutorial. Well, actually, let's
just look at the GANs. That is not a GAN. Has anyone worked with GANs? A couple. All right. So basically, for people
that are new to GANs, they ask a really hard question. So everything we've looked at
so far is, here's a picture. Classify the picture. The question GANs ask is,
generate me a picture. And the goal is to generate
a picture that looks real. And the challenge
with generating things with deep learning is that
we need a loss function. So everything we
do in deep learning is we're doing gradient-based
optimization against some loss function like squared error,
or cross entropy, or maximize the activation of some layer. It's hard to get a loss
function for generating cats. And the way GANs
work comes from research by Ian Goodfellow in 2014. And what Ian realized is
we can get a loss function for generating images for free. And we already have it. It's an image classifier. And so if you train an
image classifier to say, is this image of a
cat real or fake? That's just a standard
convolutional network. You can train a second CNN
to generate images of cats. And you can train them
against each other. And so you have this game
where you have a generator and a discriminator. And they're trained in parallel. And basically, over
time, the generator learns to generate more
realistic pictures of cats. And the discriminator
becomes better and better at telling real cats
apart from fake cats. Over time, they hopefully reach
equilibrium, at which point you can generate
pictures of cats.
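In sketch form, the two losses in that game look roughly like this:

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_output, fake_output):
    # Call real images 1 and generated images 0.
    return (bce(tf.ones_like(real_output), real_output) +
            bce(tf.zeros_like(fake_output), fake_output))

def generator_loss(fake_output):
    # The generator wins when the discriminator calls its fakes real.
    return bce(tf.ones_like(fake_output), fake_output)
```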
And this tutorial here is the minimum amount of code you need to train a GAN, or
generative adversarial network to generate images of MNIST. And that is a visualization
of the MNIST images being generated over time. And if you look at this, it
will look very, very similar. There's two chunks. Chunk 1 is the discriminator. And if we had more
time, this would have looked almost
identical to the image classifier you would have
written on the last exercise. The discriminator is just
a run of the mill CNN. The trick is the generator. And when you're new
to deep learning, you'll start looking
at code like this. And you'll recognize
some layers. And you won't recognize others. So let me just walk through
what some of these layers are. And by the way, the best
way to go through these. The papers are linked at
the top of the tutorials. If you read the paper and track the code at the same time, it's much, much easier
than just the paper. So basically, you'll
see layers like dense. You've seen that. Leaky ReLU is a friend of
ReLU with just slightly different properties. This would work
fine with ReLU also. You haven't seen
batch normalization. And you haven't seen
these Conv2DTranspose layers. Let me see if I have slides on Conv2DTranspose really quick. So one challenge
with GANs is we don't want to generate the
same image every time. Otherwise, the
discriminator would just learn that that's
the fake image. So we need to randomly
seed the generator. So the way the
generator in a GAN works is: here's some random numbers. Use these to parameterize
the image that you generate. Maybe the first random number
tells you how one-like it is. And the second one tells
you how two-like it is. And what the GAN has
to do in the generator, it has to go from a list
of numbers to an image. And we usually do up
sampling to do that. Conv2DTranspose is
an up sampling layer. There's two ways
to do up sampling. One is you can just double
the size of the image and average the pixels. Two is we can do a
learned up sampling. And what I'd recommend doing
as you're going through layers, you see layer like this. Take a few moments
or a day or two. And dig around. And try and understand
what it's trying to do. And so when I was going
through Conv2DTranspose, I was like, what the hell is that? And so what I usually do is-- these are slides from
the summer workshop. What I usually do is
try the simplest mini example of the layer and just work out an example to see how they look. So this is
convolution transpose. And here's a quick
example of how we can go from a small image to
a larger image with a learned up sampling. So this is a lot
like convolution. Here's our filter. And the basic idea. The details aren't important. I just want to show
you that it's a thing. You would take the small image. This is how the Keras com
2D transpose layer works. You take a the image. And you use it to
parameritize the filter. So instead of the dot
product, you take the image value and multiply it against
all the filter values. And you write that down
on the output image. And then, just like
convolution, you slide. So we slide again. We multiply that by the filter. We write down the output
values and sum where they overlap. Slide again. And again. But the point is that's
all the layer is doing. It's complicated but not
so bad if you have the time to go through a small example. Conv2DTranspose is a learned up-sampling.
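A tiny sketch of the layer, just to show the shapes:

```python
import tensorflow as tf

# A learned up-sampling: kernel 2, stride 2 doubles width and height.
up = tf.keras.layers.Conv2DTranspose(filters=1, kernel_size=2, strides=2)
x = tf.reshape(tf.constant([[1., 2.],
                            [3., 4.]]), (1, 2, 2, 1))
print(up(x).shape)  # (1, 4, 4, 1): a 2x2 image becomes 4x4
```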
The other layer in there that we haven't talked about is batch normalization. And let me just briefly
explain the idea there. So in a lot of the code
you'll look at for MNIST, if you're training
an image classifier, one of the first things we
do is we normalize the data. So we import some images. And usually, the pixels
range between 0 and 255. And the first thing you'll
see in a lot of tutorials is we divide by 255, which makes
them range between 0 and 1. The reason we do
that, briefly, is that basically
neural networks don't like large numbers as input. And there's
different reasons why they don't like large
numbers, one of which is if you remember the slides
from a dense layer, the input value is multiplied by a weight. If we have a very large input
value or a very large weight, we could get overflow, numeric
overflow or floating-point problems. And it can have bad properties
for gradient descent. So we normally normalize the
numbers to be between 0 and 1. Here's the insight
behind this amazing layer called batch normalization. And let me explain
what's happening and why this is important. And you'll also see that
a lot of the research today in deep learning
is not rocket science. It's just very
early in the field. So here's our
beginners tutorial. We import MNIST. We normalize it to
be between 0 and 1. This means if you are
the first dense layer, if you are this guy,
all the input values are between 0 and 1. You learn your weights. However, if you're the
second dense layer, your input values are not
necessarily between 0 and 1. They're the outputs of whatever
this previous dense layer has produced, which
means your job is harder. So this dense layer just has
to learn weights for that fixed distribution. But the distribution
coming into this guy is changing, which makes the
job harder than it needs to be. So what batch
normalization does, it's a layer that you
would add right here. And if you wanted
to, right here. And the basic idea is
it's a normalizing layer. And there's a lot more
details you can read about. But the basic idea is-- let
me see if I'm awake enough to go through this. Here's some features
coming into a layer. And these are examples. And what we're doing
is batch normalization computes the mean and standard
deviation of each feature. And it just normalizes it before
it goes into the next layer. So batch norm,
basically, in a nutshell is, let's re-normalize the data
to make the distribution going into the layer change more slowly. So it can speed up learning a lot.
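A sketch of where it goes, using the beginners-tutorial style of model:

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.BatchNormalization(),  # re-normalize between layers
    tf.keras.layers.Dense(10, activation='softmax'),
])
```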
Another layer you'll see in DCGAN is dropout. And here's dropout. The good news is-- let's see. You see these dropout layers. Dropout is a really,
really nice layer. And it's easy to use. Dropout is a great way
to prevent overfitting. And here's the basic idea. So we have some cartoon
network full of dense layers. And let's say we're overfitting. We're memorizing the training data. What dropout does is
it randomly deactivates, on every batch, a
subset of the neurons. And it does that by setting
their activations to zero. So dropout basically says
this network is too powerful. Let's randomly turn off a
bunch of neurons at every step. And the reason that this
helps prevent overfitting, it makes it harder for
the network to learn. And the idea is that
because it can't rely on any individual
neuron being on at any individual
step, it has to learn redundant representations. So it has to learn
different ways of detecting the same feature. So dropout is a layer to prevent overfitting.
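A tiny sketch of what the layer actually does:

```python
import numpy as np
import tensorflow as tf

# Dropout zeroes a random subset of activations on each training batch
# (and rescales the rest); it does nothing at inference time.
layer = tf.keras.layers.Dropout(rate=0.5)
x = np.ones((1, 8), dtype='float32')
print(layer(x, training=True))   # roughly half the values are zero
print(layer(x, training=False))  # unchanged
```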
What's cool is, if you learn about dropout: it was invented by Geoff Hinton. And Geoff Hinton had this
really cool thing on Reddit where somebody asked him what's
the intuition behind dropout. And this is what Geoff said. So basically, Geoff was saying-- I think he must be
a really nice guy because he was trying to make
friends with his bank teller. And he was having trouble making
friends with his bank teller because the teller
kept being changed. And basically, he
asked the bank, why are you changing
the bank teller? And it's so he can't
defraud the bank. And so because he can't
rely on any individual bank teller being there at
any individual day, you can't form a friendship. You can't come up with a
conspiracy to defraud the bank. And Drop Out, in the same
way, prevents neurons from always being present. So that's the intuition. I don't think like this
when I go to the bank. All right. Let's just-- I just want to
point you to one more thing. Then, we're going
to play a game. And then, we're going to stop. If you have time, there's
two awesome new GAN tutorials you can go through
and read the papers. The reason I like these is
it's complete code that works. And it works with a
click, which is nice. The first is Pix2Pix, out of
Berkeley, which is beautiful. This is a conditional GAN. And the goal here
is not generate me a random image that looks real. It's generate me an image that
looks real, that also resembles an input image I give you. So this is an input image for
the building facade or facade, not sure. This is the building that
image actually corresponds to. And this is what's
generated by Pix2Pix. And so this is an image
with similar pixel values to this image that
the discriminator isn't able to distinguish
from real or fake. And if you start looking
through these GANs, the main thing to look
at is the loss function. And so if you want to
understand the evolution from DC GAN, which
is MNIST, to Pix2Pix, look at the loss function. And the main difference
in the loss function. The loss function for DC again
is trick the discriminator. The loss function in Pix2Pix
is trick the discriminator and minimize the L1 distance
between the input image and the output image,
which forces the output image to look similar
to the input. That's the main difference.
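In sketch form, the generator loss idea is something like this (the heavy L1 weighting of 100 is what the paper and tutorial use):

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
LAMBDA = 100  # weight on the L1 term

def pix2pix_generator_loss(disc_fake_output, generated, target):
    # Trick the discriminator...
    gan_loss = bce(tf.ones_like(disc_fake_output), disc_fake_output)
    # ...and stay close to the target image.
    l1_loss = tf.reduce_mean(tf.abs(target - generated))
    return gan_loss + LAMBDA * l1_loss
```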
And then the same reasoning applies to CycleGAN, which is the latest one we have. And CycleGAN does
unpaired image translation. And so with Pix2Pix, you have
to have paired training data. So, a facade building, a
map image, satellite image. There's lots of
things you might want to do GANs for that you can't
get paired training data for, like day to night. Even day to night is hard
to get paired training data. If we took a picture of the
NBL at night and in the day, things change. Cars move around. People move around. So it's hard to get. And of course, there isn't
paired training data for horses to zebras, because it doesn't exist. But CycleGAN can do this. And the insight with CycleGAN was you don't have to have
a one to one mapping. What you need is a directory. So if you have a directory
of horse pictures and a directory of
zebra pictures, you can exploit supervision at the level of sets. So CycleGAN was
a really cool thing. All the code is there. All right. Let's stop with that. Let's play one
game really quick. And I'll point you to two
games that, if you're teaching, they can be fun to help
keep students engaged. And there's a point to them. So could I have a
quick volunteer? And this person
should be proud of their artistic ability. Very good artist,
which I am not. Thank you. Come on up. Has anyone seen
Quick Draw before? So this is great
with kids and adults. Anyway, so Quick Draw, by
the way, just got way harder. And I'll explain why in a sec. So if you could do a-- you've seen this. Go for it. So let's try and do like
two or three quick draws. OK, OK. So let's try and
do two or three. Oh well. I hope this isn't
blazingly loud. Draw shorts. QUICK DRAW: I see music note. Oh, I know it's shorts. JOSH GORDON: We don't--
that was amazing. So we don't have audio. But usually it speaks to
you as you're playing it. So I see baseball. I see shorts. So try it again. That was great. QUICK DRAW: I see shoe, or
suitcase, or square, or camera, or stereo. I see stove. STUDENT: Stove? JOSH GORDON: Yeah, it's a stove. QUICK DRAW: Oh, I know. It's-- JOSH GORDON: Yeah, there you go. So let's do one more. This is actually really good. Cannonball. QUICK DRAW: I see line, or
rainbow, or potato, or peanut, or pond. I see watermelon or steak. Oh, I know. It's cannon. This is actually surprisingly-- OK. STUDENT: Do I keep going? JOSH GORDON: Keep going. STUDENT: OK. QUICK DRAW: I see nose,
or line, or pond, or pool. I see skateboard, or
sandwich, or hockey puck. Oh, I know. It's hamburger. JOSH GORDON: Let's do one more. You might've set the new
record for Quick Draw. QUICK DRAW: I see
line, or diving board, or circle, or peanut. I see potato. Oh, I know. It's steak. JOSH GORDON: All right. We're going to stop there. So, nice job. Thank you very much. Thanks. [APPLAUSE] All right. So the first thing I have to
tell you about Quick Draw. If you're teaching a class,
MNIST is boring as hell. But it's a good place to start. A good homework for the
students is Quick Draw. So let me point
you to some code. If anyone wants the URL, you
can grab this screenshot. And there's probably official
versions of this, too. But what this is. It's a little Python
file you can use to make a Quick Draw data set. So Quick Draw, which your
images are now a part of. Quick Draw has an academic data
set of probably like 20 million plus Quick Draw
diagrams at this point. And this code, you can
pick which class names you want. You can say, yo, give me all
the planes, cars, trucks. And you can say
how many images you want, anywhere from like five
all the way up to millions. And what's nice about this is
students, when they go home, they can get some experience
like training a model on a large amount of data. And it doesn't have to
be a blocker, because they can select how much they want. So it's a nice thing, too.
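A sketch of loading one class, assuming you've downloaded a file from the Quick Draw dataset's numpy_bitmap format (28 by 28 grayscale drawings):

```python
import numpy as np

cats = np.load('cat.npy')                # shape: (num_drawings, 784)
cats = cats.reshape(-1, 28, 28) / 255.0  # scale to [0, 1]
print(cats.shape)
```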
Yeah, and this code will walk you through how to get the images out of Quick
Draw and stuff like that. The other thing I have to
say about Quick Draw, which is really interesting. If you look at these drawings,
they're not just pictures. But they're sequences
of brushstrokes. And so what's cool is these
are different elephants from the Quick
Draw elephant set. And what's cool is
the different colors show the different brushstrokes. I don't know the order. But that's in there, too. And what's cool is there's
some really good research. And given that we have this
database, what else can we do with these brushstrokes? And you can use RNNs,
which are usually used to generate text
and stuff like that. But the insight with
David Ha's group was, you can generate Quick Draw images using an RNN. And there's a really
cool game for this. It's called Sketch RNN. Sketch RNN is an RNN that's been
trained on the Quick Draw data set. And what's cool is you can
pick an image from Quick Draw. So if we pick-- I don't want to-- we'll pick penguins. It's the marine biology, right? So it goes like that. And then you start
drawing a penguin. And then you stop. Sketch RNN attempts to
auto complete your penguin. And it looks silly. But it's super impressive. You think about how
hard this is to write. And so basically, what we're
seeing is, of the people in the Quick Draw
data set that started drawing a penguin in
the way that I did, these are the brushstrokes
that might follow. And it's kind of cool. I'm not sure this will
work with penguins. But maybe people start
drawing the beak. And so it's really cool. And there's a surprisingly
large number of images that you can draw. So this one is not as
immediately actionable. You can show students
the Quick Draw data set. Then they can go
train in classifier. This one is more of like,
hey, FYI, this is super cool. All the code for this is online. It's just someday
I would love for us to have a short tutorial. And the reason I
wanted to mention this is if you think about how
this generates images, here's some research. This is very
different from GANs. So GANs synthesize these
beautiful photorealistic images pixel by pixel. But that's not how
people draw, right? If you start drawing
a scene, you're not going to draw
it pixel by pixel. You draw in brushstrokes. So this is an
RNN-based solution that learns to draw images
brushstroke by brushstroke, which is very, very different. Anyway, that's all I got. So basically, here's some
tutorials you can look at. Also, for people
that are teaching, here's three book
recommendations. The first two books
are not academic. They're like 40 bucks. These are how you
do the thing books. How do you train an
image classifier? How do you train
a text classifier? How do you make an RNN work? They're both great. If you get the first book, only
get the TensorFlow 2 version, which is in
prerelease right now. So, the second edition. The deep learning
with Python book. This is a Manning book. It's by François Chollet, who wrote Keras. Everything from this book
works in TensorFlow 2 just by changing an import. And then if you want
a textbook, it's free. It's Ian Goodfellow's
Deep Learning book. This is a little bit-- it might
be instructive for some of you. I struggled with this a bit. It was really hard. To me, this is more of a
really great reference. Yeah, that's all I got. Thanks a lot. I can answer any questions. Or yeah, thanks. [APPLAUSE]