- Okay, let's get started. Alright, so welcome to lecture five. Today we're going to be getting
to the title of the class, Convolutional Neural Networks. Okay, so a couple of
administrative details before we get started. Assignment one is due Thursday, April 20, 11:59 p.m. on Canvas. We're also going to be releasing
assignment two on Thursday. Okay, so a quick review of last time. We talked about neural
networks, and how we had the running example of
the linear score function that we talked about through
the first few lectures. And then we turned this
into a neural network by stacking these linear
layers on top of each other with non-linearities in between. And we also saw that
this could help address the mode problem, where a single template per class isn't enough: we are able to learn intermediate templates that are looking for, for example, different types of cars, right, a red car versus a yellow car and so on, and to combine these together to come up with the final score function for a class. Okay, so today we're going to talk about convolutional neural networks, which is basically the same sort of idea, but now we're going to learn convolutional layers that reason on top of the image while explicitly maintaining its spatial structure. So, let's first talk a little bit about the history of neural
networks, and then also how convolutional neural
networks were developed. So we can go all the way back
to 1957 with Frank Rosenblatt, who developed the Mark
I Perceptron machine, which was the first
implementation of an algorithm called the perceptron, which
had sort of the similar idea of getting score functions,
right, using some, you know, W times X plus a bias. But here the outputs are going
to be either one or a zero. And then in this case
we have an update rule, so an update rule for our weights, W, which also looks kind of similar
to the type of update rule that we're also seeing in
backprop, but in this case there was no principled
backpropagation technique yet, we just sort of took the
weights and adjusted them in the direction towards
the target that we wanted. So in 1960, we had Widrow and Hoff, who developed Adaline and
Madaline, which was the first time that we were able to
get, to start to stack these linear layers into
multilayer perceptron networks. And so this is starting to now
look kind of like this idea of neural network layers, but
we still didn't have backprop or any sort of principled
way to train this. And so the first time
backprop was really introduced was in 1986 with Rumelhart. And so here we can start
seeing, you know, these kinds of equations with the chain
rule and the update rules that we're starting to
get familiar with, right, and so this is the first time we started to have a principled way to train these kinds of network architectures. And so after that, you know,
it still wasn't able to scale to very large neural networks,
and so there was sort of a period in which there wasn't a whole lot of new things happening
here, or a lot of popular use of these kinds of networks. And so this really started
being reinvigorated around the 2000s, so in
2006, there was this paper by Geoff Hinton and Ruslan Salakhutdinov, which basically showed that we could train a deep neural network effectively. But it was still not quite the modern iteration
of neural networks. It required really careful initialization in order to be able to do backprop, and so what they had
here was they would have this first pre-training
stage, where you model each hidden layer through this kind of, through a restricted Boltzmann machine, and so you're going to get
some initialized weights by training each of
these layers iteratively. And so once you get all
of these hidden layers you then use that to
initialize your, you know, your full neural network,
and then from there you do backprop and fine tuning of that. And so when we really started to get the first really strong results using neural networks, and what really sparked the whole craze of using these kinds of networks widely, was around 2012, when we first had strong results for speech recognition, and so this is work out
of Geoff Hinton's lab for acoustic modeling
and speech recognition. And then for image recognition,
2012 was the landmark paper from Alex Krizhevsky
in Geoff Hinton's lab, which introduced the first
convolutional neural network architecture that was able to get really strong results
on ImageNet classification. And so it took the ImageNet,
image classification benchmark, and was able to dramatically reduce the error on that benchmark. And so since then, you
know, ConvNets have gotten really widely used in all
kinds of applications. So now let's step back and
take a look at what gave rise to convolutional neural
networks specifically. And so we can go back to the 1950s, where Hubel and Wiesel did
a series of experiments trying to understand how neurons in the visual cortex worked, and they studied this
specifically for cats. And so we talked a little bit
about this in lecture one, but basically in these
experiments they put electrodes in the cat, into the cat brain, and they gave the cat
different visual stimuli. Right, and so, things like, you know, different kinds of edges, oriented edges, different sorts of
shapes, and they measured the response of the
neurons to these stimuli. And so there were a couple
of important conclusions that they were able to
make, and observations. And so the first thing they found was that there's sort of this topographical
mapping in the cortex. So nearby cells in the
cortex also represent nearby regions in the visual field. And so you can see, for example, on the right here, if you take kind of the spatial mapping and map this onto the visual cortex, the more peripheral regions are these blue areas, you know, farther away from the center. And so they also discovered
that these neurons had a hierarchical organization. And so if you look at different
types of visual stimuli they were able to find
that at the earliest layers retinal ganglion cells
were responsive to things that looked kind of like
circular regions of spots. And then on top of that
there are simple cells, and these simple cells are
responsive to oriented edges, so different orientation
of the light stimulus. And then going further,
they discovered that these were then connected to more complex cells, which were responsive to
both light orientation as well as movement, and so on. And you get, you know,
increasing complexity, for example, hypercomplex
cells are now responsive to movement with kind
of an endpoint, right, and so now you're starting
to get the idea of corners and then blobs and so on. And so then in 1980, the neocognitron
was the first example of a network architecture, a model, that had this idea of
simple and complex cells that Hubel and Wiesel had discovered. And in this case Fukushima put these into these alternating layers of
simple and complex cells, where you had these simple cells that had modifiable parameters,
and then complex cells on top of these that
performed a sort of pooling so that it was invariant to, you know, different minor modifications
from the simple cells. And so this is work that
was in the 1980s, right, and so by 1998 Yann LeCun basically showed the first example of applying backpropagation
and gradient-based learning to train convolutional neural networks that did really well on
document recognition. And specifically they
were able to do a good job of recognizing digits of zip codes. And so these were then used pretty widely for zip code recognition
in the postal service. But beyond that it
wasn't able to scale yet to more challenging and
complex data, right, digits are still fairly simple and a limited set to recognize. And so this is where
Alex Krizhevsky, in 2012, gave the modern incarnation of
convolutional neural networks and his network we sort of
colloquially call AlexNet. But this network really didn't look that much different than the convolutional neural networks that Yann LeCun was working with. It was scaled to be larger and deeper, and the most important parts were that it was now able to take advantage of the large amount of data that's now available, in web images, in the ImageNet dataset, as well as take advantage of the parallel computing power of GPUs. And so we'll talk more about that later. But fast forwarding to today, so now, you know, ConvNets are used everywhere. And so we have the initial
classification results on ImageNet from Alex Krizhevsky. This is able to do a really
good job of image retrieval. You can see that when we're
trying to retrieve a flower for example, the features that are learned are really powerful for
doing similarity matching. We also have ConvNets that
are used for detection. So we're able to do a really
good job of localizing where in an image is, for
example, a bus, or a boat, and so on, and draw precise
bounding boxes around that. We're able to go even deeper
beyond that to do segmentation, right, and so these are now richer tasks where we're not looking
for just the bounding box but we're actually going
to label every pixel in the outline of, you know,
trees, and people, and so on. And these kind of algorithms are used in, for example, self-driving cars, and a lot of this is powered
by GPUs as I mentioned earlier, that's able to do parallel processing and able to efficiently
train and run these ConvNets. And so we have modern
powerful GPUs as well as ones that work in embedded
systems, for example, that you would use in a self-driving car. So we can also look at some
of the other applications that ConvNets are used for. So, face-recognition, right,
we can put an input image of a face and get out a
likelihood of who this person is. ConvNets are applied to video,
and so this is an example of a video network that
looks at both images as well as temporal information, and from there is able to classify videos. We're also able to do pose recognition. Being able to recognize, you know, shoulders, elbows, and different joints. And so here are some images
of our fabulous TA, Lane, in various kinds of pretty
non-standard human poses. But ConvNets are able
to do a pretty good job of pose recognition these days. They're also used in game playing. So some of the work in
reinforcement learning, deep reinforcement learning
that you may have seen, playing Atari games, and Go, and so on, and ConvNets are an important
part of all of these. Some other applications,
so they're being used for interpretation and
diagnosis of medical images, for classification of galaxies,
for street sign recognition. There's also whale recognition, this is from a recent Kaggle Challenge. We also have examples of
looking at aerial maps and being able to draw
out where are the streets on these maps, where are the buildings, and being able to segment all of these. And then beyond recognition tasks like classification and detection, we also have tasks like image captioning,
where given an image, we want to write a sentence description about what's in the image. And so this is something
that we'll go into a little bit later in the class. And we also have, you know,
really, really fancy and cool kind of artwork that we can
do using neural networks. And so on the left is an
example of a deep dream, where we're able to take
images and kind of hallucinate different kinds of objects
and concepts in the image. There's also neural style type
work, where we take an image and we're able to re-render this image using the style of a particular artist or artwork, right. And so here we can take, for
example, Van Gogh on the right, Starry Night, and use that to redraw our original image using that style. And Justin has done a lot of work in this and so if you guys are interested, these are images produced
by some of his code and you guys should talk
to him more about it. Okay, so basically, you know,
this is just a small sample of where ConvNets are being used today. But there's really a huge amount
that can be done with this, right, and so, you know,
for you guys' projects, sort of, you know, let
your imagination go wild and we're excited to see
what sorts of applications you can come up with. So today we're going to talk about how convolutional neural networks work. And again, same as with neural
networks, we're going to first talk about how they work
from a functional perspective without any of the brain analogies. And then we'll talk briefly
about some of these connections. Okay, so, last lecture, we talked about this idea of a fully connected layer. And how, you know, for
a fully connected layer what we're doing is we operate
on top of these vectors, right, and so let's say we
have, you know, an image, a 3D image, 32 by 32 by three, so some of the images that we
were looking at previously. We'll take that, we'll stretch
all of the pixels out, right, and then we have this
3072 dimensional vector, for example in this case. And then we have these weights, right, so we're going to multiply
this by a weight matrix. And so here for example our W
we're going to say is 10 by 3072. And then we're going
to get the activations, the output of this layer,
right, and so in this case, we take each of our 10 rows and we do a dot product with the 3072-dimensional input. Each dot product gives us one number, which is kind of the value of that neuron, and so in this case we're going to have 10 of these neuron outputs.
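Just to make that concrete, here's a minimal NumPy sketch of this fully connected layer; the shapes match the example, but the random values and variable names are just for illustration.

```python
import numpy as np

x = np.random.randn(32, 32, 3)     # a toy 32x32x3 input image
x_flat = x.reshape(-1)             # stretch it out into a 3072-dimensional vector
W = np.random.randn(10, 3072)      # weight matrix: one row per output neuron
b = np.random.randn(10)            # bias

scores = W.dot(x_flat) + b         # each of the 10 rows dotted with the input
print(scores.shape)                # (10,)
```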
So a convolutional layer, the main difference between this and the fully connected layer that we've been talking about, is that here we want to
preserve spatial structure. And so taking this 32 by 32 by three image that we had earlier, instead
of stretching this all out into one long vector, we're
now going to keep the structure of this image, right, this
three dimensional input. And then what we're going to do is our weights are going to
be these small filters, so in this case for example, a
five by five by three filter, and we're going to take this filter and we're going to slide
it over the image spatially and compute dot products
at every spatial location. And so we're going to go into
detail of exactly how this works. So, our filters, first of all, always extend the full
depth of the input volume. And so they're going to be
just a smaller spatial area, so in this case five by five, right, instead of our full 32
by 32 spatial input, but they're always going to go
through the full depth, right, so here we're going to
take five by five by three. And then we're going to take this filter and at a given spatial location we're going to do a dot product between this filter and
then a chunk of a image. So we're just going to overlay this filter on top of a spatial location in the image, right, and then do the dot product, the multiplication of each
element of that filter with each corresponding element
in that spatial location that we've just plopped it on top of. And then this is going
to give us a dot product. So in this case, we have
five times five times three, so 75 multiplications, plus the bias term. And so this is basically taking our filter W and computing W transpose times X plus the bias. So is that clear how this works? Yeah, question. [faint speaking] Yeah, so the question is,
when we do the dot product do we turn the five by five
by three into one vector? Yeah, in essence that's what you're doing. You can, I mean, you
can think of it as just plopping it on and doing the
element-wise multiplication at each location, but this is
going to give you the same result as if you stretched out
the filter at that point, stretched out the input
volume that it's laid over, and then took the dot product, and that's what's written
here, yeah, question. [faint speaking] Oh, this is, so the question is, any intuition for why
this is a W transpose? And this was just, not really, this is just the notation
that we have here to make the math work
out as a dot product. So it just depends on whether,
how you're representing W and whether in this case
if we look at the W matrix this happens to be each column
and so we're just taking the transpose to get a row out of it. But there's no intuition here, we're just taking the filters of W and we're stretching it
out into a 1D vector, and in order for it to be a dot product it has to be a one by N vector. [faint speaking] Okay, so the question is, is W here not five by five
by three, it's one by 75. So that's the case, right, if we're going to do this dot product
of W transpose times X, we have to stretch it out first before we do the dot product. So we take the five by five by three, and we just take all these values and stretch them out into a long vector.
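As a quick sanity check (a toy sketch with random arrays, not anything from the slides), plopping the filter onto a chunk and doing the element-wise multiply-and-sum gives exactly the same number as stretching both out and taking a dot product.

```python
import numpy as np

w = np.random.randn(5, 5, 3)       # one 5x5x3 filter
patch = np.random.randn(5, 5, 3)   # the image chunk the filter is plopped onto

v1 = np.sum(w * patch)                        # element-wise multiply, then sum
v2 = w.reshape(-1).dot(patch.reshape(-1))     # stretch both out, then dot product
print(np.allclose(v1, v2))                    # True
```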
And so again, similar to the other question, the actual operation that we're doing here is plopping our filter on top of a spatial location in the image and multiplying all of the
corresponding values together, but in order just to make it
kind of an easy expression similar to what we've seen before we can also just stretch
each of these out, make sure that dimensions
are transposed correctly so that it works out as a dot product. Yeah, question. [faint speaking] Okay, the question is, how do we slide the filter over the image. We'll go into that next, yes. [faint speaking] Okay, so the question is,
should we rotate the kernel by 180 degrees to better
match the convolution, the definition of a convolution. And so the answer is that
we'll also show the equation for this later, but
we're using convolution as kind of a looser definition
of what's happening. So for people from signal processing, what we are actually technically doing, if you want to call this a convolution, is we're convolving with the
flipped version of the filter. But for the most part, we
just don't worry about this and we just, yeah, do this operation and it's like a convolution in spirit. Okay, so... Okay, so we had a question
earlier, how do we, you know, slide this over all the spatial locations. Right, so what we're going to do is we're going to take this
filter, we're going to start at the upper left-hand
corner and basically center our filter on top of every
pixel in this input volume. And at every position, we're
going to do this dot product and this will produce one value in our output activation map. And so then we're going
to just slide this around. The simplest version
is just at every pixel we're going to do this
operation and fill in the corresponding point
in our output activation. You can see here that the
dimensions are not exactly what would happen, right,
if you're going to do this. I had 32 by 32 in the input and I'm having 28 by 28 in the output, and so we'll go into
examples later of the math of exactly how this is going
to work out dimension-wise, but basically you have a choice of how you're going to slide this, whether you go at every
pixel or whether you slide, let's say, you know, two
input values over at a time, two pixels over at a time, and so you can get different size outputs depending on how you choose to slide. But you're basically doing this
operation in a grid fashion.
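Here's a rough NumPy sketch of that grid-style sliding, assuming the shapes from the example (a five by five by three filter over a 32 by 32 by three input, stride one, no padding), just to make the bookkeeping concrete; this is the naive loop version.

```python
import numpy as np

x = np.random.randn(32, 32, 3)     # input volume
w = np.random.randn(5, 5, 3)       # one filter, extending through the full depth
b = 0.1                            # bias term

out = np.zeros((28, 28))           # (32 - 5) / 1 + 1 = 28
for i in range(28):
    for j in range(28):
        patch = x[i:i+5, j:j+5, :]          # the chunk under the filter
        out[i, j] = np.sum(w * patch) + b   # dot product at this spatial location
print(out.shape)                   # (28, 28): one activation map
```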
Okay, so what we just saw earlier: we take one filter, slide it over all of the spatial locations in the image, and then we get this activation map out, which is the value of that
filter at every spatial location. And so when we're dealing
with a convolutional layer, we want to work with
multiple filters, right, because each filter is kind
of looking for a specific type of template or concept
in the input volume. And so we're going to have
a set of multiple filters, and so here I'm going
to take a second filter, this green filter, which is
again five by five by three, I'm going to slide this over
all of the spatial locations in my input volume, and
then I'm going to get out this second green activation
map also of the same size. And we can do this for as many filters as we want to have in this layer. So for example, if we have six filters, six of these five by five filters, then we're going to get in
total six activation maps out. All of, so we're going
to get this output volume that's going to be
basically six by 28 by 28. Right, and so a preview
of how we're going to use these convolutional layers
in our convolutional network is that our ConvNet is
basically going to be a sequence of these convolutional layers stacked on top of each other,
same way as what we had with the simple linear layers
in their neural network. And then we're going to intersperse these with activation functions, so for example, a ReLU
activation function. Right, and so you're going to
get something like Conv, ReLU, and usually also some pooling layers, and then you're just going
to get a sequence of these each creating an output
that's now going to be the input to the next convolutional layer. Okay, and so each of these
layers, as I said earlier, has multiple filters, right, many filters. And each of the filters is
producing an activation map. And so when you look at
multiple of these layers stacked together in a ConvNet,
what ends up happening is you end up learning this
hierarchy of filters, where the filters at the
earlier layers usually represent low-level features that
you're looking for. So things kind of like edges, right. And then at the mid-level, you're going to get more
complex kinds of features, so maybe it's looking more for things like corners and blobs and so on. And then at higher-level features, you're going to get
things that are starting to more resemble concepts than blobs. And we'll go into more
detail later in the class in how you can actually
visualize all these features and try and interpret what your network, what kinds of features
your network is learning. But the important thing for
now is just to understand that what these features end up being when you have a whole stack of these, is these types of simple
to more complex features. [faint speaking] Yeah. Oh, okay. Oh, okay, so the question
is, what's the intuition for increasing the depth each time. So here I had three filters
in the original layer and then six filters in the next layer. Right, and so this is
mostly a design choice. You know, people in practice have found certain types of these
configurations to work better. And so later on we'll go into
case studies of different kinds of convolutional
neural network architectures and design choices for these and why certain ones
work better than others. But yeah, basically the choice of, you're going to have many design choices in a convolutional neural network, the size of your filter, the stride, how many filters you have, and so we'll talk about
this all more later. Question. [faint speaking] Yeah, so the question is,
as we're sliding this filter over the image spatially it
looks like we're sampling the edges and corners less
than the other locations. Yeah, that's a really good point, and we'll talk I think in a few slides about how we try and compensate for that. Okay, so each of these
convolutional layers that we have stacked together,
we saw how we're starting with simpler features and then aggregating these into more complex features later on. And so in practice this is consistent with what Hubel and Wiesel
noticed in their experiments, right, that we had these simple cells at the earlier stages of processing, followed by more complex cells later on. And so even though we didn't explicitly force our ConvNet to learn
these kinds of features, in practice when you give it this type of hierarchical structure and
train it using backpropagation, these are the kinds of filters
that end up being learned. [faint speaking] Okay, so yeah, so the question is, what are we seeing in
these visualizations. And so, alright so, in
these visualizations, like, if we look at this Conv1, the
first convolutional layer, each part of this grid is one neuron. And so what we've visualized here is what the input looks
like that maximizes the activation of that particular neuron. So what sort of image you would get that would give you the largest value, make that neuron fire and
have the largest value. And so the way we do this is basically by doing backpropagation from
a particular neuron activation and seeing what in the input will trigger, will give you the highest
values of this neuron. And this is something
that we'll talk about in much more depth in a later lecture about how we create all
of these visualizations. But basically each element of these grids is showing what in the
input would look like that basically maximizes the
activation of the neuron. So in a sense, what is
the neuron looking for? Okay, so here is an example
of some of the activation maps produced by each filter, right. So we can visualize up here on the top we have this whole row of
example five by five filters, and so this is basically a real
case from a trained ConvNet where each of these is
what a five by five filter looks like, and then as we
convolve this over an image, so in this case I think it's a corner of a car, the car light, we see what the activation looks like. Right, and so here for example, if we look at this first
one, this red filter, filter like with a red box around it, we'll see that it's looking for, the template looks like an
edge, right, an oriented edge. And so if you slide it over the image, it'll have a high value,
a more white value where there are edges in
this type of orientation. And so each of these activation
maps is kind of the output of sliding one of these filters over and where these filters
are causing, you know, where this sort of template
is more present in the image. And so the reason we call
these convolutional is because this is related to the
convolution of two signals, and so someone pointed out earlier that this is basically this
convolution equation over here, for people who have
seen convolutions before in signal processing, and in practice it's actually more like a correlation where we're convolving
with the flipped version of the filter, but this
is kind of a subtlety, it's not really important for
the purposes of this class. But basically if you're
writing out what you're doing, it has an expression that
looks something like this, which is the standard
definition of a convolution. But this is basically
just taking a filter, sliding it spatially over the image and computing the dot
product at every location. Okay, so you know, as I
had mentioned earlier, like what our total
convolutional neural network is going to look like is we're
going to have an input image, and then we're going to pass it through this sequence of layers, right, where we're going to have a
convolutional layer first. We usually have our
non-linear layer after that. So ReLU is something
that's very commonly used that we're going to talk about more later. And then we have these Conv,
ReLU, Conv, ReLU layers, and then once in a while
we'll use a pooling layer that we'll talk about later as well that basically downsamples the
size of our activation maps. And then finally at the end
of this we'll take our last convolutional layer output
and then we're going to use a fully connected layer
that we've seen before, connected to all of these
convolutional outputs, and use that to get a final score function basically like what we've
already been working with. Okay, so now let's work out some examples of how the spatial dimensions work out. So let's take our 32 by 32
by three image as before, right, and we have our five
by five by three filter that we're going to slide over this image. And we're going to see how
we're going to use that to produce exactly this
28 by 28 activation map. So let's assume that we actually
have a seven by seven input just to be simpler, and let's assume we have a three by three filter. So what we're going to do is we're going to take this filter, plop it down in our
upper left-hand corner, right, and we're going to
multiply, do the dot product, multiply all these values
together to get our first value, and this is going to go into
the upper left-hand value of our activation map. Right, and then what
we're going to do next is we're just going to take this filter, slide it one position to the right, and then we're going to get
another value out from here. And so we can continue with
this to have another value, another, and in the end
what we're going to get is a five by five output, right, because what fit was
basically sliding this filter a total of five spatial
locations horizontally and five spatial locations vertically. Okay, so as I said before there's different kinds of
design choices that we can make. Right, so previously I
slid it at every single spatial location and the
interval at which I slide I'm going to call the stride. And so previously we
used the stride of one. And so now let's see what happens if we have a stride of two. Right, so now we're going
to take our first location the same as before, and
then we're going to skip this time two pixels over
and we're going to get our next value centered at this location. Right, and so now if
we use a stride of two, we have in total three
of these that can fit, and so we're going to get
a three by three output. Okay, and so what happens when
we have a stride of three, what's the output size of this? And so in this case, right, we have three, we slide it over by three again, and the problem is that here
it actually doesn't fit. Right, so we slide it over by three and now it doesn't fit
nicely within the image. And so what we in practice we
just, it just doesn't work. We don't do convolutions like this because it's going to lead to
asymmetric outputs happening. Right, and so just kind
of looking at the way that we computed how many, what
the output size is going to be, this actually can work into a nice formula where we take our
dimension of our input N, we have our filter size
F, we have our stride at which we're sliding along,
and our final output size, the spatial dimension of each output, is going to be (N - F) / stride + 1, right, and you can kind of
see this as a, you know, if I'm going to take my
filter, let's say I fill it in at the very last possible
position that it can be in and then take all the pixels before that, how many instances of moving
by this stride can I fit in. Right, and so that's how this equation works out.
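As a small sketch, you can turn that formula into a little helper and plug in the numbers from these examples; the function name here is made up, and it just errors out when the filter doesn't fit evenly, like the stride-three case.

```python
def conv_output_size(N, F, stride):
    """Spatial output size for an N x N input and an F x F filter, no padding."""
    assert (N - F) % stride == 0, "filter does not fit evenly across the input"
    return (N - F) // stride + 1

print(conv_output_size(7, 3, 1))   # 5
print(conv_output_size(7, 3, 2))   # 3
# conv_output_size(7, 3, 3) trips the assert: (7 - 3) / 3 is not an integer
```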
And so as we saw before, if we have N equal to seven and F equal to three, if
we want a stride of one we plug it into this
formula, we get five by five as we had before, and the same thing works for stride two, giving three by three. And with a stride of three,
this doesn't really work out. And so in practice it's actually common to zero pad the borders in order to make the size work out to what we want it to. And so this is kind of
related to a question earlier, which is what do we do,
right, at the corners. And so what in practice happens is we're going to actually pad
our input image with zeros and so now you're going to
be able to place a filter centered at the upper
right-hand pixel location of your actual input image. Okay, so here's a question,
so who can tell me if I have my same input, seven by seven, three by three filter, stride one, but now I pad with a one pixel border, what's the size of my output going to be? [faint speaking] So, I heard some sixes, heard some sev, so remember we have this
formula that we had before. So if we plug in N is equal
to seven, F is equal to three, right, and then our
stride is equal to one. So what we actually get is seven minus three is four, divided by one, plus one, which is five.
this formula a little bit, right, so this was actually,
this formula is the case where we don't have zero padded pixels. But if we do pad it, then if
you now take your new output and you slide it along, you'll see that actually
seven of the filters fit, so you get a seven by seven output. And plugging in our
original formula, right, so our N now is not seven, it's nine, so if we go back here
we have N equals nine minus a filter size of
three, which gives six. Right, divided by our
stride, which is one, and so still six, and then
plus one we get seven. Right, and so once you've padded it you want to incorporate this
padding into your formula. Yes, question. [faint speaking] Seven, okay, so the question is, what's the actual output of the size, is it seven by seven or
seven by seven by three? The output is going to be seven by seven by the number of filters that you have. So remember each filter is
going to do a dot product through the entire depth
of your input volume. But then that's going to
produce one number, right, so each filter is, let's
see if we can go back here. Each filter is producing
a one by seven by seven in this case activation map
output, and so the depth is going to be the number
of filters that we have. [faint speaking] Sorry, let me just, one second go back. Okay, can you repeat your question again? [muffled speaking] Okay, so the question is, how
does this connect to before when we had a 32 by 32
by three input, right. So our input had depth
and here in this example I'm showing a 2D example with no depth. And so yeah, I'm showing
this for simplicity but in practice you're going to take your, you're going to multiply
throughout the entire depth as we had before, so you're going to, your filter is going to be
in this case a three by three spatial filter by whatever
input depth that you had. So three by three by three in this case. Yeah, everything else stays the same. Yes, question. [muffled speaking] Yeah, so the question
is, does the zero padding add some sort of extraneous
features at the corners? And yeah, so I mean, we're
doing our best to still, get some value and do, like, process that region of the image, and so zero padding is
kind of one way to do this, where I guess we can, we are detecting part of this template in this region. There's also other ways
to do this that, you know, you can try and like,
mirror the values here or extend them, and so it
doesn't have to be zero padding, but in practice this is one
thing that works reasonably. And so, yeah, so there is a
little bit of kind of artifacts at the edge and we sort of just, you do your best to deal with it. And in practice this works reasonably. I think there was another question. Yeah, question. [faint speaking] So if we have non-square
images, do we ever use a stride that's different
horizontally and vertically? So, I mean, there's nothing
stopping you from doing that, you could, but in practice we just usually take the same stride, we
usually operate on square regions and we just, yeah we usually just take the same stride everywhere
and it's sort of like, in a sense it's a little bit like, it's a little bit like the
resolution at which you're, you know, looking at this image, and so usually there's kind
of, you might want to match sort of your horizontal
and vertical resolutions. But, yeah, so in practice you could but really people don't do that. Okay, another question. [faint speaking] So the question is, why
do we do zero padding? So the way we do zero padding is to maintain the same
input size as we had before. Right, so we started with seven by seven, and if we looked at just
starting your filter from the upper left-hand
corner, filling everything in, right, then we get a smaller size output, but we would like to maintain
our full size output. Okay, so, yeah, so we saw how padding
can basically help you maintain the size of the
output that you want, as well as apply your filter at these, like, corner regions and edge regions. And so in general in terms of choosing, you know, your stride, your
filter, your filter size, your stride size, zero
padding, what's common to see is filters of size three
by three, five by five, seven by seven, these are
pretty common filter sizes. And so each of these, for three by three you will want to zero pad with one in order to maintain
the same spatial size. If you're going to do five by five, you can work out the math,
but it's going to come out to zero padding by two. And then for seven by seven you want to zero pad by three. In general, to preserve the spatial size with stride one, you zero pad by (F - 1) / 2.
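Extending the earlier helper with padding (again just an illustrative sketch, with made-up function names), you can check that padding by (F - 1) / 2 with stride one preserves the spatial size for these filter sizes.

```python
def conv_output_size_padded(N, F, stride, P):
    """Spatial output size with P pixels of zero padding on each side."""
    return (N + 2 * P - F) // stride + 1

for F in (3, 5, 7):
    P = (F - 1) // 2                                   # pad by 1, 2, 3 respectively
    print(F, P, conv_output_size_padded(32, F, 1, P))  # output size stays 32
```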
Okay, and so again, the motivation for doing this type of zero padding and trying to maintain
the input size, right, so we kind of alluded to this before, but if you have multiple of
these layers stacked together, you'll see that, you know,
if we don't do this kind of zero padding, or any kind of padding, we're going to really
quickly shrink the size of the outputs that we have. Right, and so this is not
something that we want. Like, you can imagine if you
have a pretty deep network then very quickly your, the
size of your activation maps is going to shrink to
something very small. And this is bad both because
we're kind of losing out on some of this information, right, now you're using a much
smaller number of values in order to represent your original image, so you don't want that. And then at the same time also as we talked about this earlier, your also kind of losing sort of some of
this edge information, corner information that each time we're losing out and
shrinking that further. Okay, so let's go through
a couple more examples of computing some of these sizes. So let's say that we have an input volume which is 32 by 32 by three. And here we have 10 five by five filters. Let's use stride one and pad two. And so who can tell me what's the output volume size of this? So you can think about
the formula earlier. Sorry, what was it? [faint speaking] 32 by 32 by 10, yes that's correct. And so the way we can see this, right, is that we have our input size N of 32. Then in this case we want to augment it by the padding that we added onto this. So we padded by two on each side, right, so the total width and total height are actually going to be 32 plus four, which is 36. And then minus our filter size five, divided by one, plus one, and we get 32. So our output is going to
be 32 by 32 for each filter. And then we have 10 filters total, so we have 10 of these activation maps, and our total output volume
is going to be 32 by 32 by 10. Okay, next question, so what's the number of
parameters in this layer? So remember we have 10
five by five filters. [faint speaking] I kind of heard something,
but it was quiet. Can you guys speak up? 250, okay so I heard 250, which is close, but remember that we're
also, our input volume, each of these filters
goes through the full depth of the input volume. So maybe this wasn't clearly written here because each of the filters
is five by five spatially, but implicitly we also have
the depth in here, right. It's going to go through the whole volume. So I heard, yeah, 750 I heard. Almost there, this is
kind of a trick question 'cause also remember
we usually always have a bias term, right, so
in practice each filter has five by five by three
weights, plus our one bias term, so we have 76 parameters per filter, and then we have 10 of these total, and so there's 760 total parameters.
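Here's a quick check of that worked example in plain Python, just as a sketch of the arithmetic.

```python
N, depth = 32, 3          # input volume: 32 x 32 x 3
F, stride, P = 5, 1, 2    # ten 5 x 5 filters, stride 1, pad 2
num_filters = 10

out_size = (N + 2 * P - F) // stride + 1
print(out_size, out_size, num_filters)        # 32 32 10 -> output volume is 32 x 32 x 10

params = num_filters * (F * F * depth + 1)    # + 1 for each filter's bias
print(params)                                 # 760
```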
Okay, and so here's just a summary of the convolutional layer that you guys can read a little bit more carefully later on. We have our input volume of a certain dimension, and then we have all of these choices: the number of filters, the filter size, the stride, and the amount of zero padding, and you can basically use all of these and go through the computations
that we talked about earlier in order to find out what
your output volume is actually going to be and how many total
parameters that you have. And so some common settings of this. You know, we talked earlier
about common filter sizes of three by three, five by five. Stride is usually one
and two is pretty common. And then your padding P is
going to be whatever fits, like, whatever will
preserve your spatial extent is what's common. And then the total number of filters K, usually we use powers of two
just to be nice, so, you know, 32, 64, 128 and so on, 512, these are pretty common
numbers that you'll see. And just as an aside, we can also do a one by one convolution, this still makes perfect sense where given a one by one convolution we still slide it over
each spatial extent, but now, you know, the spatial region is not really five by five it's just kind of the
trivial case of one by one, but we are still having this filter go through the entire depth. Right, so this is going
to be a dot product through the entire depth
of your input volume. And so the output here, right,
if we have an input volume of 56 by 56 by 64 depth and
we're going to do one by one convolution with 32 filters,
then our output is going to be 56 by 56 by our number of filters, 32.
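One way to see the one by one convolution is as the same dot product through depth done at every pixel; here's a minimal NumPy sketch with the shapes from this example and random weights just for illustration.

```python
import numpy as np

x = np.random.randn(56, 56, 64)    # input volume with depth 64
W = np.random.randn(32, 64)        # 32 filters, each 1 x 1 x 64
b = np.random.randn(32)

# At every spatial location, dot the 64-dim depth column with each filter.
out = x.reshape(-1, 64).dot(W.T).reshape(56, 56, 32) + b
print(out.shape)                   # (56, 56, 32)
```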
Okay, and so here's an example of a convolutional layer in Torch, a deep learning framework. And so you'll see that,
you know, last lecture we talked about how you can go into these deep learning frameworks,
you can see these definitions of each layer, right,
where they have kind of the forward pass and the backward pass implemented for each layer. And so you'll see convolutions, spatial convolution is going
to be just one of these, and then the arguments
that it's going to take are going to be all of these
design choices of, you know, I mean, I guess your
input and output sizes, but also your choices of
like your kernel width, your kernel size, padding,
and these kinds of things. Right, and so if we look at
another framework, Caffe, you'll see something very similar, where again now when you're
defining your network you define networks in Caffe
using this kind of, you know, proto text file where you're specifying each of your design choices for your layer and you can see for a convolutional layer will say things like, you
know, the number of outputs that we have, this is going
to be the number of filters for Caffe, as well as the kernel
size and stride and so on. Okay, and so I guess before I go on, any questions about convolution, how the convolution operation works? Yes, question. [faint speaking] Yeah, so the question is, what's the intuition behind
how you choose your stride. And so at one sense it's
kind of the resolution at which you slide it on, and
usually the reason behind this is because when we have a larger stride what we end up getting as the output is a down sampled image, right, and so what this downsampled
image lets us have is both, it's a way, it's kind of
like pooling in a sense but it's just a different
and sometimes works better way of doing pooling is one
of the intuitions behind this, 'cause you get the same effect
of downsampling your image, and then also as you're doing
this you're reducing the size of the activation maps
that you're dealing with at each layer, right, and so
this also affects later on the total number of
parameters that you have because for example at the
end of all your Conv layers, now you might put on fully
connected layers on top, for example, and now the
fully connected layer's going to be connected to every value of your convolutional output, right, and so a smaller one will
give you smaller number of parameters, and so now
you can get into, like, basically thinking about
trade offs of, you know, number of parameters you
have, the size of your model, overfitting, things
like that, and so yeah, these are kind of some of the things that you want to think about
with choosing your stride. Okay, so now if we look a
little bit at kind of the, you know, brain neuron view
of a convolutional layer, similar to what we
looked at for the neurons in the last lecture. So what we have is that
at every spatial location, we take a dot product between a filter and a specific part of the image, right, and we get one number out from here. And so this is the same idea of doing these types
of dot products, right, taking your input, weighting
it by these Ws, right, values of your filter, these
weights that are the synapses, and getting a value out. But the main difference
here is just that now your neuron has local connectivity. So instead of being connected
to the entire input, it's just looking at a local
region spatially of your image. And so this looks at a local region and then now you're going
to get kind of, you know, this, how much this
neuron is being triggered at every spatial location in your image. Right, so now you preserve
the spatial structure and you can say, you
know, be able to reason on top of these kinds of
activation maps in later layers. And just a little bit of terminology, again for, you know, we have
this five by five filter, we can also call this a
five by five receptive field for the neuron, because the receptive field is basically the field of vision, the region of the input, that this neuron is receiving, right, and so that's just another common term that you'll hear for this. And then again remember each
of these five by five filters we're sliding them over
the spatial locations but they're the same set of weights, they share the same parameters. Okay, and so, you know, as we talked about what we're going to get at this output is going to be this volume, right, where spatially we have,
you know, let's say 28 by 28 and then our number of
filters is the depth. And so for example with five filters, what we're going to
get out is this 3D grid that's 28 by 28 by five. And so if you look at the filters across in one spatial location
of the activation volume and going through depth
these five neurons, all of these neurons, basically the way you can interpret this is they're all looking at the same region in the input volume, but they're just looking
for different things, right. So they're different filters applied to the same spatial
location in the image. And so just a reminder
again kind of comparing with the fully connected layer
that we talked about earlier. In that case, right, if we
look at each of the neurons in our activation or
output, each of the neurons was connected to the
entire stretched out input, so it looked at the
entire full input volume, compared to now where each one just looks at this local spatial region. Question. [muffled talking] Okay, so the question
is, within a given layer, are the filters completely symmetric? So what do you mean by
symmetric exactly, I guess? Right, so okay, so the
filters, are the filters doing, they're doing the same dimension,
the same calculation, yes. Okay, so is there anything different other than they have the
same parameter values? No, so you're exactly right, we're just taking a filter
with a given set of, you know, five by five by three parameter values, and we just slide this
in exactly the same way over the entire input volume
to get an activation map. Okay, so you know, we've
gone into a lot of detail in what these convolutional
layers look like, and so now I'm just going to go briefly through the other layers that we have that form this entire
convolutional network. Right, so remember again,
we have convolutional layers interspersed with pooling
layers once in a while as well as these non-linearities. Okay, so what the pooling layers do is make the representations smaller and more manageable, right; we talked about this earlier when someone asked a question about why we would want to make the representation smaller. And so this is again so that you end up with fewer parameters at the end, as well as getting some invariance over a given region. And so what the pooling layer does is exactly that, it just downsamples: it takes your input
volume, so for example, 224 by 224 by 64, and
spatially downsamples this. So in the end you'll get out 112 by 112. And it's important to note
this doesn't do anything in the depth, right, we're
only pooling spatially. So the number of, your input depth is going to be the same
as your output depth. And so, for example, a common
way to do this is max pooling. So in this case our pooling
layer also has a filter size and this filter size is
going to be the region at which we pool over,
right, so in this case if we have two by two filters,
we're going to slide this, and so, here, we also have
stride two in this case, so we're going to take this filter and we're going to slide
it along our input volume in exactly the same way
as we did for convolution. But here instead of
doing these dot products, we just take the maximum value of the input volume in that region. Right, so here if we
look at the red region, the maximum value is six. If we look at the green region it's going to give an eight, and then we have a three and a four for the other two regions.
Yes, question. [muffled speaking] Yeah, so the question is, is it typical to set up the stride so that there isn't an overlap?
can think about this is basically we just want to downsample and so it makes sense to
kind of look at this region and just get one value
to represent this region and then just look at the
next region and so on. Yeah, question. [faint speaking] Okay, so the question
is, why is max pooling better than just taking the, doing something like average pooling? Yes, that's a good point,
like, average pooling is also something that you can do, and intuition behind why
max pooling is commonly used is that it can have
this interpretation of, you know, if this is, these
are activations of my neurons, right, and so each value is kind of how much this neuron
fired in this location, how much this filter
fired in this location. And so you can think of
max pooling as saying, you know, giving a signal of
how much did this filter fire at any location in this image. Right, and if we're
thinking about detecting, you know, doing recognition, this might make some intuitive
sense where you're saying, well, you know, whether a
light or whether some aspect of your image that you're looking for, whether it happens anywhere in this region we want to fire at with a high value. Question. [muffled speaking] Yeah, so the question is,
since pooling and stride both have the same effect of downsampling, can you just use stride
instead of pooling and so on? Yeah, and so in practice I think looking at more recent
neural network architectures people have begun to use stride more in order to do the downsampling
instead of just pooling. And I think this gets into
things like, you know, also like fractional strides
and things that you can do. But in practice this in a
sense maybe has a little bit better way to get better
results using that, so. Yeah, so I think using
stride is definitely, you can do it and people are doing it. Okay, so let's see, where were we. Okay, so yeah, so with
these pooling layers, so again, there's right, some
design choices that you make, you take this input volume of W by H by D, and then you're going to
set your hyperparameters for design choices of your filter size or the spatial extent over
which you are pooling, as well as your stride, and
then you can again compute your output volume using the
same equation that you used earlier for convolution, it
still applies here, right, so we still have our W total extent minus filter size divided
by stride plus one. Okay, and so just one other thing to note, it's also, typically people
don't really use zero padding for the pooling layers
because you're just trying to do a direct downsampling, right, so there isn't this problem of like, applying a filter at the corner and having some part of the
filter go off your input volume. And so for pooling we don't
usually have to worry about this and we just directly downsample. And so some common settings
for the pooling layer are a filter size of two by two with a stride of two, or three by three with a stride of two. You can still use a stride of two even with a filter size of three by three, so the regions overlap a little, I think someone asked about that earlier, but in practice it's pretty common just to have two by two. Okay, so now we've talked about
these convolutional layers, the ReLU layers were the
same as what we had before with the, you know, just
the base neural network that we talked about last lecture. So we intersperse these and
then we have a pooling layer every once in a while when we
feel like downsampling, right. And then the last thing is that at the end we want to have a fully connected layer. And so this will be just exactly the same as the fully connected layers
that you've seen before. So in this case now what we do is we take the convolutional
network output, at the last layer we have some volume, so we're going to have width
by height by some depth, and we just take all of these and we essentially just
stretch these out, right. And so now we're going
to get the same kind of, you know, basically 1D
input that we're used to for a vanilla neural network,
and then we're going to apply this fully connected layer on top, so now we're going to have connections to every one of these
convolutional map outputs.
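Concretely, this last step is just a flatten followed by the same kind of fully connected layer as before; here's a minimal sketch where the shapes are assumed just for illustration.

```python
import numpy as np

conv_out = np.random.randn(7, 7, 512)   # final conv/pool volume: width x height x depth (assumed)
x = conv_out.reshape(-1)                # stretch it out into a 1D vector

W = np.random.randn(10, x.size)         # fully connected: every output connects to every value
b = np.random.randn(10)
scores = W.dot(x) + b                   # final class scores
print(scores.shape)                     # (10,)
```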
And so the way to think about this is basically: before, we were preserving spatial structure, but at the last layer, at the end, we want to aggregate all of this together, and we want to reason on top of all of it as we had before. And so what you get from that is just our score outputs, as we had earlier. Okay, so-- - [Student] This is
sort of a silly question about this visual. Like what are the 16 pixels
that are on the far right, like what should be interpreting those as? - Okay, so the question
is, what are the 16 pixels that are on the far
right, do you mean the-- - [Student] Like that column of-- - [Instructor] Oh, each column. - [Student] The column
on the far right, yeah. - [Instructor] The green
ones or the black ones? - [Student] The ones labeled pool. - The one with hold on, pool. Oh, okay, yeah, so the question is how do we interpret this column,
right, for example at pool. And so what we're showing
here is each of these columns is the output activation maps, right, the output from one of these layers. And so starting from the
beginning, we have our car, after the convolutional layer we now have these activation
maps of each of the filters slid spatially over the input image. Then we pass that through a ReLU, so you can see the values
coming out from there. And then going all the way over, and so what you get for the pooling layer is that it's really just taking the output of the ReLU layer that came just before it
and then it's pooling it. So it's going to downsample it, right, and then it's going to take the max value in each filter location. And so now if you look at
this pool layer output, like, for example, the last
one that you were mentioning, it looks the same as this ReLU output except that it's downsampled
and that it has this kind of max value at every spatial location and so that's the minor difference that you'll see between those two. [distant speaking] So the question is, now this looks like just a very small amount
of information, right, so how can it know to
classify it from here? And so the way that you
should think about this is that each of these values inside one of these pool
outputs is actually, it's the accumulation of all
the processing that you've done throughout this entire network, right. So it's at the very top of your hierarchy, and so each actually represents kind of a higher level concept. So we saw before, you know,
for example, Hubel and Wiesel and building up these
hierarchical filters, where at the bottom level
we're looking for edges, right, or things like very simple
structures, like edges. And so after your convolutional layer the outputs that you see
here in this first column is basically how much do
specific, for example, edges, fire at different locations in the image. But then as you go through
you're going to get more complex, it's looking for more
complex things, right, and so the next convolutional layer is going to fire at how much, you know, let's say certain kinds of
corners show up in the image, right, because it's reasoning. Its input is not the original image, its input is the output, it's
already the edge maps, right, so it's reasoning on top of edge maps, and so that allows it to get more complex, detect more complex things. And so by the time you get all the way up to this last pooling layer,
each value is representing how much a relatively complex
sort of template is firing. Right, and so because of
that now you can just have a fully connected layer,
you're just aggregating all of this information together to get, you know, a score for your class. So each of these values is how much a pretty complicated
complex concept is firing. Question. [faint speaking] So the question is, when
do you know you've done enough pooling to do the classification? And the answer is you just try and see. So in practice, you know,
these are all design choices and you can think about this
a little bit intuitively, right, like you want to pool
but if you pool too much you're going to have very few values representing your entire image and so on, so it's just kind of a trade off. Something reasonable
versus people have tried a lot of different configurations so you'll probably cross validate, right, and try over different pooling sizes, different filter sizes,
different number of layers, and see what works best for
your problem because yeah, like every problem with
different data is going to, you know, different set of these sorts of hyperparameters might work best. Okay, so last thing, just
wanted to point you guys to this demo of training a ConvNet, which was created by Andrej Karpathy, the originator of this class.
train a ConvNet on CIFAR-10, the dataset that we've seen
before, right, with 10 classes. And what's nice about
this demo is you can, it basically plots for you
what each of these filters look like, what the
activation maps look like. So some of the images I showed earlier were taken from this demo. And so you can go try it
out, play around with it, and you know, just go through
and try and get a sense for what these activation maps look like. And just one thing to note, usually the first layer
activation maps are, you can interpret them, right, because they're operating
directly on the input image so you can see what these templates mean. As you get to higher level layers it starts getting really hard, like how do you actually
interpret what do these mean. So for the most part it's
just hard to interpret so you shouldn't, you know, don't worry if you can't really make
sense of what's going on. But it's still nice just
to see the entire flow and what outputs are coming out. Okay, so in summary, so
today we talked about how convolutional neural networks work, how they're basically stacks of these convolutional and pooling layers followed by fully connected
layers at the end. There's been a trend towards
having smaller filters and deeper architectures,
so we'll talk more about case studies for
some of these later on. There's also been a trend
towards getting rid of these pooling and fully
connected layers entirely. So just keeping these, just
having, you know, Conv layers, very deep networks of Conv layers, so again we'll discuss
all of this later on. And then typical architectures
again look like this, you know, as we had earlier. Conv, ReLU for some N number of steps followed by a pool every once in a while, this whole thing repeated
some number of times, and then followed by fully
connected ReLU layers that we saw earlier, you know, one or two or just a few of these, and then a softmax at the
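Written out as a rough sketch of that pattern (placeholder layer names only, not any particular network), it looks something like this.

```python
def conv_net_pattern(N=2, M=3, K=2):
    """[(conv -> relu) * N -> pool] * M -> (fc -> relu) * K -> fc -> softmax."""
    layers = []
    for _ in range(M):
        layers += ["conv", "relu"] * N   # a few conv/relu layers...
        layers += ["pool"]               # ...followed by an occasional (optional) pool
    layers += ["fc", "relu"] * K         # one or two fully connected layers
    layers += ["fc", "softmax"]          # final class scores
    return layers

print(conv_net_pattern())
```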
And so, you know, for some typical values, you might have N up to five of these. You're going to have pretty deep stacks of Conv, ReLU, pool
sequences, and then usually just a couple of these fully
connected layers at the end. But we'll also go into
some newer architectures like ResNet and GoogLeNet,
which challenge this and will give pretty different
types of architectures. Okay, thank you and
see you guys next time.