- All right welcome to lecture nine. So today we will be talking
about CNN Architectures. And just a few administrative points before we get started,
assignment two is due Thursday. The midterm will be in class on Tuesday May ninth, so next week, and it will cover material through this coming Thursday, May fourth. So everything up to recurrent neural networks is going to be fair game. The poster session
we've decided on a time, it's going to be Tuesday June sixth from twelve to three p.m. So this is the last week of classes. So we have our poster
session a little bit early during the last week so that after that, once you guys get feedback
you still have some time to work for your final report
which will be due finals week. Okay, so just a quick review of last time. Last time we talked
about different kinds of deep learning frameworks. We talked about you know
PyTorch, TensorFlow, Caffe2 and we saw that using
these kinds of frameworks we were able to easily build
big computational graphs, for example very large neural networks and ConvNets, and be able to really
easily compute gradients in these graphs. So to compute all of the
gradients for all the intermediate variables, weights, and inputs, and use that to train our models, and to run all this efficiently on GPUs. And we saw that for a
lot of these frameworks the way this works is by
working with these modularized layers that you guys have been writing in your homeworks as well, where we have a forward pass, we have a backward pass, and then in our final model architecture, all we need to do is just define the sequence of these layers together. So using that we're able
to very easily be able to build up very complex
network architectures. So today we're going to talk
about some specific kinds of CNN Architectures that are
used today in cutting edge applications and research. And so we'll go into depth
in some of the most commonly used architectures for
these that are winners of the ImageNet classification benchmark. So in chronological order, AlexNet, VGGNet, GoogLeNet, and ResNet. And so we'll go into these in a lot of depth. And then I'll also, after
that, briefly go through some other architectures that are not as prominently used these
days, but are interesting either from a historical perspective, or as recent areas of research. Okay, so just a quick review. We talked a long time ago about LeNet, which was one of the first instantiations of a ConvNet that was successfully used in practice. And so this was the ConvNet that took an input image, used conv filters, five by five filters applied at stride one, and
had a couple of conv layers, a few pooling layers and then
some fully connected layers at the end. And this fairly simple ConvNet
was very successfully applied to digit recognition. So AlexNet from 2012 which
you guys have also heard already before in previous classes, was the first large scale
convolutional neural network that was able to do well on
the ImageNet classification task so in 2012 AlexNet was
entered in the competition, and was able to outperform
all previous non deep learning based models
by a significant margin, and so this was the ConvNet that started the spree of ConvNet research and usage afterwards. And so the basic AlexNet architecture is a conv layer followed by a pooling layer and normalization, then conv, pool, norm again, and then a few more conv
layers, a pooling layer, and then several fully
connected layers afterwards. So this actually looks very
similar to the LeNet network that we just saw. There are just more layers in total. There are five of these conv layers, and two fully connected layers before the final fully connected
layer going to the output classes. So let's first get a sense
of the sizes involved in the AlexNet. So if we look at the input to the AlexNet this was trained on ImageNet, with inputs at a size 227 by 227 by 3 images. And if we look at this first
layer which is a conv layer for the AlexNet, it's 11 by 11 filters, 96 of these applied at stride 4. So let's just think
about this for a moment. What's the output volume
size of this first layer? And there's a hint. So remember we have our input size, we have our convolutional filters, right. And we have this formula,
which is the hint over here that gives you the size
of the output dimensions after applying a conv, right? So remember it was the full image, minus the filter size, divided by the stride, plus one. So given that that's written up here for you, does anyone have a guess at what's the final output size after this conv layer? [student speaks off mic] - So I heard 55 by 55 by 96, yep. That's correct. Right, so our spatial
dimensions at the output are going to be 55 in each
dimension and then we have 96 total filters so the
depth after our conv layer is going to be 96. So that's the output volume. And what's the total number
of parameters in this layer? So remember we have 96 11 by 11 filters. [student speaks off mic] - [Lecturer] 96 by 11 by 11, almost. So yes, so I had another by three, yes that's correct. So each of the filters is going to see through a local region
of 11 by 11 by three, right because the input depth was three. And so, that's each filter
size, times we have 96 of these total. And so there's 35K parameters
in this first layer. Okay, so now if we look
at the second layer this is a pooling layer
right, and in this case we have three by three filters applied at stride two. So what's the output volume
of this layer after pooling? And again we have a hint, very
similar to the last question. Okay, 27 by 27 by 96. Yes that's correct. Right so the pooling layer
is basically going to use this formula that we had here. Again because these are pooling
applied at a stride of two so we're going to use the
same formula to determine the spatial dimensions and
so the spatial dimensions are going to be 27 by
27, and pooling preserves the depth. So we had 96 as depth as input, and it's still going to be 96 depth at output. And next question. What's the number of
parameters in this layer? I hear some muttering. [student answers off mic] - Nothing. Okay. Yes, so pooling layer
has no parameters, so, kind of a trick question. Okay, so we can basically, yes, question? [student speaks off mic] - The question is, why are
there no parameters in the pooling layer? The parameters are the weights right, that we're trying to learn. And so convolutional layers
have weights that we learn but pooling all we do is have a rule, we look at the pooling region, and we take the max. So there's no parameters that are learned. So we can keep on doing
this and you can just repeat the process, and it's kind of a good exercise to go through this and figure out the sizes and the parameters at every layer.
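As an aside, here is a minimal sketch in plain Python of the bookkeeping we just did by hand; the layer sizes are the AlexNet conv1 and pool1 numbers from the slides, and the helper function name is just something I made up for illustration.

```python
# Minimal sketch of the size/parameter bookkeeping from the slides.
# conv_output_size implements (W - F) / S + 1 (no padding here).

def conv_output_size(input_size, filter_size, stride, pad=0):
    return (input_size + 2 * pad - filter_size) // stride + 1

# AlexNet CONV1: 96 filters of 11x11x3, stride 4, on a 227x227x3 input.
out = conv_output_size(227, 11, 4)          # 55
conv1_params = 96 * (11 * 11 * 3)           # weights only: 34,848 (~35K)
print(out, out, 96, conv1_params)           # 55 55 96 34848

# POOL1: 3x3 filters, stride 2 -- same size formula, zero parameters.
pool_out = conv_output_size(55, 3, 2)       # 27
print(pool_out, pool_out, 96, 0)            # 27 27 96 0
```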
And if you do this all the way through, you can look at the final architecture that we're working with: there's 11 by 11 filters at the beginning, then five by five and some three by three filters. And so these are generally
pretty familiar looking sizes that you've seen before
and then at the end we have a couple of fully connected layers of size 4096 and finally the last layer, is FC8 going to the soft max, which is going to the
1000 ImageNet classes. And just a couple of details about this, it was the first use of
the ReLU non-linearity that we've talked about
that's the most commonly used non-linearity. They used local response
normalization layers basically trying to
normalize the response across neighboring channels but this
is something that's not really used anymore. It turned out not to, other people showed that it didn't have so much of an effect. There's a lot of heavy data augmentation, and so you can look in the
paper for more details, but things like flipping,
jittering, cropping, color normalization all of these things which you'll probably
find useful for you when you're working on your
projects for example, so a lot of data augmentation here. They also used dropout, a batch size of 128, and learned with SGD with
momentum which we talked about in an earlier lecture,
and basically just started with a base learning
rate of 1e-2. Every time it plateaus,
reduce by a factor of 10, and then just keep going until they finish training. And a little bit of weight
decay and in the end, in order to get the best
numbers they also did an ensembling of models and
so training multiple of these, averaging them together and
this also gives an improvement in performance. And so one other thing I want to point out is that if you look at this
AlexNet diagram up here, it looks kind of like the
normal ConvNet diagrams that we've been seeing,
except for one difference, which is that it's, you
can see it's kind of split in these two different rows
or columns going across. And so the reason for this
is mostly a historical note: AlexNet was trained on GTX 580 GPUs, older GPUs that only had three gigs of memory. So it couldn't actually fit
this entire network on here, and so what they ended up doing, was they spread the
network across two GPUs. So on each GPU you would
have half of the neurons, or half of the feature maps. And so for example if you
look at this first conv layer, we have 55 by 55 by 96 output, but if you look at this diagram carefully, you can zoom in later in the actual paper, you can see that, it's actually only 48 depth-wise, on each GPU, and so they just spread
it, the feature maps, directly in half. And so what happens is that
for most of these layers, for example com one, two, four and five, the connections are only with feature maps on the same GPU, so you
would take as input, half of the feature maps
that were on the same GPU as before, and you don't
look at the full 96 feature maps for example. You just take as input the
48 in that first layer. And then there's a few
layers, so conv 3, as well as FC six, seven and eight, where the GPUs
do talk to each other and so there's connections
with all feature maps in the preceding layer. so there's communication across the GPUs, and each of these neurons
are then connected to the full depth of the
previous input layer. Question. - [Student] It says the
full simplified AlexNet architecture. [mumbles]
is why does it say full simplified AlexNet architecture here? It just says that because I
didn't put all the details on here, so for example this
is the full set of layers in the architecture, and
the strides and so on, but for example the normalization
layer, there's other, these details are not written on here. And then just one little note, if you look at the paper
and try and write out the math and architectures and so on, there's a little bit of
an issue on the very first layer they'll say if
you'll look in the figure they'll say 224 by 224, but there's actually something a little bit funny going on there, and so the
numbers actually work out if you look at it as 227. AlexNet was the winner of
the ImageNet classification benchmark in 2012, you can see that it cut the error rate
by quite a large margin. It was the first CNN-based winner, and it was a widely used base architecture almost ubiquitously from then
until a couple years ago. It's still used quite a bit. It's used in transfer learning
for lots of different tasks and so it was used for
basically a long time, and it was very famous and
now though there's been some more recent architectures
that have generally just had better performance
and so we'll talk about these next and these are going to be
the more common architectures that you'll be wanting to use in practice. So just quickly first in
2013 the ImageNet challenge was won by something called a ZFNet. Yes, question. [student speaks off mic] - So the question is intuition why AlexNet was so much better than
the ones that came before, DefLearning comNets [mumbles] this is just a very different kind of
approach in architecture. So this was the first deep
learning based approach first comNet that was used. So in 2013 the challenge
was won by something called a ZFNet (Zeiler-Fergus Net),
named after the creators. And so this mostly was
improving hyper parameters over the AlexNet. It had the same number of layers, the same general structure
and they made a few changes things like
changing the stride size, different numbers of filters
and after playing around with these hyper parameters more, they were able to improve the error rate. But it's still basically the same idea. So in 2014 there are a
couple of architectures that were now more significantly different and made another jump in performance, and the main difference with
these networks first of all was much deeper networks. So from the eight layer
network that was in 2012 and 2013, now in 2014 we
had two very close winners that were around 19 layers and 22 layers. So significantly deeper. And the winner of this
was GoogleNet, from Google but very close behind was
something called VGGNet from Oxford, and actually VGG got first place in the localization challenge and in some of the other tracks. So these were both very,
very strong networks. So let's first look at VGG
in a little bit more detail. And so the VGG network is the
idea of much deeper networks and with much smaller filters. So they increased the number of layers from eight layers in AlexNet
right to now they had models with 16 to 19 layers in VGGNet. And one key thing that they
did was they kept very small filters, so only three by three convs all the way through, which is basically the smallest conv filter size that is still looking at a little bit of the neighboring pixels. And they just kept this
very simple structure of three by three convs
with the periodic pooling all the way through the network. And it's very simple elegant
network architecture that was able to get 7.3% top five error on the ImageNet challenge. So first, the question of
why use smaller filters. So when we take these small filters, we now have fewer parameters per filter, and we try and stack more of them: instead of having larger filters, we have smaller filters but more of them, with more depth. And what happens is that you end up having the same effective receptive field as if you only had one seven by seven convolutional layer.
the effective receptive field of three of these three
by three conv layers with stride one? So if you were to stack three
three by three conv layers with stride one, what's the effective receptive field, the total spatial area of the input that a neuron at the top layer of the three layers is looking at? So I heard fifteen pixels,
why fifteen pixels? [student speaks off mic] - Okay, so the reason given was because they overlap. So it's on the right track.
though is you have to see, at the first layer, the
receptive field is going to be three by three, right? And then at the second layer, each of these neurons in the second layer is going to look at a three by three region of first-layer outputs, but the corners of that three by three region are looking at an additional pixel on each side in the original input layer. So the second layer is actually looking at a five by five receptive field, and then if you do this again, the third layer is looking at three by three in the second layer, but if you just draw out this pyramid, it's looking at seven by seven in the input layer. So the effective receptive field here is going to be seven by seven, which is the same as one
seven by seven conv layer. So what happens is that
this has the same effective receptive field as a
seven by seven conv layer but it's deeper. It's able to have more
non-linearities in there, and it's also fewer parameters. So if you look at the
total number of parameters, each of these conv filters for the three by threes is going to have nine parameters in each conv, three times three, and then times the input depth, so three times three times C, times the total number of output feature maps, which is again C since we're going to preserve the total number of channels. So you get three times three times C times C for each of these layers, and we have three layers, so it's going to be three times this number, 3 x (3^2 x C^2), compared to if you had a single seven by seven layer, where you get, by the same reasoning, seven squared times C squared, 7^2 x C^2. So you're going to have fewer parameters total, which is nice.
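Here is a quick sketch of that comparison in Python; the channel count C = 256 is just an example value I picked, and the receptive field helper assumes stride one everywhere, as in the slide.

```python
# Sketch: effective receptive field of stacked 3x3 convs (stride 1),
# and the 3x(3^2 C^2) vs 7^2 C^2 parameter comparison (biases ignored).

def receptive_field(num_layers, filter_size=3):
    rf = 1
    for _ in range(num_layers):
        rf += filter_size - 1        # each stride-1 3x3 layer adds 2 pixels
    return rf

print(receptive_field(3))            # 7 -- same as a single 7x7 conv

C = 256                               # example channel count, C -> C layers
params_three_3x3 = 3 * (3 * 3 * C * C)
params_one_7x7 = 7 * 7 * C * C
print(params_three_3x3, params_one_7x7)   # 1,769,472 vs 3,211,264
```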
So now if we look at this full network here, there's a lot of numbers up
here that you can go back and look at more carefully
but if we look at all of the sizes and number
of parameters the same way that we calculated the
example for AlexNet, this is a good exercise to go through, we can see that you
know going the same way we have a couple of these conv
layers and a pooling layer a couple more conv layers,
pooling layer, several more conv layers and so on. And so this just keeps going up. And if you counted the total
number of convolutional and fully connected layers,
we're going to have 16 in this case for VGG 16, and then VGG 19, it's just a very similar architecture, but with a few
more conv layers in there. And so the total memory
usage of this network, so just making a forward
pass through counting up all of these numbers so
the memory numbers here are written in terms of the total number of values, like we calculated earlier, and if you figure four bytes per number, this is going to be
about 100 megs per image, and so this is the scale
of the memory usage that's happening and this is
only for a forward pass right, when you do a backward pass
you're going to have to store more and so this is
pretty heavy memory wise. 100 megs per image, if
you only have five gigs of total memory, then
you're only going to be able to store about 50 of these. And so also the total number
of parameters here we have is 138 million parameters in this network, and this compares with
60 million for AlexNet. Question? [student speaks off mic] - So the question is what
do we mean by deeper, is it the number of
filters, number of layers? So deeper in this case is
always referring to layers. So there are two usages of the word depth, which is confusing: one is the depth of the channel dimension, width by height by depth, and you can use the word depth there, but in general when we talk about
the depth of a network, this is going to be the
total number of layers in the network, and usually in particular we're counting the total
number of weight layers. So the total number of
layers with trainable weight, so convolutional layers
and fully connected layers. [student mumbles off mic] - Okay, so the question
is, within each layer what do different filters need? And so we talked about this
back in the comNet lecture, so you can also go back and refer to that, but each filter is a set of
let's say three by three convs, so each filter is looking at a, is a set of weight looking at
a three by three value input input depth, and this
produces one feature map, one activation map of
all the responses of the different spatial locations. And then we have we can have
as many filters as we want right so for example 96 and each of these is going to produce a feature map. And so it's just like
each filter corresponds to a different pattern
that we're looking for in the input that we
convolve around and we see the responses everywhere in the input, we create a map of these
and then another filter will we convolve over the
image and create another map. Question. [student speaks off mic] - So question is, is
there intuition behind, as you go deeper into the network
we have more channel depth so more number of filters
right and so you can have any design that you want so
you don't have to do this. In practice you will see this
happen a lot of the times and one of the reasons is
people try and maintain kind of a relatively
constant level of compute, so as you go higher up or
deeper into your network, you're usually also basically down sampling and having a smaller total spatial area, and then you also increase the depth a little bit; it's not as expensive now to increase the depth because it's spatially smaller, and so, yeah, that's just a reason. Question. [student speaks off mic] - So performance-wise is
there any reason to use an SVM loss [mumbles] instead of a softmax [mumbles]; so no, for a classifier you can use either one, and you did that earlier in the class as well, but in general softmax losses have worked well and been the standard use for classification here. Okay yeah one more question. [student mumbles off mic]
is, we don't have to store all of the memory like we
can throw away the parts that we don't need and so on? And yes this is true. Some of this you don't need to keep, but you're also going to
be doing a backwards pass, where for the most part, when you're doing the chain rule and so on, you need
a lot of these activations as part of it and so in
large part a lot of this does need to be kept. So if we look at the distribution
of where memory is used and where parameters are,
you can see that a lot of memories in these early
layers right where you still have spatial dimensions you're
going to have more memory usage and then a lot of the
parameters are actually in the last layers, the
fully connected layers have a huge number of parameters right, because we have all of
these dense connections. And so that's something just to know and keep in mind; later on we'll see some networks actually get rid of these fully connected layers and be able to save a lot on the number of parameters.
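As a sanity check on those numbers, here is a rough sketch that tallies VGG-16 weight parameters layer by layer; the layer configuration is the standard published VGG-16 one, biases are ignored, and the point is that the total comes out around 138 million, dominated by the first fully connected layer.

```python
# Rough VGG-16 parameter tally (weights only, biases ignored).
# Each conv entry is (in_channels, out_channels) for a 3x3 filter.
convs = [(3, 64), (64, 64),
         (64, 128), (128, 128),
         (128, 256), (256, 256), (256, 256),
         (256, 512), (512, 512), (512, 512),
         (512, 512), (512, 512), (512, 512)]
conv_params = sum(3 * 3 * cin * cout for cin, cout in convs)

# Fully connected layers: FC6 sees the 7x7x512 POOL5 output.
fc_params = 7 * 7 * 512 * 4096 + 4096 * 4096 + 4096 * 1000

print(conv_params)              # ~14.7M
print(fc_params)                # ~123.6M -- most parameters live in the FCs
print(conv_params + fc_params)  # ~138M total
```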
And then just one last thing to point out: you'll also see different ways of naming all of these layers. So here I've written out exactly what the layers are; conv3-64 means three by three convs with 64 total filters. But for VGGNet, on this diagram on the right here, there's also a common way that people will refer to each group of filters, so each orange block here, as conv1 part one, so conv1-1, conv1-2, and so on. So just something to keep in mind. So VGGNet ended up getting
second place in the ImageNet 2014 classification challenge, first in localization. They followed a very
similar training procedure as Alex Krizhevsky for the AlexNet. They didn't use local
response normalization, so as I mentioned earlier, they found out this
didn't really help them, and so they took it out. You'll see VGG 16 and VGG
19 are common variants of this architecture, and this is just the number of layers, 19
is slightly deeper than 16. In practice VGG 19 works
a tiny bit better, and there's a little
bit more memory usage, so you can use either but
16 is very commonly used. For best results, like
AlexNet, they did ensembling in order to average several models, and you get better results. And they also showed in their work that the FC7 features of the last
fully connected layer before going to the 1000 ImageNet classes. The 4096 size layer just before that, is a good feature representation, that can even just be used as is, to extract these features from other data, and generalized these other tasks as well. And so FC7 is a good
feature representation. Yeah question. [student speaks off mic] - Sorry what was the question? Okay, so the question is
what is localization here? And so this is a task,
and we'll talk about it a little bit more in a later lecture on detection and localization
so I don't want to go into detail here but
it's basically, given an image, not just classifying what's
the class of the image, but also drawing a bounding
box around where that object is in the image. And the difference with detection, which is a very related
task, is that in detection there can be multiple instances of this object in the image, while in localization we're assuming there's just one. So it's classification, but we also have this additional bounding box. So we looked at VGG which
was one of the deep networks from 2014 and then now
we'll talk about GoogleNet which was the other one that won the classification challenge. So GoogleNet again was
a much deeper network with 22 layers but one
of the main insights and special things about
GoogleNet is that it really looked at this problem of
computational efficiency and it tried to design a
network architecture that was very efficient in the amount of compute. And so they did this using
this inception module which we'll go into more
detail and basically stacking a lot of these inception
modules on top of each other. There's also no fully connected
layers in this network, so they got rid of that
were able to save a lot of parameters and so in total
there's only five million parameters which is twelve
times less than AlexNet, which had 60 million even
though it's much deeper now. It got 6.7% top five error. So what's the inception module? So the idea behind the inception module is that they wanted to design
a good local network topology, and it has this idea
of this local topology that's you know you can
think of it as a network within a network and
then stack a lot of these local topologies one on top of each other. And so in this local
network that they're calling an inception module what they're
doing is they're basically applying several different
kinds of filter operations in parallel on top of the
same input coming into this same layer. So we have our input coming
in from the previous layer and then we're going to do
different kinds of convolutions. So a one by one conv, right
a three by three conv, five by five conv, and then they also have a pooling operation
in this case three by three pooling, and so you get
all of these different outputs from these different layers, and then what they do is
they concatenate all these filter outputs together depth wise, and so then this creates one
tensor output at the end that is going to pass
on to the next layer. So if we look at just a
naive way of doing this we just do exactly that we
have all of these different operations we get the outputs
we concatenate them together. So what's the problem with this? And it turns out that
computational complexity is going to be a problem here. So if we look more
carefully at an example, so here, just as an example, I've put a one by one conv with 128 filters, a three by three conv with 192 filters, and a five by five conv with 96 filters. Assume everything basically has the stride that's going to maintain
the spatial dimensions, and that we have this input coming in. So what is the output size
of the one by one filter with 128 , one by one
conv with 128 filters? Who has a guess? OK so I heard 28 by 28,
by 128, which is correct. So right, with a one by one conv
we're going to maintain spatial dimensions and
then on top of that, each conv filter is going to look through the entire 256 depth of the input, but then the output is going to be, we have a 28 by 28 feature map for each of the 128 filters that we have in this conv layer. So we get 28 by 28 by 128. OK and then now if we do the same thing and we look at the filter
sizes, sorry, the output sizes of all of the different filters here: after the three by three conv we're going to have this volume of 28 by 28 by 192; after the five by five conv we have 96 filters here, so 28 by 28 by 96; and then our pooling layer is just going to keep the same spatial dimension here, so the pooling layer will preserve the depth, and here because of our stride, we're also going to preserve
our spatial dimensions. And so now if we look at
the output size after filter concatenation what we're
going to get is 28 by 28, these are all 28 by 28, and
we concatenating depth wise. So we get 28 by 28 times
all of these added together, and the total output size is going to be 28 by 28 by 672. So the input to our
inception module was 28 by 28 by 256, then the output
from this module is 28 by 28 by 672. So we kept the same spatial dimensions, and we blew up the depth. Question. [student speaks off mic] OK So in this case, yeah, the question is, how are we getting 28
by 28 for everything? So here we're doing all the zero padding in order to maintain
the spatial dimensions, and that way we can do this filter concatenation depth-wise. Question in the back. [student speaks off mic] - OK The question is what's
the 256 deep at the input, and so this is not the
input to the network, this is the input just
to this local module that I'm looking at. So in this case 256 is
the depth of the previous inception module that
came just before this. And so now coming out
we have 28 by 28 by 672, and that's going to be
the input to the next inception module. Question. [student speaks off mic] - Okay the question is, how
did we get 28 by 28 by 128 for the first one, the first conv? And this is basically a one by one convolution, right, so we're going to take this one by one convolution and slide it across our 28 by 28 by 256 input spatially, where at each location it's going to do a dot product through the entire 256 depth, and so we do this one by one conv, slide it over spatially, and we get a feature map
out that's 28 by 28 by one. There's one number at each
spatial location coming out, and each filter produces
one of these 28 by 28 by one maps, and we have
here a total 128 filters, and that's going to
produce 28 by 28, by 128. OK so if you look at
the number of operations that are happening in
the convolutional layer, let's look at the first one for
example this one by one conv: as I was just saying, at each location we're doing a one by one by 256 dot product. So there's 256 multiply operations happening here, and then for each filter map we have 28 by 28 spatial locations, so that's the 28 times 28, the first two numbers that are multiplied here. These are the spatial locations for each filter map, and so we have to do these 256 multiplications at each one of these locations, and then we have 128 total filters at this layer, or we're
producing 128 total feature maps. And so the total number
of these operations here is going to be 28 times 28 times 128 times 256. And so this is going to be the same for, you can think about this
for the three by three conv, and the five by five conv,
that's exactly the same principle. And in total we're going to
get 854 million operations that are happening here.
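As a quick sanity check, here is a small sketch of that op count for the naive module, using the example filter numbers from the slide; I'm counting one multiply per output value per element of the filter volume, and ignoring additions and biases.

```python
# Sketch: multiply counts for the naive inception example on the slide.
# Input to the module: 28x28x256.

def conv_ops(out_h, out_w, num_filters, filter_h, filter_w, in_depth):
    return out_h * out_w * num_filters * filter_h * filter_w * in_depth

ops = (conv_ops(28, 28, 128, 1, 1, 256)    # 1x1 conv, 128 filters
       + conv_ops(28, 28, 192, 3, 3, 256)  # 3x3 conv, 192 filters
       + conv_ops(28, 28, 96, 5, 5, 256))  # 5x5 conv, 96 filters
print(ops)  # 854,196,224 -- roughly 854 million multiplies
# Inserting 1x1 bottleneck convs before the 3x3 and 5x5 (coming up next)
# shrinks the in_depth those expensive convs see, cutting this way down.
```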
- [Student] And the 128, 192, and 96, are those just values [mumbles] - The question is whether the 128, 192 and 96 are values that I picked. Yes, but these are not values that I just came up with arbitrarily. They are similar to the ones that you will see in a particular layer of the inception net, so in GoogLeNet basically,
parameters, and I picked one that was similar to one of these. And so this is very expensive
computationally right, these these operations. And then the other thing
that I also want to note is that the pooling layer also
adds to this problem because it preserves the whole feature depth. So at every layer your total
depth can only grow right, you're going to take
the full feature depth from your pooling layer, as
well as all the additional feature maps from the conv
layers and add these up together. So here our input was 256
depth and our output is 672 depth and you're just
going to keep increasing this as you go up. So how do we deal with this
and how do we keep this more manageable? And so one of the key
insights that GoogLeNet used was that, well, we can address this by using bottleneck layers, and try and project these feature maps to a lower dimension before our convolutional operations, so before our expensive layers. And so what exactly does that mean? So as a reminder, a one by one
convolution, I guess we were just going through
this but it's taking your input volume, it's performing a
dot product at each spatial location and what it does is
it preserves spatial dimension but it reduces the depth and
it reduces that by projecting your input depth to a lower dimension. It just takes it's basically
like a linear combination of your input feature maps. And so this main idea is
that it's projecting your depth down and so the inception module takes these one by one convs
and adds these at a bunch of places in these modules
where there's going to be, in order to alleviate
this expensive compute. So before the three by three
and five by five conv layers, it puts in one of these
one by one convolutions. And then after the
pooling layer it also puts an additional one by one convolution. Right so these are the one
by one bottleneck layers that are added in. And so how does this change the math that we were looking at earlier? So now basically what's
happening is that we still have the same input here 28 by 28 by 256, but these one by one convs
are going to reduce the depth dimension and so you can see
before the three by three convs, if I put a one by
one conv with 64 filters, my output from that is going to be, 28 by 28 by 64. So instead of now going into
the three by three convs afterwards instead of 28
by 28 by 256 coming in, we only have a 28 by 28,
by 64 block coming in. And so this is now
a smaller input going into these conv
layers, the same thing for the five by five conv, and
then for the pooling layer, after the pooling comes
out, we're going to reduce the depth after this. And so, if you work out
the math the same way for all of the convolutional ops here, adding in now all these one by one convs on top of the three by
threes and five by fives, the total number of operations
is 358 million operations, so it's much less than the
854 million that we had in the naive version, and
so you can see how you can use this one by one
conv, and the filter size for that to control your computation. Yes, question in the back. [student speaks off mic] - Yes, so the question
is, have you looked into what information might be
lost by doing this one by one conv at the beginning. And so there might be
some information loss, but at the same time if
you're doing these projections you're taking a linear
combination of these input feature maps which has redundancy in them, you're taking combinations of them, and you're also introducing
an additional non-linearity after the one by one
conv, so it also actually helps in that way with
adding a little bit more depth and so, I don't think
there's a rigorous analysis of this, but basically in
general this works better and there's reasons why it helps as well. OK so here we have, we're
basically using these one by one convs to help manage our
computational complexity, and then what GoogLeNet
does is it takes these inception modules and it's going to stack all these together. So this is a full inception architecture. And if we look at this a
little bit more detail, so here I've flipped it, because it's so big, it's not going to fit vertically any more on the slide. So what we start with is
we first have this stem network, so this is more
the kind of vanilla, plain conv net that we've seen earlier, just a sequence of layers. So conv, pool, a couple of convs and another pool just to get started, and then after that we have all of our multiple inception modules stacked on top of each other, and then on top we have
our classifier output. And notice here that
they've really removed the expensive fully connected layers; it turns out that the model works great without them, and you reduce a lot of parameters. And then what they also have here is, you can see these couple
of extra stems coming out and these are auxiliary
classification outputs and so these are also you know
just little mini networks with an average pooling,
a one by one conv, a couple of fully connected
layers here going to a softmax, so also a 1000-way softmax over the ImageNet classes. And so you're actually
using your ImageNet training classification loss in
three separate places here. The standard end of the
network, as well as in these two places earlier on in
the network, and the reason they do that is just
this is a deep network and they found that having
these additional auxiliary classification outputs,
you get more gradient signal injected at the earlier layers, and so more helpful signal flowing in, because these intermediate
layers should also be helpful. You should be able to do classification based off some of these as well. And so this is the full architecture, there's 22 total layers
with weights and so within each of these modules
each of those one by one, three by three, five by
five is a weight layer, just including all of
these parallel layers, and in general it's a relatively
more carefully designed architecture and part of this
is based on some of these intuitions that we're talking
about and part of them also is just you know
Google the authors they had huge clusters and they're
cross validating across all kinds of design
choices and this is what ended up working well. Question? [student speaks off mic] - Yeah so the question is,
are the auxiliary outputs actually useful for the
final classification, to use these as well? I think when they're training them they do average all these
for the losses coming out. I think they are helpful. I can't remember if in
the final architecture, whether they average all
of these or just take one, it seems very possible that
they would use all of them, but you'll need to check on that. [student speaks off mic] - So the question is for
the bottleneck layers, is it possible to use some
other types of dimensionality reduction and yes you can use
other kinds of dimensionality reduction. The benefits here of
this one by one conv is, you're getting this effect,
but it's all, you know, it's a conv layer just like any other. It's part of the whole network, and you just train the full network with backprop through everything, and it's learning how to combine the previous feature maps. Okay yeah, question in the back. [student speaks off mic]
are any weights shared or all they all separate and yeah, all of these layers have separate weights. Question. [student speaks off mic] - Yes so the question is why do we have to inject gradients at earlier layers? So our classification
output at the very end, where we get a gradient on this, it's passed all the way back
through the chain rule, but the problem is when
you have very deep networks and you're going all the
way back through these, some of this gradient
signal can become minimized and lost closer to the beginning,
and so that's why having these additional ones in earlier parts can help provide some additional signal. [student mumbles off mic] - So the question is are you
doing back prop all the times for each output. No it's just one back
prop all the way through, and you can think of these three, you can think of there being kind of like an addition at the end
of these if you were to draw up your computational
graph, and so you get your final signal and you can
just take all of these gradients and backprop them all the way through. So it's as if they were added together at the end in a computational graph.
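To make that concrete, here is a tiny sketch of what that one backward pass through all three outputs looks like: the main loss and the two auxiliary losses are just summed into one scalar before calling backward. The 0.3 discount on the auxiliary losses is the weighting reported in the GoogLeNet paper; the logits tensors here are random placeholders standing in for the three classifier heads.

```python
import torch

# Toy sketch: three classification losses (main + two auxiliary heads)
# combined into one scalar, so a single backward pass sends gradient
# into the earlier layers through the auxiliary heads as well.
criterion = torch.nn.CrossEntropyLoss()
labels = torch.randint(0, 1000, (8,))                  # fake batch of labels
main_logits = torch.randn(8, 1000, requires_grad=True)
aux1_logits = torch.randn(8, 1000, requires_grad=True)
aux2_logits = torch.randn(8, 1000, requires_grad=True)

total_loss = (criterion(main_logits, labels)
              + 0.3 * criterion(aux1_logits, labels)
              + 0.3 * criterion(aux2_logits, labels))
total_loss.backward()    # one backprop through the whole graph
```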
OK, so in the interest of time, because we still have a lot to get through, I can take other questions offline. Okay, so GoogLeNet: basically 22 layers, it has an efficient inception module, there are no fully connected layers, 12 times fewer parameters than AlexNet, and it's the ILSVRC 2014
classification winner. And so now let's look at the 2015 winner, which is the ResNet network, and so here the idea is really this revolution of depth, right. We were starting to increase depth in 2014, and here we now have this hugely deeper model, at 152 layers, which was the ResNet architecture. And so now let's look at that
in a little bit more detail. So the ResNet architecture,
is getting extremely deep networks, much deeper
than any other networks before and it's doing this using this idea of residual connections
which we'll talk about. And so, they had 152
layer model for ImageNet. They were able to get 3.57% top 5 error with this, and the really special thing is that they swept all classification and detection contests in the ImageNet benchmark
and this other benchmark called COCO. It just basically won everything. So it was just clearly
better than everything else. And so now let's go into a
little bit of the motivation behind ResNet and residual connections that we'll talk about. And the question that they
started off by trying to answer is what happens when we try
and stack deeper and deeper layers on a plain
convolutional neural network? So if we take something like VGG or some normal network that's
just stacks of conv and pool layers on top of each
other can we just continuously extend these, get deeper
layers, and just do better? And the answer is no. So if you look at what happens when you get deeper, so here
I'm comparing a 20 layer network and a 56 layer network
and so this is just a plain kind of network you'll see
that in the test error here on the right the 56 layer
network is doing worse than the 20 layer network. So the deeper network was
not able to do better. But then the really weird thing is now if you look at the training error right we here have again the 20 layer network and a 56 layer network. The 56 layer network, one of
the obvious problems you might think of is, I have a really deep network, I have tons of parameters, maybe it's probably starting to overfit at some point. But what actually happens is
that when you're overfitting you would expect to have very good, very low training error,
and just bad test error, but what's happening here is
that in the training error the 56 layer network is
also doing worse than the 20 layer network. And so even though the
deeper model performs worse, this is not caused by over-fitting. And so the hypothesis
of the ResNet creators is that the problem is actually
an optimization problem. Deeper models are just harder to optimize, than more shallow networks. And the reasoning was that well, a deeper model should be
able to perform at least as well as a shallower model. You can have actually a
solution by construction where you just take the learned layers from your shallower model, you just copy these over and then
for the remaining additional deeper layers you just
add identity mappings. So by construction this
should be working at least as well as the shallower model. And yet our deeper models weren't able to learn this properly, even though they should be able to learn at least this. And so motivated by
this their solution was well how can we make it
easier for our architecture, our model to learn these
kinds of solutions, or at least something like this? And so their idea is well
instead of just stacking all these layers on top
of each other and having every layer try and learn
some underlying mapping of a desired function, let's
instead have these blocks, where we try and fit a residual mapping, instead of a direct mapping. And so what this looks
like is here on the right, where the input to this block is just the input coming in, and here, on the side, we're going to use our layers to try and fit some residual, our desired H(x) minus x, instead of the desired function H(x) directly. And so basically at the end of this block we have the skip connection on the right here, this loop, where we just take our input and pass it through as an identity, and so if we had no weight layers in between, the output would just be the identity, it would be the same thing as the input, but now we use our additional weight layers to learn some delta, some residual from our x. And so now the output
of this is going to be just our original x plus some residual, which we're going to call F(x). It's basically a delta, and so the idea is that now it should be easy, for example in the case where the identity is ideal, to just squash all of these weights of F(x) from our weight layers, just set them all to zero for example, and then we're just going to get the identity as the output, and we can get something, for example, close to this
solution by construction that we had earlier. Right, so this is just
a network architecture that says okay, let's have our weight layers fit the residual; that way the output will more likely be something close to X, it's just modifying X, rather than having to learn exactly this full mapping of what it should be.
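Just to make the F(x) + x idea concrete, here is a minimal PyTorch-style sketch of a basic residual block, assuming the input and output have the same shape so the identity can be added directly; the two 3 by 3 conv layers match the basic block we'll see in a moment, but batch norm and other details are left out, so treat it as a sketch rather than the exact published block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    """Minimal sketch: out = ReLU(F(x) + x), with F = two 3x3 convs."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        residual = self.conv2(F.relu(self.conv1(x)))  # F(x), the "delta"
        return F.relu(residual + x)                   # add the identity back

block = BasicResidualBlock(64)
y = block(torch.randn(1, 64, 56, 56))   # same shape in and out
```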
Okay, any questions about this? [student speaks off mic] - The question is, are these the same dimension? So yes, these two paths are the same dimension. In general either it's the same dimension,
do is they have these projections and shortcuts
and they have different ways of padding to make things work
out to be the same dimension. Depth wise. Yes - [Student] When you use the word residual you were talking about [mumbles off mic] - So the question is what
exactly do we mean by residual, is this output of this transformation the residual? So we can think of our output
here right as this F of X plus X, where F of X is the
output of our transformation and then X is our input,
just passed through by the identity. So with a plain layer, what we're trying to do is learn something like H(x), but what we saw earlier is that it's hard to learn a good H(x) as we
get very deep networks. And so here the idea is
let's try and break it down instead as H(x) is equal to F(x) plus x, and let's just try and learn F(x).
directly this H of X we just want to learn what
is it that we need to add or subtract to our input as
we move on to the next layer. So you can think of it as
kind of modifying this input, in place in a sense. We have-- [interrupted by student mumbling off mic] - The question is, when we're
saying the word residual are we talking about F of X? Yeah. So F of X is what we're
calling the residual. And it just has that meaning. Yes another question. [student mumbles off mic] - So the question is in
practice do we just sum F of X and X together, or
do we learn some weighted combination and you just do a direct sum. Because when you do a direct sum, this is the idea of let
me just learn what is it I have to add or subtract onto X. Is this clear to everybody,
the main intuition? Question. [student speaks off mic] - Yeah, so the question
is not clear why is it that learning the
residual should be easier than learning the direct mapping? And so this is just their hypotheses, and a hypotheses is that if
we're learning the residual you just have to learn
what's the delta to X, right? And if our hypothesis is that generally even something like our
solution by construction, where we had some number
of these shallow layers that were learned and we had
all these identity mappings at the top this was a
solution that should have been good, and so that implies that
maybe a lot of these layers, actually something just close to identity, would be a good layer And so because of that,
now we formulate this as being able to learn the identity plus just a little delta. And if really the identity
is best, we just squash the F(x) transformation to be zero, which is something that
things that are close to identity mappings. And so again this is not
something that's necessarily proven or anything it's just
the intuition and hypothesis, and then we'll also see
later some works where people are actually trying to
challenge this and say oh maybe it's not actually the residuals
that are so necessary, but at least this is the
hypothesis for this paper, and in practice using this model, it was able to do very well. Question. [student speaks off mic] - Yes so the question is
have people tried other ways of combining the inputs
from previous layers and yes so this is basically a very
active area of research on and how we formulate
all these connections, and what's connected to what
in all of these structures. So we'll see a few more
examples of different network architectures briefly later
but this is an active area of research. OK so we basically have all
of these residual blocks that are stacked on top of each other. We can see the full resident architecture. Each of these residual blocks
has two three by three conv layers as part of this block
and there's also been work just saying that this happens
to be a good configuration that works well. We stack all these blocks
together very deeply. Another thing about this very deep architecture is that it basically enables up to 150 layers deep of this, and then what we do is we stack all these, and periodically we also double the number of filters and down sample spatially using stride two when we do that. And then we have this additional conv layer at the very beginning of our network, and at the end we also, here, don't have any fully connected layers, and we just have a global
average pooling layer that's going to average
over everything spatially, and then be input into the
last 1000 way classification. So this is the full ResNet architecture and it's very simple and
elegant just stacking up all of these ResNet blocks
on top of each other, and they have total depths
of up to 34, 50, 100, and they tried up to 152 for ImageNet. OK so one additional
thing just to know is that for a very deep network,
so the ones that are more than 50 layers deep, they
also use bottleneck layers similar to what GoogLeNet did in order to improve efficiency, and so within each block now, what they did is have this one by one conv filter that first projects it down to a smaller depth. So again if we are looking at let's say a 28 by 28 by 256 input, we do
this one by one conv, and it's projecting the depth down. We get 28 by 28 by 64. Now your convolution, your three by three conv, and here they only have one, is operating over this reduced depth, so it's going to be less expensive, and then afterwards they have another one by one conv that projects the depth back up to 256, and so this is the actual block that you'll see in the deeper networks.
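Here is a rough sketch of that bottleneck version, with the 256 to 64 and back to 256 channel counts from the example; again batch norm and the downsampling variants are left out to keep it short, so this is an illustration rather than the exact published module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckBlock(nn.Module):
    """Sketch of a bottleneck residual block: 1x1 down, 3x3, 1x1 back up."""
    def __init__(self, channels=256, bottleneck=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, bottleneck, 1)             # 256 -> 64
        self.conv = nn.Conv2d(bottleneck, bottleneck, 3, padding=1)  # cheap 3x3
        self.expand = nn.Conv2d(bottleneck, channels, 1)             # 64 -> 256

    def forward(self, x):
        out = F.relu(self.reduce(x))
        out = F.relu(self.conv(out))
        out = self.expand(out)
        return F.relu(out + x)       # skip connection adds the input back

block = BottleneckBlock()
y = block(torch.randn(1, 256, 28, 28))   # 28x28x256 in, 28x28x256 out
```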
So in practice, ResNet also uses batch normalization after every conv layer; they use Xavier initialization with an extra scaling factor that they introduced to improve the initialization, and they trained with SGD + momentum. For their learning rate they
use a similar learning rate type of schedule where you
decay your learning rate when your validation error plateaus. Mini batch size 256, a
little bit of weight decay and no drop out. And so experimentally they
were able to show that they were able to train these
very deep networks, without degrading. They were able to have
basically good gradient flow coming all the way back
down through the network. They tried up to 152 layers on ImageNet, 1200 on Cifar, which is a,
you have played with it, but a smaller data set
and they also saw that now you're deeper networks are
able to achieve lower training errors as expected. So you don't have the same strange plots that we saw earlier where the behavior was in the wrong direction. And so from here they were
able to sweep first place at all of the ILSVRC competitions, and all of the COCO competitions in 2015, by significant margins. Their top five error was 3.6% for classification, and this is actually better than human performance on ImageNet. There was also a human metric that came from, actually, our lab: Andrej Karpathy spent like a week training himself and then basically did this task himself, and was I think somewhere around 5-ish percent, and so the network was basically able to do better than that human at least. Okay, so these are kind
of the main networks that have been used recently. We had AlexNet starting off first, VGG and GoogLeNet are still very popular, but ResNet is the most recent, best performing model; if you're looking at training a new network, ResNet is widely available and you should try working with it. So just quickly looking at
some of this getting a better sense of the complexity involved. So here we have some
plots that are sorted by performance so this is
top one accuracy here, and higher is better. And so you'll see a lot
of these models that we talked about, as well as
some different versions of them so, this
GoogLeNet inception family, I think there's like V2,
V3 and the best one here is V4, which is actually
a ResNet plus inception combination, so these are just kind of more incremental, smaller
changes that they've built on top of them,
and so that's the best performing model here. And if we look on the
right, these plots of their computational complexity here it's sorted. The Y axis is your top one accuracy so higher is better. The X axis is your operations
and so the more to the right, the more ops you're doing,
the more computationally expensive and then the bigger the circle, your circle is your memory usage, so the gray circles are referenced here, but the bigger the circle
the more memory usage and so here we can see
that VGG, these green ones, are kind of the least efficient. They have the biggest memory and the most operations, but they do pretty well. GoogLeNet is the most efficient here. It's way down on the operation side, as well as a small little
circle for memory usage. AlexNet, our earlier
model, has lowest accuracy. It's relatively smaller compute, because it's a smaller network, but
it's also not particularly memory efficient. And then ResNet here, we
have moderate efficiency. It's kind of in the middle,
both in terms of memory and operations, and it
has the highest accuracy. And so here also are
some additional plots. You can look at these
more on your own time, but this plot on the left is
showing the forward pass time and so this is in milliseconds
and you can see up at the top, a VGG forward pass is about 200 milliseconds, so you can get about five frames per second with this, and this is sorted in order. There's also this plot on
the right looking at power consumption and if you look
more at this paper here, there's further analysis of
these kinds of computational comparisons. So these were the main
architectures that you should really know in-depth and be familiar with, and be thinking about actively using. But now I'm going just
to go briefly through some other architectures
that are just good to know either historical inspirations or more recent areas of research. So the first one Network in Network, this is from 2014, and
the idea behind this is that we have these
vanilla convolutional layers but we also have these,
this introduces the idea of MLP conv layers they call
it, which are micro networks, or basically network within network, the name of the paper. Where within each conv
layer, they stack an MLP with a couple of fully connected layers on top of just the standard conv, to be able to compute more abstract features for these local patches, right. So instead of sliding
just a conv filter around, it's sliding a slightly
more complex hierarchical set of filters around
and using that to get the activation maps. And so, it uses these fully connected, or basically one by one
conv kind of layers. It's going to stack them all up like the bottom diagram here where
we just have these networks within networks stacked
in each of the layers. And the main reason to know this is just it was kind of a precursor
to GoogLeNet and ResNet in 2014 with this idea
of bottleneck layers that you saw used very heavily in there. And it also had a little bit
of philosophical inspiration for GoogLeNet for this idea of a local network topology, a network within a network, that they also used, with a different kind of structure. Now I'm going to talk
about a series of works since ResNet that are mostly geared towards improving ResNet, and so this is more recent research that has been done since then. I'm going to go over these pretty fast, and so just at a very high level. If you're interested in
any of these you should look at the papers, to have more details. So the authors of ResNet
a little bit later on in 2016 also had this paper
where they improved the ResNet block design. And so they basically
adjusted what were the layers that were in the ResNet block path, and showed this new
structure was able to have a more direct path in order
for propagating information throughout the network,
and you want to have a good path to propagate
information all the way up, and then back up all the way down again. And so they showed that this
new block was better for that and was able to give better performance. There's also a Wide Residual
networks which this paper argued that while ResNets
made networks much deeper as well as added these
residual connections and their argument was
that residuals are really the important factor. Having this residual construction, and not necessarily having
extremely deep networks. And so what they did was they
used wider residual blocks, and so what this means is
just more filters in every conv layer. So before we might have
F filters per layer, and they use this widening factor K and said well, every layer is going to have F times K filters instead. And so, using these
wider layers they showed that their 50 layer wide
ResNet was able to outperform the 152 layer original ResNet, and it also has the additional advantage that, even with the same number of parameters, it's more computationally efficient, because you can parallelize these wider operations more easily. Right, widening is just convolutions with more filters spread across more kernels, whereas depth is more sequential, so it's more computationally efficient to increase your width. So here you can see this work is starting to try to understand the contributions of width and depth and residual connections, and making some arguments for one versus the other.
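To make the widening factor concrete, here's a minimal sketch of a widened residual block, where a base width F is just multiplied by K. The base width and K value below are made-up numbers for illustration, not anything taken from the paper.

```python
import torch.nn as nn

class WideResidualBlock(nn.Module):
    # Same structure as a basic residual block, just with F * K filters
    # in each conv instead of F.
    def __init__(self, in_channels, base_width, k):
        super().__init__()
        width = base_width * k  # the widening factor K
        self.residual = nn.Sequential(
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, width, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(width),
            nn.ReLU(inplace=True),
            nn.Conv2d(width, width, kernel_size=3, padding=1, bias=False),
        )
        # Match channels on the shortcut if the width changed.
        self.shortcut = (nn.Identity() if in_channels == width
                         else nn.Conv2d(in_channels, width, kernel_size=1, bias=False))

    def forward(self, x):
        return self.shortcut(x) + self.residual(x)

block = WideResidualBlock(in_channels=16, base_width=16, k=8)  # e.g. 16 -> 128 filters
```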
And this other paper from around the same time, I think maybe a little bit later, is ResNeXt, and this is again the creators of ResNet continuing to work on pushing the architecture. And here they also had
this idea of okay, let's indeed tackle this width
thing more, but instead of just increasing the width of the residual block through more filters, they add structure. And so within each residual block there are multiple parallel pathways, and they're going to call the total number of these
pathways the cardinality. And so it's basically
taking the one ResNet block with the bottlenecks and having
it be relatively thinner, but having multiple of
these done in parallel. And so here you can see that this has some relation to the idea of wide networks, and it also has some connection to the Inception module, right, where we have these layers operating in parallel, and so ResNeXt has some flavor of that as well.
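Here's a rough sketch of what one of these multi-pathway blocks could look like. In practice the parallel pathways are often written as a single grouped convolution, where the number of groups plays the role of the cardinality; the channel sizes and cardinality below are placeholders rather than the paper's exact settings.

```python
import torch.nn as nn

class ResNeXtStyleBlock(nn.Module):
    # Bottleneck block where the 3x3 conv is split into `cardinality`
    # thin parallel pathways via a grouped convolution.
    def __init__(self, channels, bottleneck_width, cardinality):
        super().__init__()
        inner = bottleneck_width * cardinality
        self.residual = nn.Sequential(
            nn.Conv2d(channels, inner, kernel_size=1, bias=False),
            nn.BatchNorm2d(inner),
            nn.ReLU(inplace=True),
            # groups=cardinality means `cardinality` independent 3x3 convs,
            # each seeing only its own slice of the channels.
            nn.Conv2d(inner, inner, kernel_size=3, padding=1,
                      groups=cardinality, bias=False),
            nn.BatchNorm2d(inner),
            nn.ReLU(inplace=True),
            nn.Conv2d(inner, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.residual(x))

block = ResNeXtStyleBlock(channels=256, bottleneck_width=4, cardinality=32)
```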
So another approach towards improving ResNets was this idea called Stochastic Depth, and in this work the motivation is, well, let's look more at this depth problem. Once you get deeper and
deeper, the typical problem that you're going to have is vanishing gradients, right. Your gradients will get smaller and eventually vanish as you're trying to backpropagate them through a very large number of layers. And so their motivation is, well, let's try to have short networks during training,
and they use this idea of dropping out a subset of
the layers during training. And so for a subset of the
layers they just drop out the weights and set the block to an identity connection, and now what you get is these shorter networks during training, so you can pass back your gradients better. It's also a little more efficient, and it's kind of like dropout, right. It has that sort of flavor that you've seen before. And then at test time you want to use the full deep network that you've trained.
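As a rough sketch of that training-time behavior, here's a residual block that randomly becomes an identity during training with some survival probability and is always kept at test time. The survival probability and the block internals are made-up placeholders, though scaling the residual branch by that probability at test time follows the general idea described in the paper.

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    # With probability (1 - survival_prob) the whole residual branch is
    # skipped during training, leaving just the identity connection.
    def __init__(self, channels, survival_prob=0.8):
        super().__init__()
        self.survival_prob = survival_prob
        self.residual = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.survival_prob:
                return x + self.residual(x)  # block survives this pass
            return x                         # block dropped: pure identity
        # At test time the full network is used, with the residual branch
        # scaled by its survival probability.
        return x + self.survival_prob * self.residual(x)
```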
So these are some of the works looking at the ResNet architecture, trying to understand different aspects of it and trying to improve ResNet training. And so there's also some
works now that are going beyond ResNet that are
saying, well, what are some non ResNet architectures that maybe can work comparably well or better than ResNets. And so one idea is
FractalNet, which came out pretty recently, and the
argument in FractalNet is that maybe residual representations are not actually necessary,
so this goes back to what we were talking about earlier. What's the motivation of
residual networks and it seems to make sense and there's, you know, good reasons for why this
should help but in this paper they're saying that well here
is a different architecture that we're introducing, there's
no residual representations. We think that the key is
more about transitioning effectively from shallow to deep networks, and so they have this fractal architecture which, if you look on the right here, has these layers composed in a fractal fashion. And so there's both
shallow and deep pathways to your output. And so they have these
different length pathways, they train them with
dropping out sub paths, and so again it has this
dropout kind of flavor, and then at test time they'll
use the entire fractal network, and they show that this was able to get very good performance.
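Just to give a feel for that recursive structure, here's a very simplified sketch of a fractal-style block, where each level averages a direct conv path with a deeper composition of the level below. I've left out the sub-path dropping and all of the paper's other details, and the channel handling is a placeholder of my own.

```python
import torch.nn as nn

class FractalBlock(nn.Module):
    # f_1(x) = conv(x)
    # f_{c+1}(x) = average( conv(x), f_c(f_c(x)) )
    def __init__(self, channels, num_columns):
        super().__init__()
        self.num_columns = num_columns
        self.direct = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        if num_columns > 1:
            self.sub1 = FractalBlock(channels, num_columns - 1)
            self.sub2 = FractalBlock(channels, num_columns - 1)

    def forward(self, x):
        out = self.direct(x)            # the shallow path
        if self.num_columns > 1:
            deep = self.sub2(self.sub1(x))  # the deeper, composed path
            out = (out + deep) / 2          # join the paths by averaging
        return out

block = FractalBlock(channels=64, num_columns=3)
```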
There's another idea called Densely Connected Convolutional Networks, DenseNet, and the idea here is that now we have these
blocks that are called dense blocks. And within each block
each layer is going to be connected to every other layer after it, in this feed forward fashion. So within this block,
your input to the block is also the input to
every other conv layer, and as you compute each conv output, those outputs are now connected to every layer after it; they're all concatenated together as input to each later conv layer, and they have some other processes for reducing the dimensions and keeping things efficient. And so their main takeaway from this is that they argue that
this is alleviating a vanishing gradient problem
because you have all of these very dense connections. It strengthens feature propagation
and then also encourages feature reuse, right, because there are so many of these connections, each feature map that you're learning is input to multiple later layers and gets used multiple times.
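Here's a minimal sketch of a dense block with that concatenation pattern. The growth rate and number of layers are made-up values, and I've left out the transition layers the paper uses to reduce dimensions between blocks.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    # Each layer sees the concatenation of the block input and all
    # previous layers' outputs, and contributes `growth_rate` new channels.
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            channels = in_channels + i * growth_rate
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1, bias=False),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # concat everything so far
            features.append(out)
        return torch.cat(features, dim=1)

block = DenseBlock(in_channels=16, growth_rate=12, num_layers=4)  # outputs 16 + 4*12 = 64 channels
```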
So these are just a couple of ideas for, you know, alternatives: what can we do that's not ResNets and yet still performs comparably to or better than ResNets, and so this is another very active area of current research. You can see that a lot of this is looking at the way different layers
are connected to each other and how depth is managed
in these networks. And so one last thing
that I wanted to mention quickly is efficient networks. So there's this idea of efficiency, and you saw that GoogLeNet was a work looking in this direction of how can we have efficient networks, which is important for, you know, a lot of practical usage, both training as well as especially deployment, and so this is another recent network
that's called SqueezeNet which is looking at
very efficient networks. They have these things
called fire modules, which consist of a squeeze layer with a lot of one by one filters, and this then feeds into an expand layer with one by
one and three by three filters, and they're showing that with
this kind of architecture they're able to get AlexNet
level accuracy on ImageNet, but with 50 times fewer parameters, and then you can further do
network compression on this to get up to 500 times
smaller than AlexNet and just have the whole
network be just 0.5 megabytes. And so this is a direction of efficient networks and model compression that we'll cover more in a later lecture, but this just gives you a hint of that.
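As one last sketch, here's roughly what one of those fire modules could look like, with a one by one squeeze layer followed by parallel one by one and three by three expand layers whose outputs get concatenated. The channel counts are illustrative choices of mine, not necessarily the actual SqueezeNet configuration.

```python
import torch
import torch.nn as nn

class FireModule(nn.Module):
    # Squeeze down to a few channels with 1x1 convs, then expand back out
    # with a mix of 1x1 and 3x3 convs, concatenated along the channel dim.
    def __init__(self, in_channels, squeeze_channels, expand_channels):
        super().__init__()
        self.squeeze = nn.Sequential(
            nn.Conv2d(in_channels, squeeze_channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        self.expand1x1 = nn.Sequential(
            nn.Conv2d(squeeze_channels, expand_channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        self.expand3x3 = nn.Sequential(
            nn.Conv2d(squeeze_channels, expand_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        s = self.squeeze(x)
        return torch.cat([self.expand1x1(s), self.expand3x3(s)], dim=1)

module = FireModule(in_channels=96, squeeze_channels=16, expand_channels=64)
```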
OK, so today in summary, we've talked about different kinds of CNN Architectures. We looked in-depth at four
of the main architectures that you'll see in wide usage. AlexNet, one of the early,
very popular networks. VGG and GoogleNet which
are still widely used, but ResNet is kind of taking over as the thing that you should be looking to use when you can. We also took a brief, high level look at some other networks. And so the takeaway is that these
models are available in a lot of places, so you can use them when you need them. There's a trend toward
extremely deep networks, but there's also significant
research now around the design of how layers are connected, skip connections, what is connected to what, and also using these choices to design your architecture to improve gradient flow. There's an even more recent trend towards examining the necessity of depth versus width versus residual connections, the trade offs, and what's actually helping, and so there's a lot of recent work in this direction that you can look into, including some of the ones I pointed
out if you are interested. And next time we'll talk about
Recurrent neural networks. Thanks.