Lecture 9 | CNN Architectures

Captions
- All right, welcome to lecture nine. Today we'll be talking about CNN architectures. Just a few administrative points before we get started: assignment two is due Thursday. The midterm will be in class on Tuesday, May ninth, so next week, and it will cover material through this coming Thursday, May fourth. So everything up to recurrent neural networks is going to be fair game. For the poster session we've decided on a time: it's going to be Tuesday, June sixth, from twelve to three p.m. That's the last week of classes, so we have our poster session a little bit early, during the last week, so that once you get feedback you still have some time to work on your final report, which will be due finals week.

Okay, so just a quick review of last time. Last time we talked about different kinds of deep learning frameworks. We talked about PyTorch, TensorFlow, Caffe2, and we saw that using these kinds of frameworks we were able to easily build big computational graphs, for example very large neural networks and convnets, and to easily compute gradients in these graphs: to compute the gradients for all the intermediate variables, weights, and inputs, use those to train our models, and run all of this efficiently on GPUs. We saw that for a lot of these frameworks the way this works is through the kind of modularized layers that you've been writing in your homeworks as well, where we have a forward pass and a backward pass, and then in our final model architecture all we need to do is define the sequence of layers. Using that, we're able to very easily build up very complex network architectures.

So today we're going to talk about some specific kinds of CNN architectures that are used today in cutting-edge applications and research. We'll go into depth on some of the most commonly used architectures, the winners of the ImageNet classification benchmarks: in chronological order, AlexNet, VGGNet, GoogLeNet, and ResNet. We'll go into these in a lot of depth, and after that I'll briefly go through some other architectures that are not as prominently used these days, but are interesting either from a historical perspective or as recent areas of research.

Okay, so just a quick review. We talked a long time ago about LeNet, which was one of the first instantiations of a convnet that was successfully used in practice. This was the convnet that took an input image, used five by five conv filters applied at stride one, had a couple of conv layers and a few pooling layers, and then some fully connected layers at the end. This fairly simple convnet was very successfully applied to digit recognition.

AlexNet, from 2012, which you've also heard about in previous classes, was the first large-scale convolutional neural network that was able to do well on the ImageNet classification task. In 2012 AlexNet was entered in the competition and was able to outperform all previous non-deep-learning-based models by a significant margin, and this was the convnet that started the spree of convnet research and usage afterwards. The basic AlexNet architecture is a conv layer followed by a pooling layer and normalization, then conv, pool, norm again, then a few more conv layers, a pooling layer, and then several fully connected layers afterwards. So this actually looks very similar to the LeNet network that we just saw.
There are just more layers in total. There are five of these conv layers and two fully connected layers before the final fully connected layer going to the output classes.

So let's first get a sense of the sizes involved in AlexNet. If we look at the input, AlexNet was trained on ImageNet, with inputs of size 227 by 227 by 3. And if we look at the first layer, which is a conv layer, it's 11 by 11 filters, 96 of them, applied at stride 4. So let's just think about this for a moment: what's the output volume size of this first layer? And there's a hint. Remember we have our input size, we have our convolutional filters, and we have this formula, which is the hint written over here, that gives you the size of the output dimensions after applying the conv: the full image size, minus the filter size, divided by the stride, plus one. So given that, does anyone have a guess at the final output size after this conv layer? [student speaks off mic] - So I heard 55 by 55 by 96. Yep, that's correct. Our spatial dimensions at the output are going to be 55 in each dimension, and then we have 96 total filters, so the depth after the conv layer is going to be 96. That's the output volume.

And what's the total number of parameters in this layer? Remember we have 96 11 by 11 filters. [student speaks off mic] - [Lecturer] 96 by 11 by 11, almost. Yes, there's another factor of three, that's correct. Each of the filters looks at a local region of 11 by 11 by 3, because the input depth was three. So that's each filter's size, and we have 96 of these in total, and so there are about 35K parameters in this first layer.

Okay, so now if we look at the second layer, this is a pooling layer, and in this case we have 3 by 3 filters applied at stride 2. What's the output volume of this layer after pooling? And again we have a hint, very similar to the last question. Okay, 27 by 27 by 96, yes, that's correct. The pooling layer is basically going to use the same formula that we had here, because the pooling is applied at a stride of two, so we use the same formula to determine the spatial dimensions, which come out to 27 by 27, and pooling preserves the depth. We had 96 as the input depth, and it's still going to be 96 at the output.

Next question: what's the number of parameters in this layer? I hear some muttering. [student answers off mic] - Nothing. Okay. Yes, the pooling layer has no parameters, so it's kind of a trick question. Okay, yes, question? [student speaks off mic] - The question is, why are there no parameters in the pooling layer? The parameters are the weights that we're trying to learn. Convolutional layers have weights that we learn, but for pooling all we have is a rule: we look at the pooling region and we take the max. So there are no parameters to learn.

So we can keep doing this, and it's a good exercise to go through the whole network and figure out the sizes and the parameters at every layer. If you do this all the way through, you get the final architecture shown here: 11 by 11 filters at the beginning, then 5 by 5 and some 3 by 3 filters.
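To make that exercise concrete, here is a minimal sketch of the output-size formula applied to AlexNet's first two layers; the numbers (227x227x3 input, 96 filters of 11x11 at stride 4, 3x3 pooling at stride 2) are the ones from the lecture, and the helper function name is just for illustration:

```python
def conv_output_size(input_size, filter_size, stride, pad=0):
    # Formula from the lecture: (N - F + 2P) / S + 1
    return (input_size - filter_size + 2 * pad) // stride + 1

# CONV1: 227x227x3 input, 96 filters of 11x11 at stride 4, no padding
spatial = conv_output_size(227, 11, stride=4)       # 55
print((spatial, spatial, 96))                       # output volume: (55, 55, 96)
print(11 * 11 * 3 * 96)                             # parameters: 34848, roughly 35K

# POOL1: 3x3 filters at stride 2 -- no learnable parameters
spatial = conv_output_size(spatial, 3, stride=2)    # 27
print((spatial, spatial, 96))                       # output volume: (27, 27, 96)
```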
These are generally pretty familiar-looking sizes that you've seen before, and then at the end we have a couple of fully connected layers of size 4096, and finally the last layer, FC8, goes to the softmax over the 1000 ImageNet classes.

Just a couple of details about this network. It was the first use of the ReLU non-linearity that we've talked about, which is now the most commonly used non-linearity. They used local response normalization layers, basically trying to normalize the response across neighboring channels, but this is something that's not really used anymore; other people later showed that it didn't have much of an effect. There's a lot of heavy data augmentation, and you can look in the paper for more details, but it includes things like flipping, jittering, cropping, and color normalization, all of which you'll probably find useful when you're working on your projects, so a lot of data augmentation here. They also used dropout, a batch size of 128, and trained with SGD with momentum, which we talked about in an earlier lecture. They started with a base learning rate of 1e-2, and every time the error plateaued they reduced it by a factor of 10 and kept going until they finished training, with a little bit of weight decay. In the end, in order to get the best numbers, they also did an ensemble of models: training multiple of these and averaging them together gives an additional improvement in performance.

One other thing I want to point out is that if you look at this AlexNet diagram up here, it looks like the normal convnet diagrams we've been seeing, except for one difference: you can see it's split into these two different rows, or columns, going across. The reason for this is mostly a historical note: AlexNet was trained on GTX 580 GPUs, older GPUs that only had three gigs of memory, so the entire network couldn't fit on one of them, and what they ended up doing was spreading the network across two GPUs. On each GPU you have half of the neurons, or half of the feature maps. For example, if you look at this first conv layer, the output is 55 by 55 by 96, but if you look at the diagram carefully (you can zoom in on the actual paper), you can see that it's actually only 48 deep on each GPU; they split the feature maps directly in half. What happens is that for most of these layers, for example conv 1, 2, 4, and 5, the connections are only with feature maps on the same GPU: you take as input half of the feature maps, the ones that were on the same GPU in the previous layer, rather than the full 96 feature maps. You just take as input the 48 in that first layer. And then there are a few layers, conv 3 as well as FC 6, 7, and 8, where the GPUs do talk to each other, and so there are connections with all feature maps in the preceding layer. There's communication across the GPUs, and each of these neurons is connected to the full depth of the previous input layer.

Question. - [Student] It says the full simplified AlexNet architecture. [mumbles] - Oh okay, so the question is why it says "full (simplified) AlexNet architecture" here?
It just says that because I didn't put all the details on here. This is the full set of layers in the architecture, and the strides and so on, but details like the normalization layers are not written out here. And one little note: if you look at the paper and try to work out the math of the architecture, there's a small issue on the very first layer. The figure says 224 by 224, but the numbers only work out if you treat the input as 227 by 227.

AlexNet was the winner of the ImageNet classification benchmark in 2012; you can see that it cut the error rate by quite a large margin. It was the first CNN-based winner, and it was widely used as a base architecture almost ubiquitously from then until a couple of years ago. It's still used quite a bit, for example in transfer learning for lots of different tasks, so it was the standard for a long time and it was very famous. Now, though, there are more recent architectures that generally just have better performance, and we'll talk about these next; these are the more common architectures that you'll want to use in practice.

So, quickly, first: in 2013 the ImageNet challenge was won by something called ZFNet. Yes, question. [student speaks off mic] - The question is whether there's intuition for why AlexNet was so much better than the ones that came before. Deep learning convnets were just a very different kind of approach and architecture; this was the first deep-learning-based approach, the first convnet that was entered.

So in 2013 the challenge was won by something called ZFNet (Zeiler-Fergus Net), named after the creators. This was mostly improving hyperparameters over AlexNet. It had the same number of layers and the same general structure, and they made a few changes, things like changing the stride size and different numbers of filters, and after playing around with these hyperparameters more they were able to improve the error rate. But it's still basically the same idea.

In 2014 there were a couple of architectures that were more significantly different and made another jump in performance, and the main difference with these networks, first of all, was much deeper networks. From the eight-layer networks of 2012 and 2013, in 2014 we had two very close winners that were around 19 layers and 22 layers, so significantly deeper. The winner was GoogLeNet, from Google, but very close behind was something called VGGNet, from Oxford, and VGG actually got first place in the localization challenge and some of the other tracks. So these were both very, very strong networks.

Let's first look at VGG in a little more detail. The VGG network is this idea of much deeper networks with much smaller filters. They increased the number of layers from the eight layers in AlexNet to models with 16 to 19 layers in VGGNet. One key thing they did was keep very small filters, only 3 by 3 convs all the way through, which is basically the smallest conv filter size that still looks at some of the neighboring pixels, and they just kept this very simple structure of 3 by 3 convs with periodic pooling all the way through the network. This very simple, elegant network architecture was able to get 7.3% top-5 error on the ImageNet challenge.
So first, the question of why to use smaller filters. When we take these small filters, we have fewer parameters, and we stack more of them: instead of having larger filters, we have smaller filters with more depth, more layers of them. What happens is that you end up with the same effective receptive field as if you had one 7 by 7 convolutional layer.

So here's a question: what is the effective receptive field of three of these 3 by 3 conv layers with stride one? That is, if you were to stack three 3 by 3 conv layers at stride one, what's the total spatial area of the input that a neuron at the top of the three layers is looking at? I heard fifteen pixels; why fifteen pixels? - [Student] Okay, so the reason given was because they overlap. - Okay, the reason given was because they overlap, which is on the right track, but here's what actually happens. At the first layer, the receptive field is going to be 3 by 3. Then at the second layer, each neuron looks at a 3 by 3 region of first-layer outputs, but the corners of that 3 by 3 region each see an additional pixel on each side of the original input, so the second layer is actually looking at a 5 by 5 receptive field. If you do this again, the third layer looks at 3 by 3 in the second layer, but if you draw out this pyramid, it's looking at 7 by 7 in the input layer. So the effective receptive field here is 7 by 7, which is the same as one 7 by 7 conv layer.

So this stack has the same effective receptive field as a 7 by 7 conv layer, but it's deeper, it's able to have more non-linearities in there, and it also has fewer parameters. If you look at the total number of parameters, each of these 3 by 3 conv filters has 3 times 3 times the input depth weights, so 3 x 3 x C, times the total number of output feature maps, which is again C if we preserve the number of channels. So you get 3 x 3 x C x C for each of these layers, and we have three layers, so it's three times this number. Compare that to a single 7 by 7 layer, where by the same reasoning you get 7 squared times C squared. So you have fewer parameters in total, which is nice.

Now if we look at the full network, there are a lot of numbers up here that you can go back and look at more carefully, but if we work out all of the sizes and numbers of parameters the same way we calculated the AlexNet example (this is a good exercise to go through), we can see that, going the same way, we have a couple of conv layers and a pooling layer, a couple more conv layers, a pooling layer, several more conv layers, and so on, and this just keeps going up. If you count the total number of convolutional and fully connected layers, we get 16 in this case for VGG16, and then VGG19 is just a very similar architecture with a few more conv layers in there.
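To make that comparison concrete, here is a small sketch, assuming C input and C output channels per layer as in the lecture; the variable names are just for illustration:

```python
C = 64  # example channel count; the comparison holds for any C

# Effective receptive field of three stacked 3x3 convs at stride 1: 3 -> 5 -> 7
receptive_field = 1
for _ in range(3):
    receptive_field += 2        # each extra 3x3 layer adds one pixel on each side
print(receptive_field)          # 7, the same as a single 7x7 conv

# Parameter counts (weights only, biases ignored), with C channels in and out
three_3x3 = 3 * (3 * 3 * C * C)   # 27 * C^2
one_7x7 = 7 * 7 * C * C           # 49 * C^2
print(three_3x3, one_7x7)         # fewer parameters, plus two extra non-linearities
```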
The total memory usage of this network, just making a forward pass and counting up all of these numbers (the memory numbers here are written in terms of the total number of values, like we calculated earlier), at four bytes per number comes out to about 100 megs per image. So that's the scale of the memory usage, and this is only for a forward pass; when you do a backward pass you're going to have to store more, so this is pretty heavy memory-wise. At 100 megs per image, if you have only five gigs of total memory, you're only going to be able to fit about 50 of these. Also, the total number of parameters here is 138 million, which compares with 60 million for AlexNet.

Question? [student speaks off mic] - The question is what we mean by deeper: is it the number of filters or the number of layers? Deeper in this case always refers to layers. There are two usages of the word depth, which is confusing: one is the depth of a volume, width by height by depth, where you can use the word depth, but in general when we talk about the depth of a network it's the total number of layers in the network, and usually in particular we're counting the total number of weight layers, the layers with trainable weights, so convolutional layers and fully connected layers.

[student mumbles off mic] - Okay, so the question is, within each layer, what do the different filters mean? We talked about this back in the convnet lecture, so you can go back and refer to that, but each filter is, let's say, a 3 by 3 conv: a set of weights looking at a 3 by 3 by input-depth region, and it produces one feature map, one activation map, of its responses at the different spatial locations. And we can have as many filters as we want, for example 96, and each of these produces a feature map. Each filter corresponds to a different pattern that we're looking for in the input; we convolve it around, we see the responses everywhere in the input, and we create a map of these, and then another filter is convolved over the image and creates another map.

Question. [student speaks off mic] - The question is whether there's intuition behind why, as you go deeper into the network, we have more channel depth, more filters. You can use any design that you want, so you don't have to do this, but in practice you'll see this happen a lot of the time, and one of the reasons is that people try to maintain a relatively constant level of compute. As you go deeper into your network you're usually also downsampling, so the spatial area gets smaller, and then it's not as expensive to increase the depth, because each layer is spatially smaller. So that's one reason.

Question. [student speaks off mic] - Performance-wise, is there any reason to use an SVM loss instead of a softmax loss? No, for a classifier you can use either one, and you did that earlier in the class as well, but in general softmax losses have worked well and are the standard for classification here. Okay, yeah, one more question.
[student mumbles off mic] - Yes, so the question is whether we have to store all of this in memory, or whether we can throw away the parts we don't need. And yes, that's true to some extent; some of this you don't need to keep. But you're also going to be doing a backward pass, where for the most part, when you're doing the chain rule, you need a lot of these activations, so in large part a lot of this does need to be kept.

If we look at the distribution of where the memory is used and where the parameters are, you can see that a lot of the memory is in these early layers, where you still have large spatial dimensions, and a lot of the parameters are actually in the last layers: the fully connected layers have a huge number of parameters because of all these dense connections. That's something to keep in mind; later on we'll see that some networks actually get rid of these fully connected layers and save a lot on the number of parameters.

And then just one last thing to point out: you'll also see different ways of naming all of these layers. Here I've written out exactly what the layers are: conv3-64 means 3 by 3 convs with 64 total filters. But for VGGNet, on this diagram on the right, there's also a common way people refer to each group of filters, so each orange block here, as conv1-1, conv1-2, and so on. Just something to keep in mind.

So VGGNet ended up getting second place in the ImageNet 2014 classification challenge and first in localization. They followed a very similar training procedure to Alex Krizhevsky's for AlexNet. They didn't use local response normalization; as I mentioned earlier, they found that it didn't really help, so they took it out. You'll see VGG16 and VGG19 as the common variants, and this is just the number of layers; 19 is slightly deeper than 16. In practice VGG19 works a very little bit better but uses a bit more memory, so you can use either, though 16 is very commonly used. For best results, like AlexNet, they did ensembling, averaging several models together, and you get better results. They also showed in their work that the FC7 features, the 4096-dimensional layer just before the final fully connected layer going to the 1000 ImageNet classes, are a good feature representation that can be used as-is to extract features from other data and generalize to other tasks as well. So FC7 is a good feature representation.

Yeah, question. [student speaks off mic] - Sorry, what was the question? Okay, the question is what localization is here. This is a task we'll talk about a little more in a later lecture on detection and localization, so I don't want to go into detail here, but basically, given an image, you're not just classifying what the class of the image is, but also drawing a bounding box around where that object is in the image. The difference from detection, which is a very related task, is that in detection there can be multiple instances of the object in the image; in localization we assume there's just one, so it's classification plus this additional bounding box.

So we looked at VGG, which was one of the deep networks from 2014, and now we'll talk about GoogLeNet, which was the other one, the one that won the classification challenge.
GoogLeNet, again, was a much deeper network, with 22 layers, but one of the main insights and special things about GoogLeNet is that it really looked at the problem of computational efficiency: it tried to design a network architecture that was very efficient in the amount of compute. They did this using the inception module, which we'll go into in more detail, basically stacking a lot of these inception modules on top of each other. There are also no fully connected layers in this network; they got rid of those and were able to save a lot of parameters, so in total there are only five million parameters, which is twelve times less than AlexNet's 60 million, even though it's much deeper. It got 6.7% top-5 error.

So what's the inception module? The idea behind the inception module is that they wanted to design a good local network topology; you can think of it as a network within a network, and then they stack a lot of these local topologies on top of each other. In this local network that they're calling an inception module, they apply several different kinds of filter operations in parallel on the same input coming into the layer. We have our input coming in from the previous layer, and then we do different kinds of convolutions: a 1 by 1 conv, a 3 by 3 conv, a 5 by 5 conv, and they also have a pooling operation, in this case 3 by 3 pooling. You get all of these different outputs from these different operations, and then they concatenate all of these filter outputs together depth-wise, which creates one tensor output at the end that gets passed on to the next layer.

If we look at a naive way of doing this, we just do exactly that: we have all of these different operations, we get the outputs, and we concatenate them together. So what's the problem with this? It turns out that computational complexity is going to be the problem here.

Let's look more carefully at an example. Here, just as an example, I've put a 1 by 1 conv with 128 filters, a 3 by 3 conv with 192 filters, and a 5 by 5 conv with 96 filters. Assume everything uses the stride and padding that maintain the spatial dimensions, and that we have this 28 by 28 by 256 input coming in. So what is the output size of the 1 by 1 conv with 128 filters? Who has a guess? Okay, I heard 28 by 28 by 128, which is correct. With a 1 by 1 conv we maintain the spatial dimensions, and each conv filter looks through the entire 256 depth of the input, but the output is a 28 by 28 feature map for each of the 128 filters that we have in this conv layer, so we get 28 by 28 by 128.

Now if we do the same thing and look at the output sizes of all of the different filters here: after the 3 by 3 conv we're going to have a volume of 28 by 28 by 192; after the 5 by 5 conv we have 96 filters, so 28 by 28 by 96; and the pooling layer preserves the depth, and because of our stride we're also preserving the spatial dimensions there.
So now if we look at the output size after filter concatenation: these are all 28 by 28, and we're concatenating depth-wise, so we get 28 by 28 by all of these depths added together, and the total output size is going to be 28 by 28 by 672. So the input to our inception module was 28 by 28 by 256, and the output from this module is 28 by 28 by 672. We kept the same spatial dimensions and we blew up the depth.

Question. [student speaks off mic] - Okay, the question is how we're getting 28 by 28 for everything. Here we're using zero padding in order to maintain the spatial dimensions, and that way we can do this filter concatenation depth-wise.

Question in the back. [student speaks off mic] - Okay, the question is what the 256 depth at the input is, and this is not the input to the network, it's just the input to this local module that I'm looking at. In this case 256 is the output depth of the inception module that came just before this one. And coming out we have 28 by 28 by 672, which is going to be the input to the next inception module.

Question. [student speaks off mic] - Okay, the question is how we got 28 by 28 by 128 for the first conv. It's a 1 by 1 convolution, so we take this 1 by 1 filter and slide it across our 28 by 28 by 256 input spatially, and at each location it does a dot product through the entire 256 depth. So we slide this 1 by 1 conv over spatially and we get a feature map out that's 28 by 28 by 1: there's one number at each spatial location coming out. Each filter produces one of these 28 by 28 by 1 maps, and we have a total of 128 filters here, so that produces 28 by 28 by 128.

Okay, so let's look at the number of operations happening in these convolutional layers, starting with the 1 by 1 conv. As I was just saying, at each location we're doing a 1 by 1 by 256 dot product, so there are 256 multiply operations happening there. For each feature map we have 28 by 28 spatial locations, so that's the 28 times 28, the first two numbers multiplied here; those are the spatial locations for each feature map, and we do these 256 multiplications at each one of them. Then we have 128 total filters at this layer, producing 128 total feature maps, so the total number of operations here is 28 times 28 times 128 times 256. The same reasoning applies to the 3 by 3 conv and the 5 by 5 conv, exactly the same principle, and in total we get 854 million operations happening here.

- [Student] And the 128, 192, and 96 are just values [mumbles] - The question is whether the 128, 192, and 96 are just values that I picked. Yes, I picked them, but they're not values I just made up: they're similar to the ones you'll see in a particular layer of the inception net. In GoogLeNet each module has a different set of these kinds of parameters, and I picked one that was similar to one of them. So these operations are very expensive computationally. And the other thing I want to note is that the pooling layer also adds to this problem, because it preserves the full feature depth.
At every layer, your total depth can only grow: you carry the full feature depth through the pooling layer, plus all the additional feature maps from the conv layers, and add these together. Here our input was 256 deep and our output is 672 deep, and it will just keep increasing as you go up.

So how do we deal with this and keep it manageable? One of the key insights GoogLeNet used is that we can address this with bottleneck layers: project the feature maps down to a lower dimension before the expensive convolutional operations. What exactly does that mean? As a reminder, a 1 by 1 convolution takes your input volume and performs a dot product at each spatial location. It preserves the spatial dimensions but reduces the depth, by projecting the input depth to a lower dimension; it's basically a linear combination of your input feature maps. So the main idea is that it projects your depth down, and the inception module adds these 1 by 1 convs at a bunch of places in order to alleviate the expensive compute: before the 3 by 3 and 5 by 5 conv layers it puts in one of these 1 by 1 convolutions, and after the pooling layer it puts in an additional 1 by 1 convolution. These are the 1 by 1 bottleneck layers that are added in.

So how does this change the math we were looking at earlier? Basically what happens is that we still have the same 28 by 28 by 256 input, but these 1 by 1 convs reduce the depth dimension. You can see that before the 3 by 3 convs, if I put a 1 by 1 conv with 64 filters, its output is going to be 28 by 28 by 64. So instead of 28 by 28 by 256 going into the 3 by 3 convs, we only have a 28 by 28 by 64 block going in. This gives a smaller input going into these conv layers; the same thing happens for the 5 by 5 conv, and after the pooling layer comes out, we reduce the depth after it as well. And if you work out the math the same way for all of the convolutional ops here, adding in all these 1 by 1 convs on top of the 3 by 3s and 5 by 5s, the total number of operations is 358 million, much less than the 854 million we had in the naive version. So you can see how you can use these 1 by 1 convs, and the filter counts you choose for them, to control your computation.

Yes, question in the back. [student speaks off mic] - Yes, so the question is whether anyone has looked into what information might be lost by doing this 1 by 1 conv at the beginning. There might be some information loss, but at the same time, when you're doing these projections you're taking linear combinations of input feature maps which have redundancy in them, and you're also introducing an additional non-linearity after the 1 by 1 conv, so it actually helps in that way by adding a little more depth. I don't think there's a rigorous analysis of this, but in general this works better, and there are reasons why it helps as well.
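As a rough sketch of that arithmetic, counting multiplies the same way the lecture does and using the example filter counts from the slides (the 358M figure at the end is the lecture's quoted total for the full bottlenecked module, not something computed here):

```python
def conv_ops(h, w, num_filters, kernel, input_depth):
    # multiplies = (spatial positions) x (filters) x (kernel * kernel * input_depth per dot product)
    return h * w * num_filters * kernel * kernel * input_depth

H = W = 28
naive = (conv_ops(H, W, 128, 1, 256)     # 1x1 conv, 128 filters
         + conv_ops(H, W, 192, 3, 256)   # 3x3 conv, 192 filters
         + conv_ops(H, W, 96, 5, 256))   # 5x5 conv, 96 filters
print(naive)                             # ~854 million multiplies

# With a 1x1, 64-filter bottleneck in front, the expensive convs see depth 64 instead of 256:
print(conv_ops(H, W, 192, 3, 256), "->", conv_ops(H, W, 192, 3, 64))  # ~347M -> ~87M
print(conv_ops(H, W, 96, 5, 256), "->", conv_ops(H, W, 96, 5, 64))    # ~482M -> ~120M
# Counting all of the 1x1 reduce/projection layers as well, the lecture quotes ~358M total.
```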
Okay, so we're basically using these 1 by 1 convs to help manage the computational complexity, and then what GoogLeNet does is take these inception modules and stack them all together. This is the full inception architecture. If we look at it in a little more detail (I've flipped it here, because it's so big it doesn't fit vertically on the slide anymore), we start with this stem network, which is more like the vanilla, plain conv net that we've seen earlier: conv, pool, a couple of convs, another pool, just to get started. After that we have all of our multiple inception modules stacked on top of each other, and then at the top we have our classifier output. Notice here that they've really removed the expensive fully connected layers; it turns out that the model works great without them, and you cut a lot of parameters.

What they also have here, as you can see, are a couple of extra branches coming out, and these are auxiliary classification outputs. These are just little mini-networks with an average pooling, a 1 by 1 conv, a couple of fully connected layers, and then a 1000-way softmax over the ImageNet classes. So you're actually applying your ImageNet classification loss in three separate places: the standard end of the network, as well as at these two earlier places in the network. The reason they do that is that this is a deep network, and they found that having these additional auxiliary classification outputs gives you more gradient signal injected at the earlier layers, more helpful signal flowing in, because these intermediate layers should also be useful: you should be able to do classification based on some of them as well.

So this is the full architecture; there are 22 total layers with weights, where within each of these modules each of the 1 by 1, 3 by 3, and 5 by 5 convs is a weight layer, counting all of these parallel layers. In general it's a relatively carefully designed architecture: part of it is based on some of the intuitions we've been talking about, and part of it is just that the Google authors had huge clusters and were cross-validating across all kinds of design choices, and this is what ended up working well.

Question? [student speaks off mic] - Yeah, so the question is whether the auxiliary outputs are actually useful for the final classification. I think when they're training they do average the losses coming out of all of these. I think they are helpful. I can't remember whether in the final architecture they average all of these or just take one; it seems very possible that they would use all of them, but you'd need to check on that.

[student speaks off mic] - The question is, for the bottleneck layers, is it possible to use some other type of dimensionality reduction? Yes, you can use other kinds of dimensionality reduction. The benefit of the 1 by 1 conv is that you get this effect, but it's a conv layer just like any other: it's part of the whole network, you just train the full network with backprop through everything, and it learns how to combine the previous feature maps. Okay, yeah, question in the back.
[student speaks off mic] - Yes, so the question is whether any weights are shared or whether they're all separate, and yeah, all of these layers have separate weights.

Question. [student speaks off mic] - Yes, so the question is why we have to inject gradients at the earlier layers. From our classification output at the very end, the gradient is passed all the way back through the chain rule, but the problem is that with very deep networks, as you go all the way back through, some of this gradient signal can become diminished and lost by the time it gets close to the beginning, and that's why having these additional outputs at earlier parts can provide some additional signal.

[student mumbles off mic] - The question is whether you're doing backprop separately for each output. No, it's just one backprop all the way through. You can think of these three outputs as being combined, like an addition at the end, if you were to draw out the computational graph: you get your final signal and you can take all of these gradients and backprop them all the way through, as if they were added together at the end of the computational graph. Okay, in the interest of time, because we still have a lot to get through, I'll take other questions offline.

Okay, so GoogLeNet: basically 22 layers, an efficient inception module, no fully connected layers, twelve times fewer parameters than AlexNet, and it's the ILSVRC 2014 classification winner.

So now let's look at the 2015 winner, which is the ResNet network, and here the idea is really this revolution of depth. We were starting to increase depth in 2014, and here we have this hugely deeper model: the ResNet architecture was 152 layers. So let's look at that in a little more detail. The ResNet architecture gets to extremely deep networks, much deeper than any networks before, and it does this using the idea of residual connections, which we'll talk about. They had a 152-layer model for ImageNet; they were able to get 3.57% top-5 error with it, and the really special thing is that they swept all the classification and detection contests in the ImageNet benchmark and in this other benchmark called COCO. It basically won everything; it was just clearly better than everything else.

So now let's go into a little bit of the motivation behind ResNet and these residual connections. The question they started off by trying to answer is: what happens when we try to stack deeper and deeper layers on a plain convolutional neural network? If we take something like VGG, or some normal network that's just stacks of conv and pool layers on top of each other, can we just keep extending it, get deeper, and do better? And the answer is no. If you look at what happens when you get deeper, here I'm comparing a 20-layer network and a 56-layer network, just plain networks, and you'll see that in the test error here on the right, the 56-layer network is doing worse than the 20-layer network. So the deeper network was not able to do better. But the really weird thing is when you look at the training error: here again we have the 20-layer network and the 56-layer network. With the 56-layer network, one of the obvious suspects you'd think of is: I have a really deep network, I have tons of parameters, maybe it's starting to overfit at some point.
But what actually happens is that when you're overfitting, you would expect to have very low training error and just bad test error, whereas here, even in the training error, the 56-layer network is doing worse than the 20-layer network. So even though the deeper model performs worse, this is not caused by overfitting.

The hypothesis of the ResNet creators is that the problem is actually an optimization problem: deeper models are just harder to optimize than shallower networks. The reasoning is that a deeper model should be able to perform at least as well as a shallower model. There's a solution by construction: you take the learned layers from the shallower model, copy them over, and for the remaining additional deeper layers you just add identity mappings. By construction this should work just as well as the shallower network, and the deeper models that weren't learning properly should be able to learn at least this.

Motivated by this, their solution was: how can we make it easier for the architecture, the model, to learn these kinds of solutions, or at least something like them? Their idea is that instead of just stacking all these layers on top of each other and having every layer try to learn some underlying mapping, some desired function, directly, let's instead have these blocks where we try to fit a residual mapping instead of a direct mapping. What this looks like is shown on the right: the input to the block is just the input X coming in, and we use our weight layers to try to fit some residual, H of X minus X, instead of the desired function H of X directly. At the end of the block we take the skip connection shown here, this loop, where we just pass the input through as an identity. If we had no weight layers in between, the output would just be the identity, the same thing as the input, but now we use the additional weight layers to learn some delta, some residual from X. So the output of this is going to be our original X plus this residual, which is basically a delta.

The idea is that this makes things easy: for example, in the case where the identity is ideal, you just squash all of the weights of F of X, the weight layers, set them to zero, and then you get the identity as the output, so you can get something close to that solution by construction we had earlier. This is a network architecture that says: let's have our weight layers learn a residual, something that's likely to be close to X, just a modification of X, rather than learning the full mapping from scratch. Okay, any questions about this?

[student speaks off mic] - The question is whether these are the same dimension. Yes, these two paths are the same dimension. In general either it's the same dimension, or what they actually do is use projection shortcuts, and they have different ways of padding, to make things work out to the same dimension depth-wise. Yes. - [Student] When you use the word residual, you were talking about [mumbles off mic] - So the question is what exactly we mean by residual, whether the output of this transformation is the residual.
So we can think of our output here as F of X plus X, where F of X is the output of our transformation and X is our input, passed through by the identity. With a plain layer, what we're trying to do is learn some H of X, but what we saw earlier is that it's hard to learn a good H of X as we get very deep networks. So here the idea is to break it down: write H of X as F of X plus X, and just try to learn F of X. Instead of learning H of X directly, we just want to learn what we need to add or subtract from our input as we move on to the next layer. You can think of it as modifying the input in place, in a sense.

[interrupted by student mumbling off mic] - The question is, when we say the word residual, are we talking about F of X? Yes, F of X is what we're calling the residual, and it just has that meaning.

Yes, another question. [student mumbles off mic] - The question is whether in practice we just sum F of X and X together, or whether we learn some weighted combination. You just do a direct sum, because a direct sum is exactly this idea of "let me just learn what I have to add or subtract onto X." Is the main intuition clear to everybody?

Question. [student speaks off mic] - Yeah, so the question is that it's not clear why learning the residual should be easier than learning the direct mapping. This is just their hypothesis, and the hypothesis is that if you're learning the residual, you only have to learn the delta to X. Remember the solution by construction, where we had some number of shallow learned layers and all these identity mappings on top: that was a solution that should have been good, which implies that maybe for a lot of these layers, something close to the identity would be a good layer. Because of that, they formulate it as learning the identity plus just a little delta. If the identity really is best, we just squash F of X, make the transformation zero, which seems relatively easy to learn, and we can also get things that are close to identity mappings. Again, this is not something that's proven; it's the intuition and hypothesis, and we'll also see later some work where people challenge this and say maybe it's not actually the residuals that are so necessary. But at least this is the hypothesis of this paper, and in practice this model was able to do very well.

Question. [student speaks off mic] - Yes, so the question is whether people have tried other ways of combining the inputs from previous layers, and yes, this is a very active area of research: how to formulate all these connections and what's connected to what in these structures. We'll see a few more examples of different network architectures briefly later, but this is an active area of research.

Okay, so we basically have all of these residual blocks stacked on top of each other, and we can see the full ResNet architecture. Each of these residual blocks has two 3 by 3 conv layers, and there's been work showing that this happens to be a good configuration that works well. We stack all of these blocks together very deeply.
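As a minimal sketch of that structure (not the actual ResNet code: just the two 3 by 3 conv layers and the F(x) + x shortcut described above, with batch normalization and the downsampling shortcuts omitted for clarity):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Two 3x3 convs learn a residual F(x); the block outputs F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        residual = self.conv2(self.relu(self.conv1(x)))  # F(x)
        return self.relu(residual + x)                   # F(x) + x via the identity shortcut

block = BasicResidualBlock(64)
out = block(torch.randn(1, 64, 56, 56))  # same shape in and out: (1, 64, 56, 56)
```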
With this very deep architecture, going up to 150 layers deep, what we do is stack all of these blocks, and periodically we double the number of filters and downsample spatially using stride two when we do that. We also have an additional conv layer at the very beginning of the network, and at the end we don't have any fully connected layers; we just have a global average pooling layer that averages over everything spatially and then feeds into the final 1000-way classification. So this is the full ResNet architecture, and it's very simple and elegant, just stacking up all of these residual blocks. They have total depths of 34, 50, 101 layers, and they tried up to 152 for ImageNet.

One additional thing to know is that for the very deep networks, the ones that are more than 50 layers deep, they also use bottleneck layers, similar to what GoogLeNet did, in order to improve efficiency. Within each block, what they do is have a 1 by 1 conv filter that first projects down to a smaller depth. So again, if we're looking at, say, a 28 by 28 by 256 input, we do this 1 by 1 conv and it projects the depth down; we get 28 by 28 by 64. Now your 3 by 3 conv (in here they only have one) is operating over this reduced depth, so it's less expensive, and then afterwards they have another 1 by 1 conv that projects the depth back up to 256. This is the actual block you'll see in the deeper networks.

In practice ResNet also uses batch normalization after every conv layer; they use Xavier initialization with an extra scaling factor that they introduced to improve the initialization; they trained with SGD plus momentum; and they use a similar learning rate schedule, where you decay the learning rate when the validation error plateaus. Mini-batch size 256, a little bit of weight decay, and no dropout.

Experimentally, they were able to show that they could train these very deep networks without degrading: they had good gradient flow coming all the way back down through the network. They tried up to 152 layers on ImageNet and 1200 layers on CIFAR, which is a smaller data set you've played with, and they also saw that these deeper networks were able to achieve lower training error, as expected. So you don't see the same strange plots that we saw earlier, where the behavior was going in the wrong direction.

From here they were able to sweep first place in all of the ILSVRC competitions and all of the COCO competitions in 2015 by significant margins. Their top-5 error was 3.6% for classification, and this is actually better than the human performance reported for ImageNet. There was a human metric that came from our lab: Andrej Karpathy spent about a week training himself and then did the task himself, and was somewhere around 5%, so ResNet was basically able to do better than at least that human.

Okay, so these are kind of the main networks that have been used recently. We had AlexNet starting things off; VGG and GoogLeNet are still very popular; but ResNet is the most recent, best-performing model, so if you're looking to train a new network, ResNet is available and you should try working with it.
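And a minimal sketch of the bottleneck version, following the 256 -> 64 -> 64 -> 256 example above (again just the residual branch, with batch normalization and the projection shortcut omitted):

```python
import torch.nn as nn

def bottleneck_residual_branch(in_depth=256, bottleneck_depth=64):
    # 1x1 projects the depth down, 3x3 operates on the cheaper 64-deep volume,
    # and a final 1x1 projects the depth back up to 256.
    return nn.Sequential(
        nn.Conv2d(in_depth, bottleneck_depth, kernel_size=1), nn.ReLU(),
        nn.Conv2d(bottleneck_depth, bottleneck_depth, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(bottleneck_depth, in_depth, kernel_size=1),
    )

# In the real network this branch computes F(x), which is still added to the
# identity shortcut x (and passed through a ReLU), just like the basic block.
```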
So just quickly, let's get a better sense of the complexity involved. Here we have some plots sorted by performance; this is top-1 accuracy, and higher is better. You'll see a lot of the models we talked about, as well as different versions of them, so this GoogLeNet Inception family has a V2 and V3, and the best one here is V4, which is actually a ResNet plus Inception combination; these are more incremental, smaller changes built on top, and that's the best-performing model here.

If we look on the right, these are plots of computational complexity. The Y axis is top-1 accuracy, so higher is better; the X axis is the number of operations, so the further to the right, the more ops you're doing and the more computationally expensive it is; and the size of the circle is the memory usage (the gray circles are reference sizes). Here we can see that VGG, these green ones, is kind of the least efficient: it has the biggest memory footprint and the most operations, but it does pretty well. GoogLeNet is the most efficient here; it's way down on the operations side, with a small circle for memory usage. AlexNet, our earlier model, has the lowest accuracy; it's relatively small compute, because it's a smaller network, but it's also not particularly memory efficient. And ResNet has moderate efficiency, kind of in the middle both in terms of memory and operations, and it has the highest accuracy.

Here also are some additional plots you can look at more on your own time. The plot on the left shows the forward pass time, in milliseconds; up at the top, a VGG forward pass is about 200 milliseconds, so you can get about five frames per second, and this is sorted in order. There's also a plot on the right looking at power consumption, and if you look at the paper these plots come from, there's further analysis of these kinds of computational comparisons.

So those were the main architectures that you should really know in depth, be familiar with, and be thinking about actively using. Now I'm going to go briefly through some other architectures that are just good to know, either as historical inspirations or as more recent areas of research.

The first one is Network in Network, from 2014. The idea behind this is that along with the vanilla convolutional layers, it introduces the idea of MLP conv layers, as they call them, which are micro-networks, basically a network within a network, hence the name of the paper. Within each conv layer they stack an MLP with a couple of fully connected layers on top of the standard conv, to be able to compute more abstract features for these local patches. So instead of sliding just a conv filter around, it's sliding a slightly more complex hierarchical set of filters around and using that to get the activation maps. It uses these fully connected, basically 1 by 1 conv, layers, and stacks them all up like the bottom diagram here, where we have these networks within networks stacked at each of the layers. The main reason to know this is that it was a precursor to GoogLeNet and ResNet with this idea of bottleneck layers that you saw used very heavily there.
It also had a bit of philosophical inspiration for GoogLeNet, for this idea of a local network topology, a network within a network, which they used with a different kind of structure.

Now I'm going to talk about a series of works since ResNet that are mostly geared towards improving ResNet, so this is more recent research. I'll go over these pretty fast, at a very high level; if you're interested in any of them you should look at the papers for more details.

The authors of ResNet, a little later in 2016, had a paper where they improved the ResNet block design. They adjusted which layers are in the ResNet block path and showed that this new structure gives a more direct path for propagating information through the network: you want a good path to propagate information all the way up, and then gradients all the way back down again. They showed that this new block was better for that and was able to give better performance.

There's also Wide Residual Networks. This paper argued that while ResNets made networks much deeper as well as adding residual connections, the residuals are really the important factor, having this residual construction, and not necessarily having extremely deep networks. So what they did was use wider residual blocks, which just means more filters in every conv layer: before we might have F filters per layer, and they use a factor of K, so every layer has F times K filters instead. Using these wider layers, they showed that their 50-layer wide ResNet was able to outperform the 152-layer original ResNet. It also has the additional advantage that, even with the same number of parameters, it's more computationally efficient, because width parallelizes more easily: convolutions with more filters are just spread across more kernels, whereas depth is more sequential. So it's more computationally efficient to increase your width. You can see that this work is starting to try to understand the contributions of width, depth, and residual connections, and making arguments for one versus the other.

This other paper from around the same time, maybe a little later, is ResNeXt, and this is again the creators of ResNet continuing to push on the architecture. Here they also take up this idea of width, but instead of just increasing the width of the residual block through more filters, they add structure: within each residual block there are multiple parallel pathways, and they call the total number of these pathways the cardinality. It's basically taking the one ResNet bottleneck block, making it relatively thinner, and having multiple of them in parallel. You can see that this has some relation to the idea of wide networks, as well as some connection to the inception module, where we have these layers operating in parallel; ResNeXt has some flavor of that as well.

Another approach towards improving ResNets is this idea called stochastic depth, and in this work the motivation is to look more closely at this depth problem.
So another approach towards improving ResNets is an idea called stochastic depth, and the motivation here is to look more closely at the depth problem. Once you get deeper and deeper, the typical problem you run into is vanishing gradients: the gradients get smaller and smaller and eventually vanish as you try to backpropagate them through a very large number of layers. So their idea is to effectively have shorter networks during training, and they do this by dropping out a subset of the layers during training. For a randomly chosen subset of the residual blocks they drop the weights and just pass the input through the identity connection, so during training you have these shorter networks and can pass your gradients back more easily. It's also a little more efficient, and it has the same flavor as dropout, which you've seen before. Then at test time you use the full deep network that you've trained. So these are some of the works that look at the ResNet architecture, trying to understand different aspects of it and trying to improve ResNet training. There are also works now that go beyond ResNet and ask which non-ResNet architectures can work comparably to or better than ResNets. One idea is FractalNet, which came out pretty recently, and the argument in FractalNet is that residual representations may not actually be necessary; this goes back to what we were talking about earlier about the motivation for residual networks, which seems to make sense and has good reasons for why it should help, but in this paper they're saying, here is a different architecture with no residual representations, and we think the key is really about transitioning effectively from shallow to deep networks. So they have this fractal architecture which, if you look on the right here, composes layers in a fractal fashion so that there are both shallow and deep pathways to the output. They have these different-length pathways, they train by dropping out sub-paths, so again it has this dropout flavor, and then at test time they use the entire fractal network, and they show that this was able to get very good performance. There's another idea called Densely Connected Convolutional Networks, or DenseNet, where the network is built from blocks called dense blocks. Within each block, every layer is connected to every later layer in a feed-forward fashion: the input to the block is also an input to every conv layer inside it, and as you compute each conv layer's output, those outputs are concatenated together and fed as input to the layers after it, and they have some additional processing to reduce the dimensions and keep this efficient. Their main takeaway is that this alleviates the vanishing gradient problem, because you have all of these very dense connections; it strengthens feature propagation and also encourages feature reuse, because with so many connections, each feature map you learn is the input to multiple later layers and gets used multiple times. So these are a couple of alternatives, things that are not ResNets and yet still perform comparably to or better than ResNets, and this is another very active area of current research.
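To make the dense-block idea concrete, here is a minimal PyTorch sketch in which each layer's input is the concatenation of the block input and all earlier feature maps in the block. The growth rate and number of layers are illustrative, and the bottleneck and transition layers the paper uses to control size are left out for clarity.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Sketch of a DenseNet-style dense block.

    Each layer sees the concatenation of the block input and every earlier
    layer's output, and contributes `growth_rate` new feature maps. The real
    network also uses 1x1 bottlenecks and transition layers, omitted here.
    """
    def __init__(self, in_channels, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1, bias=False),
            ))
            channels += growth_rate  # each new output is concatenated on

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # all previous maps as input
            features.append(out)
        return torch.cat(features, dim=1)

x = torch.randn(2, 64, 32, 32)
block = DenseBlock(64, growth_rate=32, num_layers=4)
print(block(x).shape)  # torch.Size([2, 192, 32, 32]); 192 = 64 + 4 * 32
```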
You can see that a lot of this research is about how different layers are connected to each other and how depth is managed in these networks. One last thing I wanted to mention quickly is efficient networks. You saw that GoogLeNet already pushed in this direction of how we can have efficient networks, which matters for a lot of practical usage, both training and especially deployment, and a more recent network aimed at this is SqueezeNet. It's built from things called fire modules, which consist of a squeeze layer with a lot of one-by-one filters that then feeds into an expand layer with one-by-one and three-by-three filters, and they show that with this kind of architecture they're able to get AlexNet-level accuracy on ImageNet with 50 times fewer parameters; you can then further apply network compression to get up to 500 times smaller than AlexNet, so the whole network is only about 0.5 megabytes. So this is the direction of efficient networks and model compression, which we'll cover more in a later lecture, but that gives you a hint of it. OK, so today in summary we've talked about different kinds of CNN architectures. We looked in depth at four of the main architectures that you'll see in wide usage: AlexNet, one of the early, very popular networks; VGG and GoogLeNet, which are still widely used; and ResNet, which is taking over as the thing you should look at first when you can. We also gave a brief overview of several other networks. The takeaway is that these models are available in a lot of the deep learning frameworks, so you can use them when you need them. There's a trend toward extremely deep networks, but there's also significant research now around the design of how layers are connected, skip connections, what is connected to what, and using these design choices to improve gradient flow. There's an even more recent trend toward examining the necessity of depth versus width versus residual connections, the trade-offs, and what's actually helping, and there's a lot of recent work in this direction that you can look into, including some of the papers I pointed out, if you're interested. And next time we'll talk about recurrent neural networks. Thanks.
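Going back to the fire module described above, here is a minimal PyTorch sketch of the structure: a one-by-one squeeze conv followed by parallel one-by-one and three-by-three expand convs whose outputs are concatenated. The channel counts below are illustrative, not SqueezeNet's exact configuration.

```python
import torch
import torch.nn as nn

class FireModule(nn.Module):
    """Sketch of a SqueezeNet-style fire module.

    A 1x1 "squeeze" conv reduces the channel count, then an "expand" stage
    runs 1x1 and 3x3 convs in parallel and concatenates their outputs.
    Channel counts are illustrative.
    """
    def __init__(self, in_channels, squeeze_channels, expand_channels):
        super().__init__()
        self.squeeze = nn.Conv2d(in_channels, squeeze_channels, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_channels, expand_channels, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_channels, expand_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.expand1x1(s)),
                          self.relu(self.expand3x3(s))], dim=1)

x = torch.randn(1, 96, 55, 55)
fire = FireModule(96, squeeze_channels=16, expand_channels=64)
print(fire(x).shape)  # torch.Size([1, 128, 55, 55])
```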
Info
Channel: Stanford University School of Engineering
Views: 309,394
Rating: 4.907012 out of 5
Id: DAOcjicFr1Y
Length: 77min 40sec (4660 seconds)
Published: Fri Aug 11 2017