CS231n Lecture 7 - Convolutional Neural Networks

Captions
So today we finally get to cover convolutional neural networks, so we're super excited, but first let me dive into some administrative items. As a reminder, assignment 2 is due next Friday. How is assignment 2 going, by the way? Have people finished the fully connected stuff at all? Some people, okay. How about batch norm, has anyone finished batch norm? Okay, alright. Another thing to worry about, perhaps, for you guys is the project proposal, which is due very soon, on Saturday. It's ungraded; we just want a paragraph. We want to make sure that you guys are on the right track, that you've thought about the project, and that you have a rough proposal for what you want to do. You can also send us a few possibilities, and by a few I mean two, don't send us hundreds, and we can help you choose a bit. But yeah, project proposal due soon.

Okay, so we are working with neural networks; we're training neural networks in a four-step process. Last class we talked about parameter updates, so we covered a whole bunch of parameter updates. We covered Adam, which is not on that image, and we did not cover Adadelta, which is on that image, but Adadelta is kind of analogous to the other parameter updates that I covered last time. We also talked about dropout, and I briefly introduced convolutional neural networks, and I talked about some of the history of the field and how this developed. In particular I talked about the experiments of Hubel and Wiesel in the 1960s with the cat visual cortex, and their takeaway from a lot of this research, which is that the cortex is arranged hierarchically, with these simple to complex cells and more and more complex things happening over time. And so today we'll get to dive into some of these models in detail, and we'll talk about convolutional neural networks.

So first, as I did with neural networks, I'd like to talk about ConvNets without all the brain stuff, so no analogies to neurons or anything like that. We'll just see what the operations are mathematically, and then we'll go into how you can interpret them in terms of neurons being connected in some kind of simulated brain tissue or something like that. So we start off with some image, say this is a 32 by 32 by 3 CIFAR-10 image, and as you'll see, convolutional neural networks operate over volumes. All of these layers in between are going to take volumes of activations and they're going to produce volumes of activations, so our intermediates will not just be vectors, like they are with neural networks, but they'll actually have these spatial dimensions of width, height, and depth that we maintain throughout the computation. Now depth here is not to be confused with the depth of a network; this is just depth as the third dimension of a volume. In this case there are three channels, so I'll say that this volume of activations is three deep, but don't get confused between the depth of a network and the depth of a volume.

And so the convolutional layer is the core building block of a convolutional network, and the way a convolutional layer works is as follows. We receive some input volume that goes into the layer, and then we'll have all these little filters in a convolutional network. Suppose for now we just have a single filter. These filters will be small spatially, so spatially this filter will be only five by five, but it always goes through the full depth of the input volume. Because the input volume has three channels, this filter will be small spatially, five by five, but it will cover the full depth of the input volume, which in this case is three.
So what we'll do now is take this filter and convolve it over this input volume, and what I mean when I say convolve is that we're going to slide it spatially through all spatial locations of this input volume, and we're going to be computing dot products along the way. So this filter here, we're eventually going to learn it; this is now our W, and we're going to learn these filters, but just think about this filter as your W's, and you're sliding it spatially through this input volume, and along the way, as we're sliding this filter through, we're computing W transpose times x plus b, where the x here is a small piece of your input volume, a small piece of size five by five by three. Okay, so really what's happening here is that at every single position, say at this particular position in the input volume, we're computing a five times five times three, which is seventy-five, dimensional dot product, and then plus one for the bias. So it's a seventy-five dimensional dot product as we're sliding the filter through.

Now when you slide this filter through this volume spatially, you'll end up carving out an entire what we call activation map of the responses of that filter at every single spatial position. So sliding this 5 by 5 filter through the 32 by 32 input volume gives us a 28 by 28 activation map of how much this filter likes every spatial position in this input volume. Where the 28 comes from, by the way, is the following; you'd have to calculate it out, and we'll go into much more detail on this, but the input is 32 by 32 and we're sliding a 5 by 5 filter through, and it turns out that there are 28 unique positions in both width and height where we can put the filter. So when you slide this filter through, one position at a time, you end up covering 28 distinct locations, and that's why we have a 28 by 28 activation map, and that is the activation of this filter at every spatial location in the input volume.

So now what we're going to do is we won't just have a single filter, but an entire filter bank. Suppose we have a second filter, a green filter, in this convolutional layer; this filter will also have the same dimensions, 5 by 5 by 3, and we're going to slide it through the volume as well, giving rise to a second activation map, and this activation map is computed independently from the first; these are all independent filters. So what we'll end up with now is an entire set of filters. Suppose, for example, that this convolutional layer has 6 filters, and that's just a hyperparameter. So if we have 6 of them, then what we'll do is slide every one of them independently through the input volume, computing dot products along the way (that's what's called the convolution operation), and that gives us this entire 28 by 28 by 6 set of activation maps that are stacked together along the depth dimension. And so what this convolutional layer has done is that it has looked at this image and it has re-represented this image, the 32 by 32 by 3, in terms of the activations of those filters on this image. So we end up with a re-representation of the image of size 28 by 28 by 6, and this will be the next input volume that's going to feed into later processing.
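To make the sliding dot product concrete, here is a minimal numpy sketch (not the assignment code, and the array names are just for illustration) that convolves a single 5x5x3 filter over a 32x32x3 input at stride 1 and produces the 28x28 activation map described above:

```python
import numpy as np

# Hypothetical input volume and a single filter (illustrative names only)
x = np.random.randn(32, 32, 3)        # input volume: width x height x depth
w = np.random.randn(5, 5, 3)          # one filter: small spatially, full input depth
b = 0.1                               # one bias per filter

out = np.zeros((28, 28))              # (32 - 5)/1 + 1 = 28 positions in each direction
for i in range(28):
    for j in range(28):
        patch = x[i:i+5, j:j+5, :]            # 5x5x3 piece of the input
        out[i, j] = np.sum(patch * w) + b     # 75-dimensional dot product plus bias

print(out.shape)                      # (28, 28) activation map for this one filter
```

With 6 such filters you would get six 28x28 maps and stack them along the depth dimension to form the 28x28x6 output volume.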
So just to give you a preview of how this will work in a convolutional network, and there are a lot of details to be worked out, but just to give you a hint of what's coming: we're going to have these convolutional layers, they're going to have some number of filters, and these filters will have some spatial extent, like say 5 by 5, and so this conv layer (conv is short for convolution) will slide them through, and we get, say, a 28 by 28 by 6 volume instead of the original volume, and then this will feed into the next convolutional layer. And of course in the middle, always, we're going to be applying activation functions as we did before with neural networks, so we perform convolutions, which are these linear operations, then we threshold all the activations at 0, and then we proceed again with another convolutional layer with its own filters, maybe of a different size. In this case, for example, we also have 5 by 5 filters, and we're going to slide them through this volume, now giving rise to a 24 by 24 by 10 volume, because suppose we have 10 filters in this second layer. One thing to notice here is that in this second layer, the green layer, whose outputs are produced in green here, these filters must have been 5 by 5 by 6, because 6 is the depth of the input volume. So those filters are always small spatially, as we'll see, usually 3 by 3 or 5 by 5 or something like that, but they always match the depth of the input volume, so 6 in the second case and 3 in the first case.

Okay, and so where this is going is that all of these filters are initialized randomly, of course, but they are going to become our parameters in the convolutional network, and what's going to happen intuitively is that we'll end up building an entire feature hierarchy. So suppose we have a convolutional network here; this is taken from slides from Yann LeCun. We'll have all these convolutional layers along the way, and what you'll find is that when you visualize (and this is an already trained convolutional network) the filters on the very first layer, say these are little 5 by 5 by 3 filters, you'll find that through backpropagation they tune themselves to become all these little blobs of edge pieces and little color pieces and blobs. So these are basically the filters in the first convolutional layer when you visualize them, and all of these filters will be looking for these things in the original image when we convolve through. And as you go into deeper and deeper convolutional layers, performing these successive operations of conv on top of conv, you'll end up, for example, with the second convolutional layer doing dot products over the outputs of the first conv layer, so it's basically putting together all of these little pieces and making larger and larger kinds of pieces out of them that these filters get excited about. So in the second convolutional layer maybe you'll get a neuron, or a filter, that gets excited about little circles in the image, and so on, because it's putting together all of these guys through these dot products, until we end up building up templates for all these different object pieces, like little honeycomb kinds of things, or circles, extended parts of objects. And all of this is initially random, and this will all just come about through training. One point to make here, by the way, is that in this very first convolutional layer I'm visualizing the raw weights, the 5 by 5 by 3 arrays, but for the deeper layers these are not the raw weights.
These are just visualizations of what those layers are responding to in the original image, because those filters are still small spatially (it looks larger, but they are operating over the outputs of the earlier filters), so that's just a subtle point I wanted to bring up. And so you end up with a picture that's very similar to maybe what Hubel and Wiesel may have imagined, where you have these simple cells looking for, say, a bar of a specific orientation somewhere specific in the image, and then we're building up a hierarchy of these features and composing them together spatially to get more and more complex responses for different kinds of objects.

To put this another way, consider this example where we have this input image, which is a small piece of a car, I believe. In this example we have, on the first convolutional layer of a trained network, a whole bunch of filters; there are 32 filters of 5 by 5 spatially, and these are example activation maps, what these look like in practice. So say we take this filter here and convolve it spatially over this image; that gives us an activation map, and in these activation maps white corresponds to high activations and black corresponds to low activations, low numbers. For example, here we have a filter that has a bit of orange stuff in it, and so when you slide this filter through the image you'll see a lot of activations in this area where the headlamp is, or something like that, because there's orange stuff there, so this filter gets happy about that part. And so these are the activation maps produced by these filters; we stack them up along the depth, and then we feed that into the next convolutional layer, which will be putting together combinations of these guys over and over again.

And so a convolutional network will basically have a layout like this, and we'll see how we arrange all of this soon, but there are basically three core building blocks: a convolutional layer; a ReLU layer, which is a non-linearity, just thresholding; pooling operations, which we'll go into in a bit; and a fully connected layer at the very end, which we'll also go into in a bit. Basically the image feeds in here, and what I'm visualizing is that every column is a volume of representation along the way through the convolutional network, and every row is an activation map. So there are ten filters, I believe, in the first conv layer, and these are their activation maps; we've then thresholded them, then we did another convolution, and then we thresholded it; then we do a pooling operation, as we'll see; and piece by piece we're creating these 3D volumes of higher and higher level abstraction, until at the end we end up with some volume, and then the class scores for the different classes of CIFAR-10, in this example, come from one large fully connected layer, meaning that, just like in a neural network, you have neurons with connections to everything in the previous volume. So this last volume here will be stretched out into a giant column, and then we do a last matrix multiply to get the scores for all the different classes. That's what convolutional layers will be doing.

Now we're going to take a closer look at the spatial dimensions, at exactly how this 28 came about and what's happening there, but before I dive into some of those details of the spacings and the sizes of filters and all this mathematics, are there any questions about just the basic organizational layout and what convolutional layers are doing?
Go ahead. How do we choose the number of filters? That's a great question for a few slides from now, so we're going to go into that, but right now I'm trying to see if the basic operation makes sense. Go ahead. Yeah, thank you, so you're saying that we're basically computing dot products and we could represent those as a fully connected layer. Yes, I think we'll go into that in a bit; basically, think about it as local connectivity to the input, still doing dot products, but not with the full image. And so yes, it's mostly a computational kind of control; also, as we'll see, it's a parameter and overfitting control as well, and we'll come back to that in a bit.

So let's look at how the spatial sizes work out. With convolutional layers we saw that we took a 32 by 32 volume, filtered it with 5 by 5 filters, and got a 28 by 28 activation map. To see how these spatial dimensions work out more generally, suppose that our input is seven by seven, and assume that we have a three by three filter that we're going to slide through this input volume; I'm showing a top-down view here, so the depth dimension is not shown, because we're only concerned with the spatial dimensions at this point. So as I slide a three by three filter through a seven by seven volume, I do a dot product here, then I do a dot product there, moving it one step at a time, until I see that there are five places where I can fit my filter. So the output volume from this conv layer in this case would be a five by five output.

Now I can try to go at a stride of two. With a stride of two I would start here; stride refers to how much you shift your filter at a time, so right now we shifted it one at a time, and if we go with a stride of two, and this will be a hyperparameter of a convolutional layer, then we go two at a time, and going two at a time gives us three unique positions where we can put the filter. So with a stride of two, which will be a hyperparameter, you would get a three by three output. And we can think more generally about what happens with arbitrary settings here; say I wanted to do a stride of three, what would the output volume size be in that case? 2.33? Yeah, so 2.33 is a reasonable answer, maybe even three is a reasonable answer, but for the purposes of this class and the notes and everything we'll do in the class, I'll just say that it doesn't fit and you can't do it. Okay, so we'll just say that it does not fit evenly, and the reason is the following. There's a very simple formula that tells you how many positions the filter fits into for a given input size. Suppose we have N by N coming in and your filters are F by F; then the formula (N - F)/stride + 1 tells you exactly how many will fit. So if the input is seven, (7 - 3)/1 + 1 gives us five when the stride is one; when the stride is 2 we get 3; and when the stride is three we get a non-integer, and that tells you that it does not fit. Now in practice, different implementations might deal with this differently: some might throw an exception, some might give you a 2 by 2 output and ignore some part of the input, or something like that, but this is just an undefined operation for our purposes. So it should fit, and the stride should always divide evenly, so that there's exactly the right number of positions to stride through.
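Here is a tiny helper, just a sketch to mirror that formula from the slide (the function name is mine), checking which strides fit for the 7 by 7 input with a 3 by 3 filter:

```python
def conv_output_size(N, F, stride):
    """Number of positions an FxF filter fits into an NxN input at a given stride."""
    out = (N - F) / stride + 1
    if out != int(out):
        raise ValueError(f"stride {stride} does not fit: ({N} - {F})/{stride} + 1 = {out:.2f}")
    return int(out)

print(conv_output_size(7, 3, 1))   # 5 -> 5x5 output
print(conv_output_size(7, 3, 2))   # 3 -> 3x3 output
try:
    conv_output_size(7, 3, 3)      # (7 - 3)/3 + 1 = 2.33, does not fit
except ValueError as e:
    print(e)
```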
Okay, one other thing I'll mention, in terms of spacing out the convolutional layers, and we'll see how this is used, is that it is very common to pad the inputs. In this case, for example, I can apply a padding of 1, so one border of zeros around my input. Padding will again be a hyperparameter of a convolutional layer; by default it's usually zero, but we can choose to pad with zeros, and let's see why that might be useful. Suppose that you have an input of 7 by 7 and you want to take a 3 by 3 filter through it with a stride of 1; if I pad with 1, what will the output be? 7, good. So the thing to notice is that if I pad my border with 1, then I get the same size output as the input that came in, and this is just a very nice, desirable property. We'll see why this is a very nice property when we actually design architectures: you really don't want to think too much about the sizes, and so very often we'll want this property where sizes are preserved spatially, and we'll be using padding quite a bit. And in general, notice that we had a 3 by 3 filter and we wanted to pad with 1 in order to preserve the size spatially, but if your filter has a different size, in general F by F, then you'll always want to zero pad with exactly (F - 1)/2. So if you have 3 by 3 filters you want to zero pad with 1, if you have 5 by 5 filters you zero pad with 2, and if you have 7 by 7 filters you zero pad with 3. In those cases, if you use exactly that zero padding with that filter size, and you use stride one, you'll always achieve the same output volume spatially, and we'll see why that's very nice in a bit.

Good, yeah. Okay, so we've zero padded, and what happens now is that instead of a seven by seven input we really ended up with a nine by nine input; we just expand the input volume with zeros on the borders, and so the first filter position I can place it in is right here, where it includes some of those zeros, and when you count it up you get seven. And one nice property of preserving the spatial sizes, as we'll see, is that if you don't do it, then what happens, as we saw in an example from a few slides ago, is that if you just take 5 by 5 filters and you start convolving on top of a 32 by 32 image, the size just shrinks over time, and this is not a nice property to have. We went from 32 to 28 to 24, and you end up with a very quick decrease of the spatial size, and as you think about convolutional networks, we want to have tens, hundreds of layers, so obviously we don't want to rapidly decrease the size of our representation. We basically want to do a lot of computation, a lot of convolutional layers, without shrinking the size of our representation, because otherwise there are fewer numbers representing the original image, so we want to keep a fixed-size representation for convenience reasons but also for representational reasons, and that's why padding is useful.

Okay, so let's see one more example, and maybe I'll take questions right after this example. Suppose we have an input volume of 32 by 32 by 3 and we want to have a convolutional layer with ten 5 by 5 filters applied at stride one and using a padding of two. What would the output volume size from this convolutional layer be in this case? So we'll have a width, height, and depth: the width is 32, the height is 32, and we have 10 filters, so the depth will be 10.
The way you compute this out is: here's the formula I've shown you before, and we have the input size, which is 32; we're using a padding of 2, and we're adding this padding on both sides, basically padding around the entire image, so you end up with twice that. So 32 plus 4 is 36, minus 5, divided by 1, plus 1, so you end up with 32 spatially, and since we have 10 filters, every one of those filters gives us a single activation map, we stack them up, and we arrive at a volume of 32 by 32 by 10 of responses.

Okay, now think about the number of parameters in this layer, the number of weights. 250? How would you arrive at 250? 10 times 5 times 5... times 3, so what is that? 750, you'll agree? Yeah, okay, so 750, but I would add that usually we also consider the biases; every filter will have its own bias as well, so 750 is correct except that you're not accounting for the plus 1 for the bias. So every filter is made up of these 5 times 5 times 3 numbers plus 1 bias, which is 76 parameters per filter, and we have 10 of them, so we end up with 760 parameters in total in this particular convolutional layer with these hyperparameters.

So as a summary of a convolutional layer, to put this down with all the parameters: you always accept a volume of W1 by H1 by D1, and a convolutional layer produces a volume of activations W2 by H2 by D2, and a convolutional layer takes four hyperparameters, K, F, S, and P: the number of filters you want, the spatial extent of these filters, the stride at which you want to apply them, and the amount of zero padding you want to do on the borders. Using these four hyperparameters we can uniquely compute the size of the output using the formula I've given you, and we also know that the depth will be the number of filters. The total number of parameters that we're introducing, as we saw in the previous example, depends on the filter size times the input depth, which gives the number of weights per filter, times K, because that's how many filters we have, plus K biases, one for each filter. And computationally, in terms of how you arrive at every single depth slice of your output volume, it's just by doing this convolution of sliding the filter through and computing dot products.

In terms of common settings of these hyperparameters that you'll see in practice, and we'll see a lot of examples of case studies by the end of this lecture: K is usually chosen as a power of two, for computational reasons, because some libraries, when they see powers of two in terms of, say, your dimensions or your number of kernels, sometimes go into a special subroutine that is very efficient in a vectorized form, and so that's why we usually use powers of two; sometimes that can actually give you an improvement in performance. And what we'll see is that the filter sizes are typically, say, 3 by 3, in which case we want to pad with one so that we preserve the spatial size, or if they're 5 by 5 filters we usually want to pad with two, or maybe we want to go at a higher stride; we'll see examples of applying all of this.
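As a quick sanity check of that summary, here is a small sketch (the function name is mine, not from the course code) that computes the output volume and parameter count from the four hyperparameters K, F, S, P, and verifies the example above:

```python
def conv_layer_shape(W1, H1, D1, K, F, S, P):
    """Output volume size and parameter count for a conv layer."""
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    D2 = K                            # one activation map per filter
    n_params = (F * F * D1 + 1) * K   # weights per filter plus 1 bias, times K filters
    return (W2, H2, D2), n_params

# The lecture's example: 32x32x3 input, ten 5x5 filters, stride 1, padding 2
print(conv_layer_shape(32, 32, 3, K=10, F=5, S=1, P=2))
# ((32, 32, 10), 760)
```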
One last case you might also see often is a filter size of one by one, applied at stride one with a padding of zero, and I'd just like to make a point here, because seeing a one by one filter in a convolution sometimes confuses people, as if it doesn't make sense; I'd like to point out that it actually does make sense in convolutional layers. The way to think about it is: suppose you're working with this example, a 56 by 56 by 64 volume coming in, and doing a one by one conv with 32 filters gives you the same sized output spatially, except with 32 in depth now. Each filter is one by one spatially, but you have to remember that we're doing these dot products through the full depth of the volume, so every filter here is actually doing a sixty-four dimensional dot product, because these filters extend through the depth of the input volume. Normally, when people cover convolutions in, say, mathematics classes, you're dealing with two-dimensional signals, and so it doesn't really make sense to talk about one-by-one convolutions; but in this case, because of the depth, we actually end up performing a sixty-four dimensional dot product down what people sometimes like to call fibers, or depth columns. Through each fiber of the input volume we do a dot product, and then later we're going to threshold that, so you're doing some reasonable computation; it's just that you're not merging in any information spatially in that particular case.
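To see that a one-by-one convolution really is doing work, here is a small numpy sketch (illustrative names only) of that 56 by 56 by 64 example: each of the 32 filters does a 64-dimensional dot product down the depth fiber at every spatial position.

```python
import numpy as np

x = np.random.randn(56, 56, 64)           # input volume
w = np.random.randn(32, 64)               # 32 filters, each of size 1x1x64
b = np.zeros(32)

# At every spatial position, dot the length-64 depth fiber with every filter
out = np.einsum('hwd,kd->hwk', x, w) + b
print(out.shape)                          # (56, 56, 32): same size spatially, 32 deep
```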
Okay, so at this point, any questions about convolutional layers? Good, thank you. I forgot to mention why the size of F is almost always odd: you'll see 1, 3, 5, 7, sometimes 11, and that's mostly it; you won't see even numbers for filter sizes, just because odd filters have this nice interpretation where 3 is the smallest size that makes sense in terms of having something on the left and something on the right of the filter's center. You can do two by two filters, I've seen some people do it, but it's not very common, so the smallest size people usually use is 3 by 3, just for convenience; it's the smallest filter size that I think makes sense, because you have something to the left of you and something to the right of you, and you're applying the filter around a specific, well-defined position in the input volume.

Go ahead. So you're asking what happens if the depth is very, very deep but the spatial dimensions are relatively small. That's perfectly fine; you might have, say, a 6 by 6 by 512 volume, which is something you will see in some of the convolutional networks we'll look at, so very deep and very small spatially, and that's okay, no problems there.

Alright, go ahead. Oh, I see, thank you. So you're asking, when we're doing padding, why not pad with something more reasonable; there are many other schemes you might come up with for how to actually pad. I don't think there's a very good explanation for that; it's just that when it's zeros, your filter doesn't really take those entries into account, it's only taking your actual numbers into account, and when it's doing dot products with zeros, those entries don't contribute to the output of that filter, so that's probably why people choose zeros: you want this concept of the filter sort of ignoring that part of the input. And when we do this padding around the full input image, you actually don't know what's outside of that image, so we pad with zeros because we don't actually know what's out there. Yeah, so you're saying that we could maybe fill that in with the immediate neighbors or something like that; that's true, but I don't think people end up doing that; it's something you might imagine, but I don't think it's common.

Good, thank you. So, are we always working with squares? The answer is yes. This came up in a very early class, where someone asked about ImageNet, where we have all these images of different sizes, rectangles and so on; we always resize to squares just by default, and we'll see how we can process non-square images later, I think, but for now just assume everything is a square.

Okay, good. Can you use filters of different sizes in the same layer? Usually that is not done, probably just as a computational convenience; it's not common to see.

Thank you. Yeah, so you're asking whether it's common to not do padding and only do convolutions over stuff that you actually have data for, so avoiding padding. For that I have to go back through a lot of slides here... where did I cover this... okay, I'm completely lost in my slides now... okay, over here. So if you were to never pad, then your conv layers would always decrease the size spatially, and that's not something you necessarily want, because if you're collapsing the size of your representation too quickly, well, computationally you want to do more computation at a fixed spatial size for a while, and you don't necessarily want to reduce it very quickly; empirically it just doesn't work as well. I've tried it, and that idea doesn't work as well.

Good, thank you. So for each layer, we're always doing the convolutions over the previous input volume, not over the original image; only the first convolutional layer has access to the image, the second convolutional layer only has access to the outputs of the first convolutional layer, and so on.

Alright, so just to give you examples of what this looks like in popular libraries, say Torch: we have these four hyperparameters that I mentioned, and if you look, for example, at the API of the spatial convolution layer in Torch, you'll see that it requires a whole bunch of parameters. It requires, for example, an input plane; the input plane is actually not one of these four hyperparameters, that's the depth of the input volume, and they need to know it when you construct the layer because they're about to initialize the filters, and when you initialize memory for these filters you need to know the input depth, because that determines how large they will be. So you pass in the input plane, how many channels there are in the input volume; the output plane is how many filters you have; kW and kH are the kernel width and height of the convolution (people use the words filter and kernel interchangeably, and you'll see me use both terms as well), so the kernel width and height are the F; dW and dH are the step of the convolution, by which they mean the stride; and pad is what padding you want. So you see those four hyperparameters, and you also have to pass in the input depth. You'll see the same in, say, Caffe; I don't want to go into this too much, but you'll see things like num_output, which is the number of filters, the kernel size, the stride, and so on. And I was going to go through Lasagne as well, but in the interest of time: they all take the same four hyperparameters, trust me, and so that's what defines a convolutional layer.
Okay, so now I'll go to the brain, or neuron, view of the convolutional layer. Instead of talking about filters, let's try to talk about how neurons are wired up in this simulated brain or something like that. So this filter here, as we're sliding it through the image, at this particular position it is computing a dot product, and that of course is very analogous to what we've seen before, where neurons compute W transpose x plus b on their inputs. So we can interpret the output of the filter at this position as just a neuron that is fixed in space over there, and it happens to be looking at a small local region in the input image, computing W transpose x plus b. Its connections are to this particular region of the image, and it doesn't have connections to the other parts of the image; it has a local connectivity pattern. We would also sometimes say that this neuron's receptive field is five by five; that's the size of the region of the input volume it is looking at, so that's just some terminology.

And what's also interesting is that as we slide the filter through, we use the same weights throughout, because it's just one filter being slid through the volume. So you can imagine that for one activation map, we think of it as a grid of neurons arranged in a 28 by 28 grid, and these neurons are all looking at their own little five by five patch in the input volume, but all of them share parameters, because it's one filter computing all the outputs. So all the neurons have the same weights W that they're using, but they're all looking at slightly different parts of the image; they all share weights, and they have local connectivity, and those are the two most important parts.

Okay, so that's neurons that share weights within one activation map, but of course we actually have several filters, so for example we have five different filters, and the full view is that you have a 3D volume of neurons arranged in this 3D spatial layout, and all of them are looking at the input volume in a local pattern and sharing parameters across space, but across depth they're all different neurons. So these five neurons here are all looking at the same patch of the input volume, but they all have different weights, and they only share those weights with their friends in the same depth slice. So that's the neuron way of looking at what this is doing: we have this 3D arrangement of neurons, they have local connectivity, and they share parameters in this particular way.

And the reason they share parameters, a nice advantage of both the local connectivity and the parameter sharing, is that this controls the capacity of the model. It makes sense that neurons across space would want to compute similar things; say they're looking for little edges, you might imagine that looking for a vertical edge in the middle of an image is just as useful as looking for a vertical edge anywhere else spatially, and so it makes sense, as a way of controlling overfitting, to share those parameters spatially. So there will be this 28 by 28 grid of neurons all looking for, say, a vertical bar at all the spatial positions, and they have local connectivity, which is partly inspired by some of the experiments of Hubel and Wiesel and so on; we don't want full global connectivity because then you'd have way too many parameters. So: small filters, and then we make the depth of the network large.
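To put a rough number on how much local connectivity and parameter sharing control capacity, here is a back-of-the-envelope sketch (my own illustration, using the 32x32x3 input and the 28x28x6 output from the earlier example):

```python
# Input volume 32x32x3, output volume 28x28x6 (six 5x5 filters, stride 1, no padding)
in_size  = 32 * 32 * 3          # 3072 input numbers
out_size = 28 * 28 * 6          # 4704 output numbers

# Fully connected: every output neuron sees every input number
fc_params = out_size * in_size                  # ~14.5 million weights (ignoring biases)

# Locally connected, no sharing: each output neuron sees only its own 5x5x3 patch
local_params = out_size * (5 * 5 * 3)           # ~353 thousand weights

# Convolutional: all neurons in a depth slice share one 5x5x3 filter (plus a bias)
conv_params = 6 * (5 * 5 * 3 + 1)               # 456 parameters

print(fc_params, local_params, conv_params)
```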
Okay, so right now I've covered the conv layers, and we know what the ReLU layers are, so there's pooling and fully connected layers to go, which I'll just talk about briefly. What the pooling layers do is this: as I mentioned, the conv layers, as we'll see in a lot of the case studies at the end of the class, won't be shrinking the volume size spatially when we do conv operations; we'll be preserving the spatial size, and the reduction of the spatial size will in many cases be handled by the pooling layers. Intuitively, what a pooling layer does is take your input volume and just squish it spatially, by doing a downsampling operation, and this downsampling happens on every single activation map independently. So say you have a 224 by 224 by 64 input; you'd end up with half of that spatially, so 112 by 112 by 64, and every one of these depth slices, these activation maps, is just downsampled independently and inserted back, so it's just a squish operation of downsampling.

Mathematically, the most common form of actually doing the downsampling, the one that turns out to work best (and you can imagine other things as well), is max pooling. Say we're doing max pooling with two by two filters and stride two; that corresponds to taking these two by two max pool windows that we stride by two, which gives you exactly a reduction by half in the spatial size of all the activation maps. We're doing max pooling, meaning taking the max over each of these two by two little pieces, so we end up with six in this block, and so on. So that's a very simple max pooling operation; it just shrinks down the size using a max. Another thing you might see is average pooling, where instead you use the average of these numbers; it turns out not to work as well as max pooling, and those are basically the two common ones you'll see.

Otherwise, the definition of a pooling layer is similar to the conv layer: we accept a volume of activations W1 by H1 by D1, we produce a volume of activations W2 by H2 by D2, and the only difference is that instead of four hyperparameters we only take two: we need to know the filter size, the size over which we do the max operation, and we need to know at which stride to go. Common settings that you'll see in practice, and there aren't too many possibilities, are two by two max pooling with stride two, so exactly a reduction by half, and sometimes three by three filters at stride two, as we'll see in a bit with some ConvNets. And notice that the output depth is preserved, so D2 is D1; the pooling layer doesn't change the depth of the volume.
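Here is a minimal numpy sketch of that 2 by 2, stride 2 max pooling operation on a single activation map (array names are just for illustration; a 4x4 map is used to keep it readable):

```python
import numpy as np

a = np.random.randn(4, 4)              # one activation map

out = np.zeros((2, 2))                 # 2x2 pooling at stride 2 halves each spatial dimension
for i in range(2):
    for j in range(2):
        out[i, j] = np.max(a[2*i:2*i+2, 2*j:2*j+2])   # max over each 2x2 block

print(out.shape)                        # (2, 2); this is applied to every depth slice independently
```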
Okay, and the fully connected layer at the very end, just very briefly: what we're doing is taking the volume at the very end, so for example in this architecture, I think the input image was 32 by 32 and we did conv, conv... (okay, is this going to come back, or... I'm just worried about the camera feed, because otherwise I'm not visible if it's not turned on; thank you). So all of these conv-ReLU steps did not change the spatial size, and the pooling layers here were two by two with stride two, so the size was halved each time. We started off at 32, downsampled to 16, to 8, to 4, and I had 10 filters here, I think, so at the end we have a 4 by 4 by 10 volume of activations after the last pooling layer. This goes into the last fully connected layer, and the fully connected layer just has these neurons for computing the class scores, and they're all fully connected to this last volume. So we have a hundred and sixty numbers here; we just stretch them out into a column and do one last matrix multiply to compute the scores, just as you saw in vanilla neural networks.
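As a small sketch of that last step (hypothetical array names, assuming the 4x4x10 volume and the 10 CIFAR-10 classes from the example):

```python
import numpy as np

last_volume = np.random.randn(4, 4, 10)    # output of the last pooling layer
W = np.random.randn(10, 160)               # fully connected weights: 10 classes x 160 inputs
b = np.zeros(10)

x = last_volume.reshape(-1)                # stretch the volume into a column of 160 numbers
scores = W.dot(x) + b                      # one last matrix multiply gives the class scores
print(scores.shape)                        # (10,)
```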
Okay, so now I'd like to show you a demo. You can tell maybe by now that I just like implementing everything in JavaScript, so I have a convolutional network in JavaScript. This is training a ConvNet in the tab right now, and we'll see a loss function here in a bit, and you can play with it a lot: you can change the learning rate and so on, you can change the network definition, because this is all running in a browser. Right now we have conv, pool, conv, pool, conv, pool, and a softmax at the end, and what you see here are all the activations in between. Right now these are all random filters; these are the filters here, they're all random, they're being convolved with the input image and you get all these activation maps, then you do thresholding with ReLU, then the pooling operation, then conv, ReLU, pool, conv, and so on, and at the end we're classifying. We've been training for roughly 10 seconds in the browser, but we're already getting roughly 20 percent on CIFAR-10, and you can see the predictions of the ConvNet are already starting to sometimes be reasonable; note that this is a bird at this point, so that's good. We can also load a network that is pretrained, so I have a pretrained network that I trained overnight in JavaScript here, and this turns out to get about 80 percent accuracy on CIFAR-10, in JavaScript, so you guys should be beating this when you actually use numpy, by the way; this is just a relatively small network getting 80 percent. You can see that the filters are basically these little, we call them Gabor-like, filters, these little edges, and you can also see that I'm visualizing not just the activation maps but also the gradients. And at the end here we're correctly classifying about 80 percent of the CIFAR-10 data, so you guys can look through this and play with it; the project is called ConvNetJS, and this is the CIFAR-10 demo for ConvNetJS. Did I use a GPU? No, this is all just JavaScript for loops; I think the convolutional layer, for example, is something like six or seven nested loops, but it turns out that the V8 engine in Chrome is actually really good, so JavaScript happens to be relatively fast today, and this is all possible, but WebGL should be implemented soon.

Okay, so at this point I'm going to go into lots of case studies for convolutional networks; we'll see all the winning convolutional networks from the ImageNet competitions, and we'll see how people actually wire up these convolutional networks in practice. Before I dive into that I can take a few more questions regarding ConvNets. Go ahead. So you're saying, instead of stacking filters one on top of another in terms of convs, you'd like to do something else instead. I see; it's just that we like to do dot products because we know how to backprop through them efficiently, and it's a very simple functional form, and so if you want to do dot products, and you want a local receptive field, and you want parameter sharing so that you're not exploding the number of weights, a lot of these design decisions make sense. I don't know what other simple mathematical functions you'd want to use instead; in practice, anything you can backprop through you could put into a ConvNet, or into a neural net in general, so we use these functions because they happen to train efficiently, and maybe partly for historical reasons, but I also have trouble thinking about what else you'd put in there. So the conv layer is the main workhorse of convolutional networks; it's doing the most important computation.

Good. Yeah, so these layers, these volumes with a very big depth: these volumes are all basically three-dimensional numpy arrays, say, or if you're working with a batched implementation then they are four-dimensional arrays, and it's these four-dimensional arrays that you operate over in your layer, producing four-dimensional or three-dimensional arrays. So the input image is just a 32 by 32 by 3 array of numbers, say, and then a conv layer transforms that into a different volume. That's not exactly your question, but I guess I'm not understanding... Oh, what am I doing in this particular visualization? Oh, I see, okay: on the first layer you can visualize the weights, because they tie directly into the image, so how do we visualize the later conv layers and their filters? Thank you. So there are a few ways; I flashed a slide on this very early on, it's like 50 slides back and I can't find it, do you remember it? No, okay, we'll find it. There are some techniques that have been developed over the last two years, and those visualizations show what the neurons are responding to, but they're not visualizing what the filters themselves are, and there's no good way to visualize that, in fact. In ConvNetJS I visualize them just by insisting on showing the raw weights, but you can't really interpret them when you look at them, because it doesn't make sense; they don't directly connect to the image.

Okay, good. I see, yeah, so you're saying that some of the filters further down respond to, say, little car wheels or something like that; when you do the pooling operation, effectively what you're doing is throwing away a tiny bit of spatial information. So once I perform the pooling operation, I know that there was a car wheel somewhere in the 2 by 2 square, or whatever that projects back to in the image, but I don't know exactly the position of that car wheel, because I've thrown away some spatial information; I don't know where in this grid this six came from. So spatially we've lost a tiny bit of information, but we still know there's a car wheel somewhere there, and we do throw away a tiny bit of information with every single pooling layer. Is this done to abstract away information from the high-dimensional input? Yeah, that's right, to some extent: we want to throw away some information at some point, because eventually you just want these class scores out, and it turns out that you can afford some invariance, sort of; you don't have to know the precise position of some of these things, and so the pooling layers throw away some of this information. At the same time, there are papers, and we'll go into this, that ask, for example, whether spatial information is preserved in ConvNets, and they study it, and in fact it is. So it almost seems like we're throwing away spatial information, but ConvNets are still very good at precisely figuring out where things are in the input image, and that seems like a paradox; we'll go into that in a bit. Well, we have too many questions; go ahead.
Is there ever depth reduction before the fully connected layer? There is; we'll go into specific use cases of sizing all of this, so anything that has to do with sizing architectures in practice, maybe save that for a few more slides. You're asking about the values of the convolution outputs, like whether the center will be higher? Yeah, thank you, so you might say that, for example, these conv filters are used to seeing stuff, and at the edges, if you put zeros there, it's kind of like some of these filters might compute a scale that is slightly off because of the zeros, where normally you'd expect some ReLU activations or something. So it's true that the statistics at the border are maybe slightly different from those in the center of the inputs, but we basically don't worry about it too much. I do agree with you that it's kind of an issue; I haven't seen anyone properly address or analyze it, but it's a bit of an issue, because the statistics at the border, if we pad with zeros, might be slightly different from what you'd see in the center.

Go ahead. Do we initialize the filters with some sensible initialization, like specific edge filters? We don't, we just train end to end; it's almost always the case that these filters with edges pop out very clearly anyway, so it doesn't really seem to be an issue.

Okay, let's look at some specific examples, because I think that's going to clarify a lot of questions as well. Let's look first at LeNet-5, from the 1990s. I'm not going to go into a huge amount of detail here; you can see this figure is from the paper. We receive a 32 by 32 image, then they had six kernels, six filters, that were 5 by 5 (they use 5 by 5 throughout this architecture), so six 5 by 5 filters, which brought it down to 28 by 28, and then they did subsampling in layer S2, which is subsampling, or max pooling; so they subsampled it, and then they applied sixteen convolutional filters, again 5 by 5 at stride 1, so they got 10 by 10, and then they subsampled that again, so you get 5 by 5, a reduction by half, and then eventually they flipped to fully connected layers. So the architecture, we would say, is conv, pool, conv, pool, conv, FC, or something like that, with 5 by 5 filters applied at stride 1 and pooling layers of 2 by 2 applied at stride 2. I don't want to dwell on this architecture too much, because we have a few more interesting architectures as well.
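Following the sizes in that description, here is a quick sketch of how the spatial dimensions flow through LeNet-5, using the (N - F)/S + 1 rule from earlier; treating the subsampling layers as 2x2, stride 2 pooling is an assumption on my part, matching the "subsampling or max pooling" reading above:

```python
def conv_out(N, F, S=1):
    return (N - F) // S + 1

N = 32                       # 32x32 input image
N = conv_out(N, 5)           # C1: six 5x5 filters, stride 1   -> 28x28x6
N = N // 2                   # S2: subsample by 2              -> 14x14x6
N = conv_out(N, 5)           # C3: sixteen 5x5 filters         -> 10x10x16
N = N // 2                   # S4: subsample by 2              -> 5x5x16
print(N)                     # 5, and then the fully connected layers follow
```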
We'll go into much more detail on AlexNet. AlexNet is the architecture from Krizhevsky et al. in 2012 that famously won the ImageNet competition by a big margin. Its inputs were 227 by 227 by 3 images, and it had this architecture here. This architecture, by the way, is cropped at the top; you can see that, and that's not my mistake, that's actually from the paper, it's cropped like that in the paper. What's funny is that when people give presentations on this architecture it's always a cropped figure, and people ask why you cropped it incorrectly, and no, it's that way in the paper. Another thing to point out is that it has this funky architecture where there are two separate streams; it's kind of hard to see, but there are two separate streams that compute similar things. That's mostly for a historical reason: at the time, GPUs did not have a lot of memory, so Alex had to split the convolutional network across two separate GPUs, so he had two separate streams of all these convolutional layers on two GPUs. But we're not going to worry about that; we're going to talk about a simplified AlexNet, supposing we did not have that concern and we just had a single stack of conv layers.

Okay, so let's look at the first convolutional layer, which in this architecture had 96 filters, 11 by 11, applied at stride 4 over this input. What would the output volume size be in this case? For the width and height I'll give you a hint: 55. And what's the depth of the output volume after the first conv? 96, awesome. So we have 55 by 55 by 96, because there are 96 kernels. What's the total number of parameters in this first convolutional layer? Maybe you can't compute that in your head; okay, that is kind of difficult to compute, so I'm just going to give it to you, I didn't expect you to do this in your head: it's 11 by 11 times 3, because every one of these filters is 11 by 11 over 3 channels, and there are 96 of them, so it's roughly 35,000 parameters.

One thing to note, by the way, is that when you look at this figure, you'll see that the input image says 224 instead of 227. Alex, when he describes this architecture in the paper, insists that the images are of size 224, but if you take 224-sized images and you put them through 11 by 11 filters at stride 4, you don't get 55, so it's improperly sized, and this is kind of a mystery that many people are always confused by when they read the paper; it does not work out. The first layer input should be 227 by 227, so we're not sure what Alex did, but when you read the paper, do not be confused just like everyone else: this first layer input should be 227 by 227, otherwise you can't get exactly a 55-sized output. I'm not sure how he padded the input or what he did, but it's not described in the paper; don't be confused by it.

Okay, so after that we do pool1, so we take this volume, and the next layer is a pooling layer with 3 by 3 filters applied at stride 2. What is the output volume size after the pooling layer? It's 27 by 27 by 96, okay, good. And how many parameters are there? Oh shoot, I gave away my... okay, this is a trick question: there are zero. There are zero parameters in the pooling layer; only the conv layers have parameters.

There are many other layers in AlexNet, so I'll put down the full architecture. We have conv, pool, norm; norm is a special normalization layer that used to be used around 2012, and we don't actually use these normalization layers anymore; they were doing a particular kind of funny normalization operation that we don't use anymore because it doesn't actually give any improvement, so I didn't cover it, but there was a normalization layer at the time. So: conv, pool, norm, conv, pool, norm, conv, conv, conv, pool, and then fully connected layers 6, 7, and 8, and so on. You can see all the different filters: we had 11 by 11 in the beginning, but then it flips to 3 by 3 filters throughout, and a lot of these conv layers, for example, apply 3 by 3 filters at stride 1 with a padding of 1. At this point the image has been reduced to a size of 13 by 13 spatially, and then we're just doing all these operations on it without resizing the spatial dimensions; we keep it at 13 by 13 for a long time, until a max pool, and then we arrive at a 6 by 6 by 256 volume at the end. Then we put a fully connected layer of 4096 neurons on that, and then one more fully connected layer, and the last few neurons. So in general, as we go through many of these architectures, you'll see that it's just this sandwich of conv, conv, pool, conv, conv, conv, sometimes pool, and then fully connected layers at the end, and then you have your scores.
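Plugging the AlexNet conv1 and pool1 numbers into the same formula (just a sanity check of the sizes and the roughly 35K parameter count quoted above):

```python
# CONV1: 96 filters of size 11x11x3 applied at stride 4 to a 227x227x3 input
out1 = (227 - 11) // 4 + 1             # 55  -> 55x55x96 output volume
conv1_params = 96 * (11 * 11 * 3)      # 34,848 weights (~35K), plus 96 biases

# POOL1: 3x3 filters at stride 2, with zero parameters
out2 = (55 - 3) // 2 + 1               # 27  -> 27x27x96 output volume

print(out1, conv1_params, out2)
```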
So basically we've transformed the original image of 224 by 224 by 3, and we've seen how it was transformed through a series of operations that are all differentiable into our class scores at the end, which in this case are 1,000 neurons at the very end, because there are 1,000 categories in ImageNet, and so those are our class scores, and we backpropagate through all of this. One note I actually forgot to mention for the conv layers: when you do the backward pass you have to be careful, because the parameters are shared; as you convolve with the filters, all these neurons share parameters, so you have to be careful to add up the gradients from all the positions where a filter was applied into that single weight blob.

Now in terms of the architecture, what was interesting about it at the time it came out: it was the first use of rectified linear units, or at least I would say it definitely popularized them, at least in the computer vision community. It used normalization layers, which are not used much anymore. It used heavy data augmentation; we'll go into that quite a bit, it really helps. What that refers to is that you take your original image, and instead of just piping that one image through, you jitter it spatially, resize it a bit, change its colors a tiny bit, and so you're hallucinating additional training data. It used dropout of 0.5, but only in the last few fully connected layers. It used a batch size of 128 and trained with SGD momentum of 0.9; the learning rate was 1e-2, and it was reduced by 10 whenever the validation accuracy, which you're monitoring, plateaued; you'd reduce the learning rate by a factor of ten, and basically you'd reduce it once or twice and by then the network has basically converged. It used L2 weight decay of 5e-4. These are rough numbers I'm showing just so that you have one concrete example of what people use in practice for these hyperparameters. The performance on the ImageNet test set at the end, for a single convolutional network, was 18.2 percent error; if you form an ensemble of seven convolutional networks you get 15.4 percent error. So if you remember from last lecture, you're supposed to get roughly two percent extra when you do an ensemble; in this case we're seeing a bit better than two percent, but you know the approximate rule. So that was AlexNet, and we're going to see some variations of it now. I guess I should also point out the number of filters: you can see that it went from 96 to 256 to 384, 384, and down to 256, and those are rough numbers of what you'll see.
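Those hyperparameters map onto the SGD-with-momentum update covered last lecture; here is a rough sketch of a single parameter update using the quoted settings (illustrative only, not AlexNet's actual training code):

```python
learning_rate = 1e-2      # reduced by 10x whenever the validation accuracy plateaus
mu = 0.9                  # SGD momentum
weight_decay = 5e-4       # L2 regularization strength

def sgd_momentum_step(W, dW, v):
    """One update of a weight array W given its gradient dW and velocity v (numpy arrays)."""
    grad = dW + weight_decay * W            # add the L2 weight decay term to the gradient
    v = mu * v - learning_rate * grad       # momentum update of the velocity
    W = W + v                               # parameter update
    return W, v
```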
So originally it was this and they changed it to that, and basically based on their experiments they found that this was a lot of the limiting factor: you want to increase the number of these filters to get better performance. Another thing I wanted to mention, by the way: sometimes you'll see people talk about extracting fc7 features from images. That actually comes from AlexNet, because that's how these layers were named, and fc7 is the last layer before the classifier. So people sometimes talk about extracting fc7 features even when they're not literally fc7 because they come from a different architecture; people still use fc7 to mean the last layer just before the classifier. What I'm also not writing here is that there are of course ReLU layers, which I'm leaving out, after every single conv layer and after every single FC layer. Okay, so that was ZFNet. I'll talk a bit about what came next after 2013. Oh, and I should mention the top-5 error went from 15.4 percent down to 14.8 percent with this architecture, and actually, Matthew Zeiler, after completing this paper where he got 14.8 percent, created a company called Clarifai, worked on this a bit more, and ended up with 11 percent, so the actual submission to the 2013 ImageNet challenge had 11 percent error. I keep saying top-5 error because that's how we evaluate the ImageNet challenge: we're not just looking at accuracy, we actually allow the classifier to make a top-5 choice of what it thinks each image is, because there are 1,000 different classes, so we give it five chances to guess every single image. The top-5 error is the fraction of images for which the correct label is not among your top five choices, so basically in your top five you did not get that image correct; at this point we're failing on roughly 14.8 percent of images in that sense. Okay, so now we'll talk about VGGNet from 2014, again to give you an idea of what these look like. In the VGGNet paper they tried several architectures, and what was nice and very interesting about this paper at the time was that instead of going kind of crazy with the architectural choices of how many filters and what sizes and so on, where there were way too many knobs in how you set the filters and their sizes, Karen Simonyan and Andrew Zisserman just said: okay, throughout the convolutional network let's commit to using 3 by 3 convolutions at stride 1 with pad 1, and 2 by 2 max pooling at stride 2, and that's all, those are the only spatial settings we're using throughout the entire ConvNet, and now it's just about how many of them you put in there. It turned out that a 16-layer model, I think, ended up performing best. I'm going to go into some of the details of this architecture; I'd just like to point out that the error went down from 11.2 percent, which was the previous year, to 7.3 percent error. I'm going to step through this architecture now, and this took me a while to create: I went through and wrote out the size of the volume at every single intermediate step, through this column D here, which ended up working best in their paper, and you'll see that it's basically a sandwich. We start off with the 224 by 224 image and we do conv with 64 filters, conv with 64 filters, max pool, conv, conv, max pool, conv, conv, conv, max pool, and so on.
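Since top-5 error comes up for every one of these results, here is a small sketch of how it can be computed (my own illustration; the random scores simply stand in for a network's outputs over 1,000 classes):

```python
import numpy as np

def top_k_error(scores, labels, k=5):
    # scores: (N, C) class scores, labels: (N,) correct class indices
    top_k = np.argsort(-scores, axis=1)[:, :k]            # indices of the k highest scores
    correct = np.any(top_k == labels[:, None], axis=1)    # is the true label among the top k?
    return 1.0 - correct.mean()

scores = np.random.randn(4, 1000)
labels = np.array([3, 17, 999, 42])
print(top_k_error(scores, labels, k=5))
```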
So what we're seeing here is 64 filters of 3 by 3, then 64 filters again, then it goes up to 128 filters, then to 256 filters, then to 512 filters. So as the spatial size is decreasing, the number of filters is increasing: spatially the volumes get smaller, but we have more depth. So now we have 512, and at one point you have a fully connected layer of 4096 neurons, as you saw in AlexNet, looking at the last pool volume. At this point the image has shrunk from 224 by 224 all the way down to 7 by 7, but now the depth is not three color channels, it's 512 activation channels. Now for the total memory and parameters: if you add up all the numbers, the total activation memory comes to about 24 million numbers, and if you're using floating point then each one is four bytes, so 24 million times 4 bytes is about 93 megabytes of intermediate activation volumes per image, just for the forward pass. That's the stuff we have to keep in memory because we need it when we do the back propagation, so just representing one image through its forward pass is 93 megabytes of RAM, and that's only for the forward pass; for the backward pass we also need the gradients, so we end up with a rough footprint of about 200 megabytes per image, just to give you an idea of the footprint of some of this computation. The total number of parameters is 140 million when you add everything up. There are some fun things to note about the asymmetry between where all the memory is and where all the parameters are in the network. In particular, you'll note that most of the memory is actually in the very first few convolutional layers: you've taken 64 kernels, gone over the image, and ended up with 224 by 224 by 64, which is 3.2 million numbers at that point, so quite a lot to maintain for that single image. So all the memory is in the early conv layers, but all the parameters are in the first fully connected layer, and the reason is that we have a volume of 7 by 7 by 512 with 4096 neurons looking at that pool layer in a fully connected manner, so you end up with 7 times 7 times 512 times 4096, not counting the biases, which is about 100 million parameters in that layer alone, the first fully connected layer. That's not ideal; you just accumulate a huge number of parameters there. What we found recently is that you can actually get rid of these fully connected layers, and you'll see in some of the later work that instead of fully connected layers people now like to use average pooling across this volume. We'll see this in the next few architectures, but what works relatively well is the following: you have a 7 by 7 by 512 volume at this point, and instead of going fully connected on it, you transform it into a single column of just 512 numbers by doing average pooling across the full spatial extent, so every single number in the output is the result of averaging 49 numbers in the input. You take those activations, just average across, and use that instead, and it works almost as well, so you can save yourself a lot of parameters; that's a trick that I think they actually used in GoogLeNet.
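To see the two numbers above side by side, here is a quick sketch (my own, not from the lecture) of the roughly 100 million parameters in the first fully connected layer versus the parameter-free global average pooling alternative over the same 7 by 7 by 512 volume:

```python
import numpy as np

H, W, D = 7, 7, 512
fc_weights = H * W * D * 4096          # first FC layer, biases ignored
print(fc_weights)                      # 102760448, about 100 million parameters

x = np.random.randn(H, W, D)           # stand-in for the last pooling volume
pooled = x.mean(axis=(0, 1))           # average the 49 numbers in each of the 512 channels
print(pooled.shape)                    # (512,) -- no parameters at all
```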
So GoogLeNet, which is my next slide here, is from 2014, and it's starting to look a little funky. VGGNet has this very simple, just linear structure, and here they try to complicate it. The key innovation in GoogLeNet was that they introduced what we call an inception module, and even though this looks kind of funky, it's just a sequence of inception modules one after another. I don't think I want to go into too much detail on the inception module right now, but I'll just point out that it was the winner of the 2014 challenge with a 6.7 percent top-5 error. If you remember, the original AlexNet was at an error of, where are we, 18.2 and then 15.4 percent, so we've come down quite a bit, and now we're at roughly 6.7 percent. For GoogLeNet you can go through this in the paper; I don't want to spend too much time on it, they have these inception layers instead of plain convolutional layers. What I'd like to point out, though, is that at the very end you still have this volume, 7 by 7 by 1024 in their case instead of 512, and instead of doing fully connected layers they do an average here, so they reduce 7 by 7 by 1024 to just 1 by 1 by 1024, and they save a huge number of parameters there. So they get away with this entire architecture working really well at only 5 million parameters, compared to VGGNet which is 140 million, or compared to AlexNet which is 60 million. You end up with far fewer parameters, slightly more compute, but much better accuracy compared to AlexNet. So those are some of the trade-offs; in this particular work they were really trying to reduce the footprint in both memory and compute. Question? Yeah, thank you, you're just trying to get a sense of the scale of these: I think 5 percent top-5 error is roughly 20 percent top-1 error, I don't know the in-betweens, but that's roughly the conversion. Okay, so that was GoogLeNet in 2014. I should mention that they won the ImageNet challenge, and VGGNet didn't win, it had a slightly higher error, but you'll see many people use VGGNet instead of GoogLeNet, sometimes because it just has a much nicer and more uniform architecture that's easier to think about, and in my own work I usually use VGGNet as a default as well, but some people also use the Inception architecture. Good question: we'll get to this maybe in a later lecture, but human-level performance would be at about 5 percent or so; these are some experiments I've actually performed and we'll go into them in later lectures, but it's around 5 percent, and if you take an ensemble of humans and you train them for a long time you can get down to maybe 2 or 3 percent or so, is my estimate, but we'll see that in a bit. Okay, so these networks are working very well, and then there's ResNet, by the way, the winner of 2015, which I think is my next slide: 3.6 percent top-5 error. So this is residual networks, from Microsoft Research Asia, work by Kaiming He and colleagues, and they did not in fact just win ImageNet 2015, they won a whole bunch of competitions at the same time, first places in a whole bunch of very important competitions, all with one architecture, and so that was very interesting, and I'm going to tell you about it now. But 3.6 percent, just wow. They had this graph, I'm taking some of his slides: you can look at the history of ILSVRC, which is short for
ImageNet Large Scale Visual Recognition Challenge. So in 2010 we had all these different hand-designed feature types, in 2012 we had AlexNet so the error jumped down quite a bit, then we saw progress with ZFNet, then with VGGNet and with GoogLeNet, and now we're at 3.57 percent, and you can see that the number of layers basically just goes up and up and up, so more layers works better, apparently. But one thing you have to be careful with, as he points out in the slides, is what happens if you just try the naive thing, what he calls plain nets; everything I've explained to you so far is what he would call plain nets, and they propose these ResNets, and I'll explain in a bit what that means. It turns out that you have to be careful with how you increase the number of layers: if you just do it naively it doesn't actually seem to work very well. Say you take CIFAR-10; what we're looking at here, the solid lines are the test error and the dashed lines are the training error. Let's look at the training error: the 20-layer model is getting very good training error, and the 56-layer model in red is getting higher training error, and that makes no sense, because the 56-layer model has much bigger capacity, so how can it possibly be doing worse on training error than a 20-layer model? What this seems to indicate is that with the current techniques we just can't seem to optimize these very deep plain networks well enough. ResNets, in contrast, as you scale up the number of layers you always see a consistent improvement in both the training and test error. So basically all of this is saying that you want a lot more layers, but not in a naive way; you want to do it in the ResNet way, and we'll see what that is in a bit. This is just a visualization of the ResNet with 152 layers; the VGGNet here is around 20 layers, so it basically dwarfs all the previous architectures, just to give you an idea of the scale. We'll go into computational considerations soon, but you need roughly two to three weeks of training time on a GPU machine, so if you guys are not getting good numbers sometimes working on your laptop, don't feel too bad, because this stuff just takes a long time to train. And even though it's about 150 layers, it's actually faster than a VGGNet, which is quite interesting. So ResNet, roughly the way it works: here we have a plain net, where you take an image and you have conv, pool, and then conv, conv, conv, conv, and so on; in a ResNet we're going to have these funny skip connections, so in addition to the path where we're strictly transforming every single volume into the next volume, we're going to have these skip connections. Oh, and I just have a note here saying that it's kind of interesting that you take a 224 by 224 image, apply a single layer of filters, and then you pool right away, so you've compressed the spatial size by a huge factor to 56 by 56 with a single conv and pool; all the other 150 layers are operating on only a 56 by 56 array spatially, which is quite amazing, that you can pack so much information into such a small volume and still do so well on ImageNet with just that amount of spatial resolution. So the way ResNet works is as follows, and we're going to go into much more detail on ResNet; we might actually give it to you for assignment 3.
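As a rough check of the early spatial reduction mentioned above, here is a small sketch; it assumes the standard ResNet stem of a 7 by 7 stride-2 convolution with pad 3 followed by a 3 by 3 stride-2 max pool with pad 1, details the lecture doesn't spell out:

```python
def output_size(n, f, stride, pad):
    return (n + 2 * pad - f) // stride + 1

after_conv = output_size(224, 7, 2, 3)         # 112 after the single 7x7 stride-2 conv
after_pool = output_size(after_conv, 3, 2, 1)  # 56 after the 3x3 stride-2 max pool
print(after_conv, after_pool)                  # the deep stack then works at 56x56 or smaller
```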
We're not sure, though; we might fold it in, that's just how recent this stuff is. ResNet was literally presented in December at a conference, so this is quite new, and we're not sure yet how well it works; we're going to see if it works for our assignment, and if it does then you guys will implement it. So the way this works is: in a normal plain neural network you have some function H of x that you're trying to compute; you would transform your representation, so you have a weight layer, you get some new representation, you transform it again, and so on, in a sequence of steps. In a residual network, your input flows in, but instead of computing how to transform your input into H of x, you're computing what to add to your input to transform it into H of x. So this two-layer neural network in here is just computing a delta on top of your original representation, instead of computing a new representation right away and discarding all the information about the original x. That's a ResNet module, and what's nice about it is that you're just computing these deltas to the x's. One way to see why this might be nice is to think about the gradient flow backwards through these layers: in the plain net, the gradient has to go through all these weights and backprop through them, but here the gradient flows in, and since you're doing addition, remember that addition just distributes the gradient equally to all of its children, so the gradient will flow through this residual part but it will also just skip over it and flow straight through the shortcut. So if you think about this architecture, these skip connections allow the gradient from all the way up here, at the 150th layer and the softmax, to skip through and right away feed into the conv layers near the image, intuitively. So basically you can right away train the layers very close to the image, the ones doing maybe some simple statistics, and then the layers in the middle just learn how to add to your signal in between and make it work out at the end. Another way this is explained is that by default this layer is doing an identity operation, and the block is just computing a delta on top of the identity, and so it just makes it nicer to optimize; that would be the rough intuition. Now, in terms of how they train ResNets, just to give you an idea again of the hyperparameters people use in practice: they use batch normalization after every single conv layer, so there are many conv layers and batch normalizations everywhere; they use the Xavier/2 initialization, and the paper that proposed this /2 initialization that I talked about for ReLU layers is actually by the same person, Kaiming He; they use SGD with momentum of 0.9; the learning rate is 0.1, and you'll recall AlexNet used an initial learning rate of 0.01; the reason you can get away with a slightly higher learning rate, I assume, is because of the batch normalization layers, which usually allow you to do that, so a bit of a bigger learning rate, and then they divide by ten any time the validation error plateaus; mini-batches of size 256; weight decay of 1e-5. So these are relatively standard, and they did not use dropout, because the batch norm paper claims that when you use batch norm there is less need for dropout.
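Here is a minimal sketch of the residual idea just described, using plain fully connected layers for brevity (the real ResNet blocks use convolutions plus batch normalization, and details like where the ReLU goes vary); the point is just that the block computes a delta F(x) and adds it back onto x, so in the backward pass the addition routes the gradient both into the weights and straight through the shortcut:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, W1, W2):
    f = relu(x.dot(W1))     # first weight layer + nonlinearity
    f = f.dot(W2)           # second weight layer: this is the "delta" F(x)
    return relu(f + x)      # H(x) = F(x) + x: add the delta onto the identity shortcut

D = 64
x = np.random.randn(D)
W1 = np.random.randn(D, D) * 0.01
W2 = np.random.randn(D, D) * 0.01
out = residual_block(x, W1, W2)   # same shape as x; near the identity when W1, W2 are small
print(out.shape)
```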
I don't have first-hand experience with this myself, but they did not use dropout there. Okay, were there questions? Question: so there are no weights in the skip step? Or rather, there are, but not at this level of abstraction; sorry, you're asking whether you'd want weights in this skip connection here. I think what's nice about ResNet is that if this is just an identity where the signal flows straight through, then when you think about the backward pass it's very clear that the gradient will skip right through all these layers and get all the way to the first conv layer, and that's a nice, desirable property. The reason these plain networks intuitively maybe don't work is that you force everything to go through all this intermediate processing, and the gradient signal just gets, I don't know, lost or diffused or something; I'm not a hundred percent sure what happens. So that seems like a nice property to have: you still put in lots of compute, but it's optional compute, in the sense that this compute can come in later during the optimization process, if you want to think about it that way. Good question: I think you're saying that adding more layers increases that, but what about increasing the spatial dimensions of your layers, does that always help? So you're asking about increasing the spatial dimensions instead of the number of layers, that is, where do you allocate your capacity when you have some finite budget of, say, GPU memory, and you want to know what the trade-offs are. It's kind of surprising that in ResNet they take a very rapid reduction in spatial size and then pack all of their capacity, in terms of memory, into getting more layers, and I don't know how thorough they were in thinking about these trade-offs or with their experiments, I have no idea, but in this case they seem to go for much more depth at the cost of some spatial resolution very early on in the network. Okay, yeah, you had a question: how do you know when the network has converged? It's never fully converged, but at some point you stop caring, because it's been two weeks and you're just tired. There are some subtleties that make it work a bit better that I have to skip through in the interest of time, like using 1 by 1 convolutions in clever ways, as we saw in GoogLeNet as well; I'll skip this, we only have about five minutes and I want to catch some questions. But I also wanted to mention, as a case study bonus, that this morning you may have seen DeepMind's Go-playing ConvNet, which was on the cover of Nature, and that's pretty impressive: they're playing Go, which is supposedly harder than chess, and it's a ConvNet that's playing Go. The way it works, if you go into the Nature paper and you go down to the details of the experiment, you'll find the section that describes the policy network. The policy network is basically trying to identify promising moves, so it looks at the Go board, which is a 19 by 19 array, puts it through lots of conv layers, and tries to figure out where you should place the next stone, black or white, on this grid. If you look through this, by now you might be able to recognize some of the keywords and piece together the architecture: the input to the policy network is 19 by 19 by 48, and it's 48 because they compute, I think, 48 different types of
features that have to do with the specific rules of Go. So it's not the raw array, which would have been much nicer, but they compute some kinds of features at every single position, giving a 19 by 19 by 48 input consisting of 48 feature planes, and then they do convolutions with kernel size 5, stride 1, apply a rectifier, and so on. Anyway, I'm just making the point that you can basically read this and distill roughly what the architecture is, so I wrote it out here: you start off with this input, they do 5 by 5 filters at stride 1 with pad 2, so they preserve the spatial size, and then they do 3 by 3 filters with pad 1, again preserving the spatial size, so it stays at 19 by 19 for a long time, and at the end they have just a little 19 by 19 array that tells you how promising every single position is to play next. That's the main policy network that does a lot of the heavy lifting in playing Go; it's just a large ConvNet, you train all those layers, and they train it on some human data and also on self-play data where the network plays against itself. So it's pretty impressive that they can do that, and it's a ConvNet doing all of this. So, summary: we have these conv and pool layers; there's a trend towards getting rid of the pool and fully connected layers, and I think we're eventually going to end up with networks that are just purely convolutional, where the way you do spatial reduction is with strided convolutions instead of max pooling; there's a trend towards smaller filter sizes like 3 by 3 but with much more depth, many more layers; and a typical architecture for now looks like conv, ReLU, sometimes pool, and then maybe fully connected layers at the end with a softmax, those are some guidelines based on what you've seen. But recent advances are, I think, challenging this paradigm, like ResNet and GoogLeNet, which play with the architecture a bit, and it's all more funky than just stacking these layers in a sequence. So that's the summary. I have two more minutes, so I can take one or two more questions if there are any. Good question: in terms of taking an image in and getting the output out, from LeNet to all these other networks, is there generally a runtime trade-off? So you're asking about computational performance. LeNet was much, much smaller; I don't have the exact ratio, but it had many fewer filters, so you'd expect it to be quite fast compared to some of these. ResNet apparently is faster than VGGNet, which is impressive. I can't remember exactly what the VGGNet forward time is for a batch of images off the top of my head; it's pretty slow. What is the forward pass? Oh, it's on the order of tens to hundreds of milliseconds on a GPU; you can run an AlexNet on your phone in almost real time at about 10 Hertz, that's roughly another way to think about it. Okay, cool, thank you.
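As a shape-level sketch of the policy network structure described above (my own illustration, not DeepMind's code; the filter counts and the number of layers are placeholders), every convolution uses stride 1 with enough padding to keep the board at 19 by 19 all the way to the final map of move scores:

```python
def same_pad_output(n, filter_size, stride=1):
    pad = (filter_size - 1) // 2          # pad 2 for 5x5 filters, pad 1 for 3x3 filters
    return (n + 2 * pad - filter_size) // stride + 1

shape = (19, 19, 48)                      # 19x19 board with 48 hand-designed feature planes
layers = [(5, 192)] + [(3, 192)] * 10 + [(3, 1)]   # 5x5 conv first, then 3x3 convs, 1 output plane
for filter_size, num_filters in layers:
    side = same_pad_output(shape[0], filter_size)
    shape = (side, side, num_filters)
print(shape)                              # (19, 19, 1): one score per board position
```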
Info
Channel: MachineLearner
Views: 32,603
Rating: 5 out of 5
Keywords: cnn, convolutional neural networks, convnet, deep learning
Id: GYGYnspV230
Length: 79min 1sec (4741 seconds)
Published: Tue Jun 14 2016