Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Video Statistics and Information

Captions
Hi, today we're looking at "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" by Sergey Ioffe and Christian Szegedy (probably not the best pronunciation on my part). This is a bit of an older paper, but I think it's still good to look at: it's relevant, and people often just throw batch normalization into their networks without really knowing what it's doing. So let's look at it.

What the authors point out is that a network usually has a structure like this: the loss is a composition of layers, something like ℓ = F2(F1(u, Θ1), Θ2) for a two-layer network, where the first layer F1 is applied to the input u with parameters Θ1 and the second layer F2 has parameters Θ2. Conceptually, you have your input, maybe an image, you put it through the first layer and it becomes some intermediate hidden representation h1, which the next layer turns into h2, and so on. The layers are weight matrices W1, W2, ... that each transform the image into a new representation.

The observation is that if you look at the second layer on its own, with h1 as its input, it's in the same situation the first layer was in with the raw data. It's pretty natural to see each layer of a neural network as its own transformation, taking inputs and producing outputs, basically its own little machine learning problem.

Now, what people usually do with the very first input, the raw data, is so-called whitening. Say you have 2D data that you want to fit a linear regression on, and the data is off-center and elongated. It helps to transform it by first finding the mean and subtracting it, and then dividing by the standard deviation in each direction, so that the mean ends up at the origin and the cloud is no longer so stretched out. Learning on this transformed data is much easier than on the original, simply because classifiers usually rely on inner products: if all the data sits far from the origin, the inner products are large no matter which points you pick, whereas after centering and rescaling, two random points can point in quite different directions, so the inner products actually carry information. In that sense we know that machine learning methods work better if you whiten the data first.

So the authors ask: why do we only do this at the very beginning? If each layer takes its input and learns something from it, so each layer is basically its own machine learning method, why don't we whiten the input to every single layer, or every sub-component, of a deep network? That's the basic idea here.
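(As an aside: what's described here is really per-dimension standardization rather than full whitening, which would also decorrelate the dimensions. A minimal NumPy sketch of that standardization step, on made-up data, might look like this.)

```python
import numpy as np

# Hypothetical, off-center and elongated 2D data (not from the video).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=[3.0, 0.5], size=(1000, 2))

mean = X.mean(axis=0)                 # per-dimension mean
std = X.std(axis=0)                   # per-dimension standard deviation
X_std = (X - mean) / (std + 1e-8)     # roughly zero mean, unit variance per dimension

print(X_std.mean(axis=0), X_std.std(axis=0))  # ~[0, 0] and ~[1, 1]
```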
They then discuss how this kind of thing has been tried before, what you would naively do, and why that doesn't work so well: mainly because the whitening has to be intermingled with training the network, and if you go about it naively you produce artifacts from training. That's the section where they argue you can't do this completely naively. What they actually do isn't complicated, they just do it in a smart way, so we'll jump directly to that.

They call it normalization via mini-batch statistics. Say we have a d-dimensional input x, and we only care about normalizing each dimension individually. For the k-th dimension, we subtract the mean of that dimension over a mini-batch, maybe 32 or 100 examples, and divide by the standard deviation over that same mini-batch:

    x_hat^(k) = (x^(k) - E[x^(k)]) / sqrt(Var[x^(k)])

Concretely, you compute mu_B, the empirical mean of the data at that particular layer over the mini-batch, and sigma_B^2, the empirical estimate of the variance over that mini-batch, and you transform the data by subtracting the mean and dividing by sqrt(sigma_B^2 + epsilon). The small constant epsilon is only there to avoid numerical problems from dividing by very small values.

So far this just does, per layer, what we did above with the raw data. But then they say: we want this transformation to be able to represent the identity, because sometimes the most natural thing to do to an input before handing it to the next layer is nothing at all, and if you only normalize, you won't get the identity unless the mean happens to be exactly 0 and the variance exactly 1. So they introduce two new parameters, gamma and beta, which are learned like the other parameters in the network. Gamma is a per-dimension scalar that the normalized value is multiplied by, and beta is a per-dimension scalar that is added to it; so for each dimension of the hidden representation you learn how to scale it and how to shift it after the normalization:

    y^(k) = gamma^(k) * x_hat^(k) + beta^(k)

This might seem redundant, but it's really powerful. You're basically saying: the normalized distribution is probably better than the raw one, but if the backpropagation algorithm decides that the original representation was actually useful, the network has the option of going back to it, and it also has the option of moving to some other distribution. Strictly speaking it can't reach any distribution it likes, since all it learns is a per-dimension scale and shift, but the ability to reshape the distribution with these learned scalars is still considerable.
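(Putting the pieces together: the per-dimension transform the paper describes, their Algorithm 1, can be sketched roughly as below. This is a NumPy sketch of my own, not the paper's code, and the names are made up.)

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Mini-batch normalization with learned scale and shift.

    x: mini-batch of shape (N, D); gamma, beta: learned per-dimension parameters of shape (D,).
    """
    mu = x.mean(axis=0)                      # mini-batch mean, per dimension
    var = x.var(axis=0)                      # mini-batch variance, per dimension
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize; eps guards against tiny variances
    y = gamma * x_hat + beta                 # learned scale and shift
    cache = (x, x_hat, mu, var, gamma, eps)  # kept for the backward pass
    return y, cache
```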
All right, so basically that's it, that's the whole shebang: you normalize the inputs to each layer by this formula, and then you introduce new parameters, gamma and beta, that you learn along with the rest of the network's parameters.

This has a couple of implications. The first is that the batch norm layer learns this beta, which is basically a bias parameter. It's not a fully connected layer, because the scaling is only per dimension, but the bias in a fully connected layer is also just per dimension, so beta plays exactly the role of a bias. That means if you put batch normalization after a fully connected or convolutional layer, which usually has its own bias parameter, it's hardly worth learning both: you would rather keep only the one from the batch normalization and use the convolutional or fully connected layer without a bias.

The second implication is that we have just lost deterministic test-time inference. Much like dropout, which randomly drops out nodes, we now have quantities that depend on the mini-batch: not only on the individual sample, but on which other samples happened to be randomly selected to be trained alongside it. That's awkward if you want a deterministic, reproducible result at test time. So what people do, and this is discussed in the paper, is that while training they use the mini-batch quantities we just discussed, but they also keep running averages of them: in each mini-batch you compute the mini-batch mean and the mini-batch variance, and you maintain running averages of both. At test time you plug in those running averages instead, so nothing depends on the mini-batch anymore. That's a pretty neat trick, and you can even imagine using these averaged statistics at the end of training to fine-tune the weights to those exact values. It's something you have to pay attention to: neural network libraries usually have a flag for whether the network is in train mode or in test mode, and depending on that, the batch norm layer will use the mini-batch statistics or the dataset-level statistics.
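(A rough sketch of that train/test distinction. Here the running statistics are kept as an exponential moving average, which is the common practical choice; the paper itself averages the mini-batch statistics somewhat differently, so treat this as an illustration rather than the exact procedure.)

```python
import numpy as np

class BatchNorm1d:
    """Toy batch norm layer that switches between mini-batch and running statistics."""

    def __init__(self, dim, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(dim)          # learned scale
        self.beta = np.zeros(dim)          # learned shift (plays the role of a bias)
        self.running_mean = np.zeros(dim)
        self.running_var = np.ones(dim)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training=True):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            # keep running averages for deterministic inference later
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mu
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```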
The next question is training: how do you actually train this thing? We started with our multi-layer network, f2 composed with f1: you put your input through f1, then through f2, and backpropagation there is easy. To get the derivative with respect to theta_1, you first differentiate the loss with respect to the hidden representation h1, and then differentiate h1 with respect to theta_1, and so on; you chain-rule your way through. But now in between these layers there are batch norm operations, so the authors discuss how to do backpropagation in the face of them. It pays to draw a graph of what's going on.

Here is x, the input to the layer, or rather the mini-batch x_1, ..., x_m. From the x_i we compute mu_B, the mean over the mini-batch, and from the x_i and mu_B we compute sigma_B^2, the estimate of the variance. So now we have the mean and the variance over the mini-batch. Then we take one of the x_i and use mu_B and sigma_B^2 to compute x_hat_i = (x_i - mu_B) / sqrt(sigma_B^2), leaving out the little epsilon constant for clarity (it is in the paper's calculations). Then we take the new parameter gamma together with x_hat_i, and also beta, to compute y_i, which is the final output of the layer.

Now you can read the backpropagation paths off this graph. Some loss gradient dL/dy_i comes in at y_i. If we want, for example, the derivative with respect to beta, we only backprop through the single path that goes through y_i, so in the formula for the beta gradient there should only be mention of dL/dy_i, summed over the mini-batch, and that's what we see in the paper. For gamma there is also only a path through y_i, but because y_i = gamma * x_hat_i + beta, the gradient gets multiplied by x_hat_i: the derivative of an addition like x + b with respect to b disregards x, whereas the derivative of a product x * b with respect to b keeps the x.

The interesting part comes when we want the derivative with respect to the input x_i itself, because somewhere below there is another layer and we need to pass the gradient down to it. That's not so easy, because three paths lead out of x_i: one directly into x_hat_i, one through sigma_B^2, and one through mu_B. So we have to compute derivatives with respect to sigma_B^2 and mu_B as well, and for those we need the derivative with respect to x_hat_i. Basically, the way backprop works is that you find all the paths from where you are to where you want to go, and you sum the contributions over all of them.
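(Before walking through each of those paths, here is a rough sketch of where the derivation ends up, following the backward-pass equations in the paper; it reuses the cache from the hypothetical forward sketch above.)

```python
import numpy as np

def batchnorm_backward(dout, cache):
    """Gradients of the loss w.r.t. x, gamma and beta, given dout = dL/dy."""
    x, x_hat, mu, var, gamma, eps = cache
    N = x.shape[0]
    std_inv = 1.0 / np.sqrt(var + eps)

    dbeta = dout.sum(axis=0)              # only the path through y_i
    dgamma = (dout * x_hat).sum(axis=0)   # path through y_i, picks up a factor x_hat_i
    dx_hat = dout * gamma

    # paths through the mini-batch variance and mean
    dvar = np.sum(dx_hat * (x - mu) * -0.5 * std_inv**3, axis=0)
    dmu = np.sum(-dx_hat * std_inv, axis=0) + dvar * np.mean(-2.0 * (x - mu), axis=0)

    # three paths into x_i: directly through x_hat_i, through sigma_B^2, and through mu_B
    dx = dx_hat * std_inv + dvar * 2.0 * (x - mu) / N + dmu / N
    return dx, dgamma, dbeta
```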
The easiest of these is the first one, going from y_i to x_hat_i. Then you go from x_hat_i to sigma_B^2, which just involves reversing the operations you used on the way forward, i.e. differentiating the division by the square root. Then you can use that quantity to compute the derivative with respect to mu_B; you simply go in the reverse order of how you computed things in the first place. We needed mu_B to compute sigma_B^2, so now we need the derivative with respect to sigma_B^2 in order to get the derivative with respect to mu_B. The one thing to notice is the addition there: two paths lead back to mu_B, one directly from x_hat_i and one through sigma_B^2, so its derivative has two components that get added. For x_i itself there are three paths, because three arrows leave x_i in the graph: one into x_hat_i, one into mu_B, and one into sigma_B^2, so you take all three into account and add them; in backprop you always add over all paths. (Maybe we'll do a separate video on backpropagation to really dive into how this works.) Finally they compute the derivatives with respect to gamma and beta, which we've already discussed. In essence, the whole thing is differentiable, you just have to pay attention to how you do it, and that means you can backpropagate through a network that has these batch norm layers built in, which is pretty cool.

I just want to quickly jump to the results. Keep in mind this paper is from 2015, so networks weren't that big back then and we didn't know that much about training yet. The interesting finding is that you can take drastically fewer steps to reach the same accuracy. They also plot the activations of the network over the course of training: with the plain network there are large fluctuations in the activations, especially at the beginning, and with batch norm those fluctuations are gone. The reason is pretty simple. While you train, you're learning a layered representation: x is fed through the layers, with hidden representations in between, and you're trying to learn all the parameters at once, say some W3 in the middle of the network. At the beginning of training everything is prone to shifting around a lot, so when you change W1, you change the entire distribution of the hidden representations after it. Whatever you just learned for W3 is then almost obsolete, because W3 was assuming its inputs would stay roughly the same; that's the standard assumption in machine learning, that the input distribution stays the same. That's why you see these large variances at the beginning of training, and with batch norm this tends to go away. Beyond that, they mainly show that they reach the same accuracies as other training methods but with much, much fewer steps, and that they can use much higher learning rates than other methods.
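(To make the shifting-inputs argument a bit more concrete, here is a tiny made-up demo: rescaling the first-layer weights changes the statistics of the hidden representation the next layer sees, unless that representation is normalized first.)

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 64))
W1_base = rng.normal(size=(64, 64)) / np.sqrt(64)

for scale in (1.0, 3.0):              # pretend a training step just rescaled W1
    W1 = scale * W1_base
    h1 = np.maximum(0.0, x @ W1)      # hidden representation fed to the next layer
    h1_norm = (h1 - h1.mean(axis=0)) / (h1.std(axis=0) + 1e-5)
    print(f"W1 scale {scale}: h1 std {h1.std():.2f} -> normalized std {h1_norm.std():.2f}")
```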
So that's pretty cool. I encourage you to check out the rest of the paper. Use batch norm in your networks: sometimes it works, sometimes it doesn't, strangely enough, but I guess that's just a matter of experimentation. All right, that was it for me. Bye bye.
Info
Channel: Yannic Kilcher
Views: 14,098
Rating: 4.9636364 out of 5
Keywords: machine learning, deep learning, neural networks, batch normalization, batchnorm, whitening, data, internal covariate shift, deep neural networks, deep nets, mini-batch, training
Id: OioFONrSETc
Length: 25min 44sec (1544 seconds)
Published: Sat Feb 02 2019