Group Normalization (Paper Explained)

Video Statistics and Information

Captions
Hi there. Today we'll look at Group Normalization by Yuxin Wu and Kaiming He of Facebook AI Research. This is essentially an engineering paper about a new normalization technique called group normalization.

So what's the issue? The issue is that pretty much everywhere in neural network training we use a technique called batch normalization. Batch normalization is a very reasonable thing and it works very, very well. The idea is this: if you have data points for a machine learning method, say your data sits somewhere down in a 2D coordinate system and you're trying to separate it from another set of points, it is often very beneficial to shift that distribution before you do anything else. First you want to center it, so that the origin sits in the middle of the data, and then you often also want to normalize it, meaning you rescale the axes so that the data looks roughly like a unit Gaussian. First comes the centering, then the normalization. We know that most machine learning methods work better if you do this; in classic machine learning it shows up as better conditioning of the data, and if you just want to learn a linear classifier you can even save a parameter, because the decision boundary can go through the origin.

In 1D: you have a distribution that is maybe very peaky and sits off to the side; you first shift it so it is centered at the origin, and then you divide by its standard deviation so that afterwards it is a unit standard deviation Gaussian, a standard normal distribution. The closer your data is to a multivariate normal distribution, the better these methods tend to work, especially when you look at how signal propagates through the layers of a deep network.

The idea then is: if it's good for the overall machine learning method that the input is normalized, it's probably also good that the input to each layer is normalized. If you look at the feature signals between layers, for example at a layer like conv5_3 somewhere in a convolutional neural network, you see that as training progresses the spread of the features gets larger and larger; you get really large positive numbers, really large negative numbers, or really small numbers. It would be better if, whenever a layer outputs a distribution that is shifted, you first transform it back into a unit normal distribution before feeding it to the next layer. That's what batch norm does: before each layer it runs a normalization procedure on the data and then hands it to the layer, and you can backpropagate through that. It's also common to learn a scale and a bias parameter that are applied after the normalization, but the important thing is that after each layer the data is normalized, so that it stays in the most comfortable regime.
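As a minimal sketch of this centering-and-scaling operation (my own NumPy illustration, not code from the paper or the video; gamma and beta stand in for the learned scale and bias just mentioned):

```python
import numpy as np

def normalize_batch(x, gamma=1.0, beta=0.0, eps=1e-5):
    # x: mini-batch of features, shape (N, D) -- N samples, D features
    mean = x.mean(axis=0, keepdims=True)       # per-feature mean over the batch
    var = x.var(axis=0, keepdims=True)         # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)    # center and scale to unit variance
    return gamma * x_hat + beta                # learned scale and offset

x = np.random.randn(32, 10) * 5.0 + 3.0        # features with a large spread and offset
y = normalize_batch(x)
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))  # roughly 0 and 1 per feature
```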
So what's the problem? The problem is that you actually need the data distribution. If you want to center the data, you need to know where it is; to figure out the mean of the distribution you would need all of the data points. If you only have a mini-batch, as we usually do in machine learning, say just four points, you can't determine the true mean, but you can estimate it from those four points, and that estimate is usually close enough. You can also see that the larger your batch is, if you sample at random, the more accurate your mean estimate is going to be. Batch sizes have gotten larger and larger over the last years, so this hasn't been a problem.

But what people do now is distributed machine learning: you draw a batch from your dataset, and the batch might be large; the dataset might be, I don't know, a million images, and the batch might still be four thousand images. But then, especially with things like TPUs, you distribute that batch across many, many machines, into per-worker batches of sometimes as few as 8 samples. And if the data isn't images but something longer, like a sequence of text or speech, you can sometimes go down to two or even one sample per unit of computation. Of course you can't do batch normalization on one sample; the mean of one sample is just that sample itself. So if you're at a small per-worker batch size, say two, you have two options: either you take the hit and live with a very bad estimate of the mean from just two or eight samples, or after each layer you do a synchronization step where every worker communicates its statistics to every other worker so you can aggregate them across the full batch. Neither is great; usually the frameworks don't do this synchronization because it's just too slow, so they go with the bad statistics.

You can see the effect right here in this graph: ImageNet classification error versus batch size for a ResNet-50 trained with 8 workers, so 8 GPUs. Just look at the blue line: with 32 images per worker, a total batch size of 8 times 32, which is 256, the error is at the state of the art for a ResNet-50. At 16 it's still pretty good, but as the per-worker batch gets smaller and smaller the error starts going up, and this is because the batch norm statistics get worse and worse.

The goal of this paper is the group norm curve here. Group norm, the paper claims, is another normalization technique that does pretty much the same thing, the centering and the scaling, but without relying on batch statistics; it does it within a single data point. That means the performance, even though it's a bit worse at the beginning for this particular example, stays constant even in the small-batch regime. This is potentially applicable whenever you have to go down to two or even one sample per worker because the individual data points are just too large, for example if you want to train something like BERT on a single GPU.
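To make the batch-size argument concrete, here is a tiny, purely illustrative NumPy experiment (mine, not the paper's): it estimates the mean of a large "dataset" from random batches of different sizes and shows how noisy the estimate gets when the batch is small.

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=3.0, scale=5.0, size=100_000)  # stand-in for the full dataset

# Average error of the batch mean for different per-worker batch sizes
for batch_size in (2, 8, 32, 256):
    errors = [abs(rng.choice(population, batch_size).mean() - population.mean())
              for _ in range(1_000)]
    print(batch_size, round(float(np.mean(errors)), 3))
# The error shrinks roughly like 1/sqrt(batch_size): tiny per-worker batches
# give very noisy statistics, which is what hurts batch norm.
```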
So what is group normalization? Group normalization, as I said, works within a single sample. There have been other methods that work within a sample instead of across the batch, and they tend not to work as well as batch norm; this paper claims that group norm is on par with batch norm for large batch sizes and better for small ones.

Here they have a schematic of what's happening. You see this cube: N is the batch axis, so these are the data points in your mini-batch, the thing that gets small when you don't have enough memory per worker. C is the channel axis; we're talking about convolutional neural networks here, but this generalizes to other networks. The channels are the independent feature maps. In a convolutional network each layer has these things called kernels, say 3-by-3 matrices, and a kernel is slid across the image; the numbers in the kernel are convolved with the pixels under it, and that sliding gives you the values of the next layer's representation. You don't have just one kernel, you have a whole stack of them, and the number of kernels gives you the number of output channels: the i-th kernel gives you the i-th channel of the next layer's representation. The kernels refer to the weights and the channels refer to the feature maps. At the input the image has three channels, red, green and blue, but the intermediate representations can have many more, as many as there were kernels in the layer right before. H and W are the height and width of the image, and in the schematic they are unrolled into a single axis.

So what does batch norm do? Batch norm takes one channel, let's say the red channel because I drew it in red, and computes the mean and the standard deviation of it. All of these methods compute a mean and a variance and then apply the same centering-and-scaling transformation; the question is only over which values you compute those statistics. Batch norm computes them across the data points: it looks at a single feature, a single channel, and asks, what is the mean and standard deviation of this channel across all the data points in the batch?
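In NumPy terms (again my own sketch, assuming the usual (N, C, H, W) layout), batch norm pools its statistics over the batch and the spatial dimensions, so you get one mean and one variance per channel:

```python
import numpy as np

def batch_norm_stats(x, eps=1e-5):
    # x: conv features of shape (N, C, H, W)
    # Batch norm: one mean/variance per channel, pooled over batch, height and width.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)   # shape (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(8, 64, 14, 14)
y = batch_norm_stats(x)
print(y.mean(axis=(0, 2, 3)).round(3))   # roughly 0 for every channel
```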
The important part is that batch norm is in fact normalizing across the data points: it looks at your batch, computes the mean and the variance in that batch, and normalizes by them. For convolutional layers this per-channel pooling makes sense because of the invariance in height and width; in a fully connected layer you would simply look at one feature at a time.

Layer norm is different. Layer norm was proposed as an alternative to batch norm with basically the same reasoning as this paper. As you can see in the schematic, it treats each data point individually: a single data point is normalized by itself, so nothing depends on the batch size anymore, but within that data point you look across all of the channels and all of the height and width. If you have an image composed of a red, a green and a blue channel, you compute the mean across all of the pixels and all of the channels; you take the whole array, call .mean(), and that gives you one number, which you subtract, and then you compute the standard deviation and divide by it. That's layer norm: the entire layer's representation of one image is normalized with a single mean and a single standard deviation.

That seems a bit drastic, and that's why instance norm did the exact opposite. It says: wait a minute, instead of normalizing across all of the features, let's go back to what batch norm does and look at each feature individually, because if one axis of the data distribution is scaled very widely we want to normalize it differently than an axis that is scaled very narrowly; but let's also only look at one data point at a time. As you can imagine, this doesn't really work in a fully connected network; it works in a convolutional network where you have feature-map channels, so you look at one individual channel of one individual data point. The red channel gets normalized by its own mean and its own standard deviation, the green channel by its own, the blue channel by its own, all within that single data point.

So layer norm drops the dependence on the batch size but normalizes across all of the features at once, and instance norm says: batch norm had a good idea in normalizing each feature individually, because the individual features might have different scales and we should account for that, but we also don't want to depend on the batch size.
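The only thing that changes between these methods is the set of axes the statistics are pooled over; continuing the same illustrative sketch:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # One mean/variance per sample, pooled over channels, height and width.
    mean = x.mean(axis=(1, 2, 3), keepdims=True)   # shape (N, 1, 1, 1)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def instance_norm(x, eps=1e-5):
    # One mean/variance per sample AND per channel, pooled over height and width only.
    mean = x.mean(axis=(2, 3), keepdims=True)      # shape (N, C, 1, 1)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(8, 64, 14, 14)
print(layer_norm(x).shape, instance_norm(x).shape)   # both stay (8, 64, 14, 14)
```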
And this is where group norm comes in: group norm is basically a mix between layer norm and instance norm. What group norm says is that layer norm and instance norm both have good ideas, since they only go across one sample. Instance norm is right that the features should be normalized individually, but it goes too far; you may not get good enough statistics, because you're normalizing each channel of each sample on its own. Layer norm, on the other hand, is too restrictive in the opposite direction: it keeps the features' scales relative to each other, so if one feature has very high variance and another very low variance, layer norm preserves that, and group norm says maybe that's not so good, maybe we should normalize the individual features individually after all.

Their argument is that there are probably some features that by their nature already have the same sort of scale and variance. They give an example: say one filter in a convolutional layer is a horizontal edge filter, low values on one side and high values on the other. If you slide that over an image it responds to edges, because edges have a high-low-high pattern (or vice versa), so it gives you very positive and very negative numbers every time you slide across an edge. Now you can imagine that on natural images, however you normalize, the responses of that filter are about the same size as the responses of a vertical edge filter. So we can expect that in a neural network there will be groups of filters that together exhibit the same scale, and we can normalize across them, like in layer norm; the more values we normalize across, the better statistics we can gather. That's exactly why instance norm doesn't work so well: it normalizes across a very small set of values and gathers very little statistics. Ideally, if we could gather good statistics, we would normalize different features differently; but since some features are almost guaranteed to behave the same, we can at least normalize across those.

Of course you don't know at the beginning which features those are. So you decide the groups a priori, at the beginning of training, and naturally the groups are just channels that are next to each other; you then hope, or really enforce, that through training those groups end up learning features of comparable scale. You're constraining the architecture to do that. That's the idea behind group norm: you build groups of channels and normalize within each group, across the channels of the group and the entire height and width, but only within a single data point. So you gain the advantage of layer norm, normalizing within a single data point, and you retain the advantage of batch norm, normalizing across individual features (or small groups of them), which is what instance norm attempted. You get the best of both worlds, sort of. That's group norm.
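As a concrete sketch of that (my own NumPy illustration, mirroring but not copying the short TensorFlow function the paper gives; the (N, C, H, W) layout, the parameter names and the G=32 value below are just assumptions for the example):

```python
import numpy as np

def group_norm(x, gamma, beta, G, eps=1e-5):
    # x: conv features of shape (N, C, H, W); gamma, beta: learned per-channel
    # scale and offset of shape (1, C, 1, 1); G: number of groups (must divide C).
    N, C, H, W = x.shape
    x = x.reshape(N, G, C // G, H, W)               # split channels into G groups
    mean = x.mean(axis=(2, 3, 4), keepdims=True)    # statistics per sample, per group
    var = x.var(axis=(2, 3, 4), keepdims=True)
    x = (x - mean) / np.sqrt(var + eps)
    x = x.reshape(N, C, H, W)                       # back to the original layout
    return x * gamma + beta

x = np.random.randn(4, 64, 14, 14)
gamma = np.ones((1, 64, 1, 1))
beta = np.zeros((1, 64, 1, 1))
y = group_norm(x, gamma, beta, G=32)
print(y.shape)   # (4, 64, 14, 14); no batch statistics anywhere
```

Setting G=1 turns this into layer norm and G=C into instance norm, which is exactly the two extremes discussed further down.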
Now let's look at what group norm actually does. They point out that basically all of these normalization techniques do the same thing, subtract the mean and divide by a standard deviation, which is what we saw; the difference is only over which set of values you collect the statistics. The paper gives group norm as a few lines of TensorFlow: you simply reshape your data so that the channel dimension C is split into a group index and an index within the group, normalize across each group together with height and width, and reshape back to the original shape, which is the same reshape-normalize-reshape pattern as the sketch above.

The cool thing is that in batch norm you have to keep track of running means and variances, because at test time you don't want the batch statistics to influence anything. You don't have that here: you can just backpropagate through this operation, you don't need to keep running averages going, and you never have to care whether you're in train or test mode. The operation is per data point, so it's just part of your model.

They do an experiment with 32 images per GPU, so a reasonably sized batch, and show that group norm and batch norm are comparable in performance. I usually don't believe the experiments in a single paper, but I think this has been replicated a couple of times by now. On the training error group norm even behaves a bit better; on the validation error it's a bit worse, but it's closer to batch norm than the other methods, instance norm and layer norm, are, so at least it beats those. Then, once you go into the small-batch regime, that's where group norm starts to shine. If you go from 32 images per GPU, the low black curve, all the way down to 2 images per GPU (and I believe they could even do 1 image per GPU with group norm, which you simply can't do with batch norm because you need batch statistics), the performance of batch norm degrades drastically, whereas with group norm the curves all sit in exactly the same place. That experiment is almost funny; they had to run it even though you know exactly how it turns out, probably because a reviewer asked for it. So batch norm beats group norm a bit at large batch sizes, but group norm pulls ahead quite drastically at small batch sizes, and that is the main advantage: you can now train models that require a small batch per worker, and in general it's a pain to keep track of those running statistics for test time anyway.

They also verify, which I find pretty cool, that the phenomenon of the responses drifting apart during training in the internal feature maps, which batch norm counteracts (with batch norm the responses actually converge during training, so the more you train the more normalized your internal features are), is exactly the same with group norm. So group norm really seems to be a replacement, not an addition; the gains don't come from a different place, it's a substitute for batch norm, though I don't think they have an experiment where they use both; maybe I'm wrong, maybe they do.
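For contrast, here is a minimal, heavily simplified sketch (my own, not from the paper) of the bookkeeping batch norm needs: during training it updates running averages of the batch statistics, and at test time it must switch to those frozen values. This is exactly the train/test-mode baggage that group norm avoids.

```python
import numpy as np

C = 64
running_mean = np.zeros((1, C, 1, 1))
running_var = np.ones((1, C, 1, 1))

def batch_norm(x, training, momentum=0.9, eps=1e-5):
    """Simplified batch norm: note the extra state and the train/test switch."""
    global running_mean, running_var
    if training:
        mean = x.mean(axis=(0, 2, 3), keepdims=True)   # statistics from the current batch
        var = x.var(axis=(0, 2, 3), keepdims=True)
        running_mean = momentum * running_mean + (1 - momentum) * mean
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mean, var = running_mean, running_var           # frozen statistics at test time
    return (x - mean) / np.sqrt(var + eps)

x = np.random.randn(2, C, 14, 14)          # a tiny per-worker batch
_ = batch_norm(x, training=True)           # noisy statistics from only 2 samples
y = batch_norm(x, training=False)          # uses the running averages instead
print(y.shape)
```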
It seems like you just have to bring some calm, some standardization, into your signal, and how exactly you do that doesn't seem to matter all that much, as long as you do it with reasonable, real overall statistics.

What I don't like about this is that you now have a new hyperparameter, the number of groups, which is rather annoying; gains like this often come partly from introducing a new hyperparameter, and at least layer norm and instance norm didn't add one. As you can see, the number of groups is not super influential, but it does have a bit of an influence on performance. The number of groups and the number of channels per group are of course inversely related: the more groups you have, the fewer channels per group. If you go to one extreme, a single group containing all the channels, you get layer norm, and the performance is quite a bit worse; if you go to the other extreme, where every channel is its own group, that's equivalent to instance norm, and the performance is again quite bad. Somewhere in the middle, at 32 groups, seems to be a good sweet spot. So again, I don't love that there is a hyperparameter you have to hit reasonably well; I guess we'll see over time whether that value just ends up always being about the same, like the beta_2 of Adam that nobody ever changes from 0.999 because it just tends to work, or whether it's really another hyperparameter you have to tune, which would be annoying.

They do a bunch of ablation studies and experiments, for example on object detection and segmentation, models where you almost have to go to small batch sizes, and on video classification, where an entire video is a lot of data and you're again forced into small batches. They run a lot of experiments, and as I said, I believe these group norm results have been replicated across the community a bunch of times now. I would definitely consider group norm if you're thinking about a distributed machine learning project in particular.

Alright, with that I hope you enjoyed this paper; I've been talking for way too long now. If you haven't already, please subscribe, like, share, comment, or whatever you feel like doing. Bye bye.
Info
Channel: Yannic Kilcher
Views: 29,915
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, paper, batchnorm, groupnorm, layer norm, group norm, batch norm, instance norm, fair, normalization, mean, standard deviation, minibatch, batch statistics, kernel, cnn, convolutional neural network
Id: l_3zj6HeWUE
Length: 29min 6sec (1746 seconds)
Published: Tue May 12 2020