Batch normalization | What it is and how to implement it

Captions
Wouldn't it be amazing to have a way of dealing with the unstable gradients problem in our neural networks, while also making the network train a little faster and maybe even helping with overfitting at the same time? If you want that, you're in the right place, because today we're talking about batch normalization. This video is part of the Deep Learning Explained series by AssemblyAI, a company building a state-of-the-art speech-to-text API; if you want to try it, you can get their free API token using the link in the description. We will first talk about how batch normalization works under the hood, then go over some of its benefits and why you would choose to use it, and lastly I will show you how to implement it in Python using Keras.

The first thing I want to clarify is the definition of normalization. You might have heard in a lot of places that you need to normalize your input before feeding it to your neural network, or that you need to standardize it. These terms are sometimes used interchangeably and are not strictly defined, so let's define them here so you know what I mean when I say normalization or standardization. Normalization is collapsing the input range so that all values fall between 0 and 1, whereas standardization is changing your values so that their mean equals 0 and their variance (or standard deviation) equals 1. As a small visual example: if we had values in the range 0 to 100, say 20, 70 and 90, after normalization they will lie between 0 and 1, becoming 0.2, 0.7 and 0.9, still keeping the ratios they had to each other. With standardization, we instead shift and scale the values so that their mean sits at 0, most of the values fall between -1 and 1, and the further out you go in the distribution, the fewer values you see. A small sketch of both transformations is shown below.
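As a quick illustration of these two transformations, here is a minimal NumPy sketch. It is not code from the video; the sample values and the 0-100 range come from the example above, and the names lo and hi are my own.

```python
import numpy as np

# Example values from above, known to live in the range [0, 100].
values = np.array([20.0, 70.0, 90.0])

# Normalization: collapse the known [0, 100] range down to [0, 1].
lo, hi = 0.0, 100.0
normalized = (values - lo) / (hi - lo)

# Standardization: shift and scale so that mean = 0 and std = 1.
standardized = (values - values.mean()) / values.std()

print(normalized)    # [0.2 0.7 0.9]
print(standardized)  # approximately [-1.36  0.34  1.02]
```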
So why do we need any kind of normalization for our neural networks in the first place? Take this example: we have a neural network and two input features, the number of phones someone has ever owned and the amount of money they have withdrawn from the ATM today. These features have very different ranges: one goes from 2 to 24, the other from 0 to 1,000. If you feed the network this unnormalized data, you make it very hard for it to learn the weight values that minimize the cost, and you push it toward weights that are very different from each other: the weight multiplying the number of phones will end up very different from the weight multiplying the withdrawn amount. In turn, the network can become unstable and run into the vanishing or exploding gradients problem. What we normally do to avoid this is, first of all, normalize our inputs, and also pick a suitable weight initialization technique together with an activation function that goes with it. But even if you do everything correctly, the unstable gradients problem can come back later in training. There is one solution that can save the day, and that is batch normalization.

With batch normalization, instead of only normalizing the inputs before feeding them into the network, we normalize the outputs of all the layers in the network. In the diagram on screen you can see that between each pair of layers there is a batch normalization layer. What it does is standardize the output of the previous layer, apply one small extra trick on top of it, and then pass the result on to the next layer. Let's see how that works in a small example. Say we have six data points ranging from 3 to 24: 3, 5, 8, 9, 11 and 24. The first thing batch normalization does is standardize them, in the sense we defined earlier (you can call it normalization too): it recalculates them so that their mean is 0 and their variance is 1. But that is not the end of what batch normalization does. It then scales and offsets these values by amounts that are determined during training. In the formula for this last step, we take the already standardized values, multiply them by a value called the scale, and add another value called the offset. These two values are trainable parameters, not hyperparameters: we do not set them before training starts; they are learned like any other parameter in the network, just like the weights and biases. Visually, scaling the standardized values by 2 simply multiplies them by 2, and offsetting them by 0.5 slides them along the axis they are on. So batch normalization tries to find a good transformation of these data points, one that helps the network overcome the unstable gradients problem, and in turn it actually makes the network train a little faster.

You might ask how that can be: there are so many extra calculations between the hidden layers, so how do we end up with a network that trains faster? You are right that every epoch takes a bit longer with batch normalization than it would without it. But batch normalization lets us reach the same accuracy in fewer epochs, so in the end the time added per epoch is much less than the time saved overall. And, not surprisingly, when you can reach the same accuracy with fewer epochs, you can of course keep training a little longer and maybe even achieve better performance. A minimal sketch of the batch normalization transformation itself is shown below.
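To make the standardize-then-scale-and-offset step concrete, here is a minimal NumPy sketch of the transformation described above. The six data points come from the example; the names gamma, beta and eps are my own, and in a real batch normalization layer gamma and beta would be learned during training rather than set by hand.

```python
import numpy as np

# The six example values from above.
x = np.array([3.0, 5.0, 8.0, 9.0, 11.0, 24.0])

# Step 1: standardize over the batch (mean 0, variance 1).
# eps is a small constant to avoid dividing by zero.
eps = 1e-5
x_hat = (x - x.mean()) / np.sqrt(x.var() + eps)

# Step 2: scale and offset. gamma (scale) and beta (offset) are the
# trainable parameters of a batch normalization layer; here they are
# fixed to 2.0 and 0.5 purely to illustrate the effect.
gamma, beta = 2.0, 0.5
y = gamma * x_hat + beta

print(x_hat)  # standardized values, mean ~0, variance ~1
print(y)      # the same values scaled by 2 and shifted by 0.5
```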
On top of that, because batch normalization is itself a normalization layer, you do not have to separately normalize or standardize your inputs before feeding them to your neural network: you can simply put a batch normalization layer before your first layer, and your inputs are effectively normalized. That way you keep everything in one neat package, which is another advantage of using batch normalization. Lastly, it has been observed that batch normalization reduces the need for regularization. If you remember, regularization is something we do to deal with overfitting; with batch normalization you may not even need it anymore. Of course, you should try this on your own network and see whether that is actually the case, but it has been shown to be one of the additional benefits of batch normalization.

That is all I wanted to say about how batch normalization works and what its benefits are. Now let's see how to implement it in Python using Keras. I will demonstrate it on the MNIST dataset, the classic example of handwritten digits. Here I am just importing the libraries I need and loading the dataset from Keras. This is what a data point looks like: a 28 by 28 image, which means 784 pixels, each with a value between 0 and 255. The lower the value, the darker the pixel, and the higher the value, the lighter the pixel: in this example, a darker pixel is probably around 70, a lighter one around 200, and the fully white ones are 255. What we want to do before feeding this dataset to our network is normalize it. One way of doing that is simply to divide all the pixel values by 255, so that you end up with a dataset where every value goes from 0 to 1; then you feed that data to the network you created here and train it as you wish. Let's look at what our network looks like: we have one Flatten layer that takes the 28 by 28 matrix and flattens it into one long vector of 784 values, then two hidden layers, one with 300 neurons and one with 100 neurons, and an output layer with 10 neurons.

So what if I wanted batch normalization in here? It is very simple: all you have to do is add one layer, one of the predefined Keras layers, called BatchNormalization. You just put it between the two layers where you want it to be; I can also put one here, after the second hidden layer, and now my network has batch normalization. And as I said, if you are normalizing your data manually, you can replace that step with batch normalization: you just need a batch normalization layer before the data reaches the hidden layers. By placing it right after I flatten my input, the values fed to the first Dense layer are already normalized, so I do not have to divide by 255 anymore. That is one advantage of using batch normalization: everything is in one place, and you do not have to worry about normalizing separately by hand.

There is one other detail you should pay attention to while implementing batch normalization, and that is deciding whether to put the batch normalization layer before or after the activation function. The authors of the original batch normalization paper spoke favorably about placing it before the activation function, but this is something you may want to try out and decide for yourself, for your specific model and problem. Let me show you how to do that. Normally, when you have a Dense layer, the activation function is already included in it: here we specify that it should be the ReLU activation. But if you want, you can have the activation function as a separate layer. If I do that, whatever comes out of the batch normalization layer is then fed through the activation function, and I no longer need an activation in the Dense layer itself. I can do the same for the second hidden layer, so the output of each layer goes through batch normalization first and then through the activation. This is the arrangement some people argue can work better for your network.

There is one more detail to look into here, and that is the use of biases. Remember what happens in a dense (hidden) layer: we take the input from the previous layer, call it x, multiply it by the weights, and add a bias. When the activation function is built in, we then pass this value through the activation, and that is the output of the layer. If we strip the activation function out, what gets fed to the batch normalization layer is just this weighted sum plus the bias. But what does batch normalization do? It normalizes the values, then scales them, then offsets them, and offsetting is essentially the same thing as adding a bias: you just add one value. So in the end you do not really need the biases anymore; you can let the network learn the offset values instead of having both a bias and an offset. That gives you slightly fewer parameters and helps the network train a touch faster. All you have to do is pass use_bias=False to the Dense layer, because you do not want a bias there anymore. A minimal sketch of both ways of adding batch normalization is shown below.
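Here is a minimal sketch of the two variants discussed above, written with tf.keras. This is not the exact code from the video: the layer sizes (300, 100, 10) follow the MNIST example, while the model names, optimizer and number of epochs are illustrative choices of my own.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Load MNIST; pixel values are integers in [0, 255].
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Variant 1: BatchNormalization right after Flatten (so no manual /255
# normalization is needed) and between the hidden layers, with the
# activations kept inside the Dense layers.
model_basic = keras.Sequential([
    keras.Input(shape=(28, 28)),
    layers.Flatten(),
    layers.BatchNormalization(),            # normalizes the raw inputs
    layers.Dense(300, activation="relu"),
    layers.BatchNormalization(),
    layers.Dense(100, activation="relu"),
    layers.BatchNormalization(),
    layers.Dense(10, activation="softmax"),
])

# Variant 2: batch normalization *before* the activation, with the
# activation as a separate layer and use_bias=False, since the BN
# offset (beta) plays the role of the bias.
model_bn_before_act = keras.Sequential([
    keras.Input(shape=(28, 28)),
    layers.Flatten(),
    layers.BatchNormalization(),
    layers.Dense(300, use_bias=False),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.Dense(100, use_bias=False),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.Dense(10, activation="softmax"),
])

model_bn_before_act.compile(optimizer="sgd",
                            loss="sparse_categorical_crossentropy",
                            metrics=["accuracy"])
model_bn_before_act.fit(x_train, y_train, epochs=5,
                        validation_data=(x_test, y_test))
```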
But that's it when it comes to implementing batch normalization: it is very simple, just one extra layer you can add when building your network with Keras. Just remember that you can use it as a normalization layer in place of separate manual normalization, and make sure you decide whether you want it before or after the activation function of your layers. Thanks for watching, and I hope you enjoyed this video. If you liked it, don't forget to give us a like and maybe even subscribe, because we will be here every single week. If you have any questions or comments, I would love to see them in the comment section. Also, if you would like to integrate speech-to-text capabilities into your own projects, you can grab a free API token from AssemblyAI using the link in the description. For now, have a nice day, and I'll see you around.
Info
Channel: AssemblyAI
Views: 56,707
Id: yXOMHOpbon8
Length: 13min 51sec (831 seconds)
Published: Fri Nov 05 2021