Vanishing & Exploding Gradient explained | A problem resulting from backpropagation

Captions
Hey, what's going on, everyone? In this video we're going to discuss a problem that creeps up time and time again during the training process of an artificial neural network. This is the problem of unstable gradients, and it's most popularly referred to as the vanishing gradient problem. So let's get to it.

All right, what do we already know about gradients as they pertain to neural networks? Well, for one, when we use the word "gradient" by itself, we're typically referring to the gradient of the loss function with respect to the weights in the network. We also know how this gradient is calculated using backpropagation, which we covered in our earlier videos dedicated solely to backprop. And finally, as we saw in our video that demonstrates how a neural network learns, we know what to do with this gradient after it's calculated: we update our weights with it. Well, we don't per se, but stochastic gradient descent does, with the goal of finding the optimal weight for each connection, the value that minimizes the total loss of the network.

With this understanding, we're now going to talk about the vanishing gradient problem. We'll first answer the question: what is the vanishing gradient problem, anyway? Here we'll cover the idea conceptually. We'll then move our discussion to how this problem occurs. Then, with the understanding we'll have developed up to that point, we'll discuss the problem of exploding gradients, which, as we'll see, is actually very similar to the vanishing gradient problem, so we'll be able to take what we learned about one problem and apply it to the other.

So, what is the vanishing gradient problem? We'll first talk about it generally and then get into the details in just a few moments. In general, this is a problem that causes major difficulty when training a neural network. More specifically, it involves the weights in earlier layers of the network. Recall that during training, stochastic gradient descent, or SGD, works to calculate the gradient of the loss with respect to the weights in the network. Now, sometimes, and we'll speak more about why this is in a bit, the gradient with respect to weights in earlier layers of the network becomes really small, like vanishingly small. Hence, "vanishing gradient."

Okay, what's the big deal with a small gradient? Well, once SGD calculates the gradient with respect to a particular weight, it uses this value to update that weight, so the weight gets updated in a way that is proportional to the gradient. If the gradient is vanishingly small, then this update is in turn going to be vanishingly small as well. And if the new, updated value of the weight has barely moved from its original value, then it's not really doing much for the network. The change won't carry through the network very well to help reduce the loss, because the weight has barely changed at all from where it was before the update occurred. Therefore, this weight becomes kind of stuck, never updating enough to even get close to its optimal value, which has implications for the remainder of the network to the right of this one weight and impairs the ability of the network to learn.

So now that we know what this problem is, how exactly does it occur? We know from what we learned about backpropagation that the gradient of the loss with respect to any given weight is going to be the product of some derivatives that depend on components residing later in the network. Given this, the earlier in the network a weight lives, the more terms will be needed in that product to get the gradient of the loss with respect to this weight.
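As a rough illustration of that last point, here is a minimal Python sketch, not code from the video, using invented per-layer derivative values to show how a weight that lives earlier in the network picks up more terms in the product that forms its gradient:

from math import prod

# Hypothetical derivative terms contributed by successive layers, working backward
# from the output; every value here is invented purely for illustration.
layer_terms = [0.9, 1.1, 0.8, 1.2, 0.7]
d_loss_d_output = 0.5  # also an invented value

# A weight that lives k layers back from the output picks up k of these terms.
for k in range(1, len(layer_terms) + 1):
    grad = d_loss_d_output * prod(layer_terms[:k])
    print(f"{k} term(s) in the product -> gradient is roughly {grad:.4f}")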
Now, what happens if the terms in this product, or at least some of them, are small, and by small we mean less than one? Well, the product of a bunch of numbers less than one is going to give us an even smaller number, right?

Okay, so as we mentioned earlier, we now take this result, the small number, and update our weight with it. Recall that we do this update by first multiplying this number by our learning rate, which itself is a small number, usually ranging between 0.01 and 0.0001. So the result of this product is an even smaller number. Then we subtract this number from the weight, and the final result of this difference is the value of the updated weight. Now you can see that if the gradient we obtained with respect to this weight was already really small, i.e. vanishing, then by the time we multiply it by the learning rate, the product is going to be even smaller, and when we subtract this teeny tiny number from the weight, it's barely going to move the weight at all. So, essentially, the weight gets into this kind of stuck state: not moving, not "learning," and therefore not really helping to meet the overall objective of minimizing the loss of the network. And we can see why earlier weights are subject to this problem: as we said, the earlier in the network a weight resides, the more terms are going to be included in the product that calculates its gradient, and the more terms we're multiplying together that are less than one, the quicker the gradient is going to vanish.

So now let's talk about this problem in the opposite direction: not a gradient that vanishes, but rather a gradient that explodes. Think about the conversation we just had about how the vanishing gradient problem occurs for weights early in the network due to a product of at least some relatively small values. Now think about calculating the gradient with respect to the same weight, but instead of really small terms, what if they were large, and by large we mean greater than one? Well, if we multiply a bunch of terms together that are all greater than one, we're going to get something greater than one, and perhaps even a lot greater than one. The same argument holds that we discussed for the vanishing gradient: the earlier in the network a weight lives, the more terms will be needed in the product we just mentioned, and so the more of these larger-valued terms we have being multiplied together, the larger the gradient is going to be, essentially exploding in size. With this gradient, we go through the same process to proportionally update our weight that we talked about earlier, but this time, instead of barely moving the weight, the update greatly moves it, so much so, perhaps, that the optimal value for this weight won't ever be achieved, because the amount by which the weight is updated in each epoch is just too large, and the weight continues to move further and further away from its optimal value. (There's a short numeric sketch of both cases below.)

So a main takeaway we should gain from this discussion is that the problems of vanishing gradients and exploding gradients are really two sides of a more general problem of unstable gradients. This problem was actually a huge barrier to training neural networks in the past, and now we can see why that is. In a later video, we'll talk about techniques that have been developed to combat this problem.
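Here is the tiny numeric sketch of those two update scenarios. The gradients, learning rate, and starting weight below are invented values, not numbers from the video; the point is only to show the scale of each update:

learning_rate = 0.01  # a typical small learning rate, as mentioned above
w = 0.8               # an invented starting value for the weight

# Vanishing case: ten hypothetical terms, all less than one, multiplied together.
grad_small = 0.5 * (0.25 ** 10)        # roughly 0.0000005
print(w - learning_rate * grad_small)  # still about 0.8 -- the weight barely moves

# Exploding case: ten hypothetical terms, all greater than one, multiplied together.
grad_large = 0.5 * (1.8 ** 10)         # roughly 178.5
print(w - learning_rate * grad_large)  # about -0.99 -- the update overshoots far past any nearby optimum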
For now, let's take our discussion to the comments. Let me know what you're thinking. Thanks for watching, and I'll see you next time!
Info
Channel: deeplizard
Views: 73,437
Keywords: Keras, deep learning, machine learning, artificial neural network, neural network, neural net, AI, artificial intelligence, Theano, Tensorflow, tutorial, Python, supervised learning, unsupervised learning, Sequential model, transfer learning, image classification, convolutional neural network, CNN, categorical crossentropy, relu, activation function, stochastic gradient descent, educational, education, fine-tune, data augmentation, autoencoders, clustering, batch normalization
Id: qO_NLVjD6zE
Length: 7min 43sec (463 seconds)
Published: Fri Mar 23 2018