Exploding Gradient and Vanishing Gradient problem in deep neural network|Deep learning tutorial

Video Statistics and Information

Captions
Welcome to Unfold Data Science, friends. This is Aman here, and I am a data scientist. In the picture in front of you I have a neural network with an input layer, an output layer, and three hidden layers, so I am calling it a deep neural network. In general a deep neural network has many hidden layers, but to keep things simple I am keeping just three. Using this network we will try to understand two very important concepts, both from a deep learning point of view and from an interview point of view: one is known as the exploding gradient problem and the other as the vanishing gradient problem. These are must-know concepts if you want to understand how to initialize the weights in a neural network. In a neural network we first initialize the weights, and then the weight updates happen through backpropagation; that is what we discussed in the last video.

Now, what are these exploding gradient and vanishing gradient problems? Let us take an example and try to understand. Can you recollect the formula for gradient descent? If I have to optimize the weight w11, gradient descent works like this: w11_new = w11_old − η · (∂L/∂w11), where η (eta) is the learning rate and L is the loss function. I have explained this in a separate video; you can see the link. With this formula all the weight updates happen in the neural network. If you look at it carefully, how far the new weight w11_new moves from w11_old depends entirely on that second term in the equation. So if w11_old is, say, 1.5, then how much shift happens from 1.5, either forward or backward, depends on the value of that term. The term has two components: one is the learning rate, which we normally keep in the range of 0.1 to 0.001, and the other is the gradient. It is this gradient that creates the two problems of exploding and vanishing gradients. Let us try to understand how.

If you recollect a little bit about how gradients work, in my last video I was talking about the chain rule in mathematics. In the chain rule, to compute the derivative of the loss with respect to w11 we have multiple terms, term 1, term 2, term 3, which are different derivatives. Because the loss has no direct relation with w11, we create a chain of derivatives, find their values, and multiply them to get the gradient.

Now, a little bit of mathematical background. What happens when a positive number greater than 1 is multiplied by another positive number greater than 1? For example, when you multiply 2 by 2 you get 4, which is larger than both of the individual numbers. But what happens when you multiply two decimal numbers, 0.1 × 0.2? The result is 0.02, which is smaller than both numbers. The problem of vanishing and exploding gradients rests on exactly this concept. In the chain-rule multiplication, when all the terms are decimals less than one, the final value becomes smaller than every one of them; and when all the terms are greater than one, the final value becomes far larger than any of them.
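The multiplication effect described above is easy to check numerically. Below is a minimal Python sketch (not from the video; the function name chain_rule_product is purely illustrative) that multiplies a list of chain-rule factors: when every factor is below 1 the product vanishes toward zero, and when every factor is above 1 it explodes.

```python
# Minimal sketch: a chain-rule gradient is a product of local derivatives.

def chain_rule_product(factors):
    """Multiply the local derivatives along the chain, as in dL/dw11."""
    grad = 1.0
    for f in factors:
        grad *= f
    return grad

# All factors below 1: the product "vanishes" toward zero.
print(chain_rule_product([0.1, 0.2, 0.3]))   # 0.006
print(chain_rule_product([0.25] * 10))       # ~9.5e-07

# All factors above 1: the product "explodes".
print(chain_rule_product([2.0, 2.0, 2.0]))   # 8.0
print(chain_rule_product([2.0] * 10))        # 1024.0
```

The deeper the network, the more factors enter this product, so the shrinking or blowing up gets worse with depth.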
When you plug such a derivative into the gradient descent formula, the weight will never optimize properly. Let us try to understand with this diagram and data. The inputs here are x1 and x2, so let us create an input matrix X = [x1, x2]. The weights at the first level, meaning at this first layer, form a matrix I am calling W1, containing entries such as w11 and w12. Now, what goes as input to this node? It is nothing but the multiplication of X and W1; for simplicity we are not considering the bias term b for now. An activation function is then applied to this value. That activation output goes as input to the next node, where again an activation function is applied, and the result gets multiplied by the weight at that node, say w21. Similarly, as you keep moving forward, more and more weights get involved. So when we come back and try to optimize w11, in that chain-rule derivative we will see the effect of many weights coming into the picture.

Now, what happens when your weights are small? Here in W1 we have both w11 and w12, and here we have w21. When the weights are small, their product becomes even smaller; when the weights are large, their product explodes and becomes very large. And what happens when we try to optimize with such a number? It goes into the update rule w_new = w_old − η · (∂L/∂w). When this gradient term is too high, which is the exploding gradient problem, w_new keeps oscillating around the minimum and never settles on it. For example, if the minimum is here, w_old is on one side, w_new jumps to the other side, then it might jump back again, because the step size is so big that we never give w the opportunity to stop at the minimum. This problem, when your gradient is too high, is called the exploding gradient problem. And when the derivative becomes very low, there is hardly any difference between w_new and w_old, because the gradient term itself is tiny and the weight barely moves; that problem is called the vanishing gradient problem.

Now the whole thing boils down to weight initialization. What weights should we give? We are saying that if the weights are positive numbers greater than one, the gradient will explode, and if they are positive numbers between zero and one, the gradient will shrink, which is the other problem. So what should the optimal initial weights be? There are different ways to initialize a neural network, which I will discuss in my next video. For now, if you have any doubts on this topic, what is the vanishing gradient problem and what is the exploding gradient problem, write me a comment and I will definitely respond to you. I'll see you all in the next video. Till then, stay safe and take care.
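To illustrate the two failure modes of the update rule described above, here is a minimal Python sketch (not from the video; the toy loss and all names are assumptions). With a very large gradient the weight overshoots the minimum at w = 3 and oscillates with growing amplitude; with a near-zero gradient the weight barely moves from its starting value.

```python
# Minimal sketch of w_new = w_old - eta * gradient on a toy loss.

def loss_gradient(w, scale):
    """Gradient of the toy loss L(w) = scale * (w - 3)^2 / 2, minimum at w = 3.
    `scale` stands in for the product of chain-rule factors."""
    return scale * (w - 3.0)

def train(w, eta, scale, steps=5):
    history = [w]
    for _ in range(steps):
        w = w - eta * loss_gradient(w, scale)  # gradient descent update
        history.append(round(w, 4))
    return history

eta = 0.1
# Exploding: huge gradient, the weight jumps past w = 3 and oscillates ever wider.
print("exploding:", train(w=1.5, eta=eta, scale=25.0))
# Vanishing: near-zero gradient, the weight barely moves away from 1.5.
print("vanishing:", train(w=1.5, eta=eta, scale=1e-4))
```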
Info
Channel: Unfold Data Science
Views: 2,508
Keywords: Exploding Gradient and Vanishing Gradient problem in deep neural network|, deep learning tutorial, exploding gradient, exploding gradient problem, exploding gradients, exploding gradient and vanishing gradient, vanishing gradient, vanishing gradient problem, vanishing gradient neural network, vanishing gradient problem rnn, exploding gradient example, unfold data science
Id: IBODsB4q8cQ
Length: 8min 41sec (521 seconds)
Published: Tue May 26 2020