Tutorial 8- Exploding Gradient Problem in Neural Network

Video Statistics and Information

Reddit Comments
  • Original Title: Tutorial 8- Exploding Gradient Problem in Neural Network
  • Author: Krish Naik
  • Description: After completing this video, you will know: What exploding gradients are and the problems they cause during training. How to know whether you may have ...
  • Youtube URL: https://www.youtube.com/watch?v=IJ9atfxFjOQ
  • Posted by u/aivideos on Sep 14, 2019 (1 upvote)
Captions
Hello, my name is Krish and welcome to my YouTube channel. Today I'm going to show you a very interesting video on the exploding gradient problem, and we will try to understand why and how the exploding gradient problem occurs during training. I have already discussed the vanishing gradient problem; if you have not seen that video, I suggest you have a look at it before watching this one.

Let us take a simple example to understand what exactly the exploding gradient problem is. Here I am taking a neural network with two hidden layers and an output layer. My input features are passed to the first hidden layer. The first weight I have assigned is w11 of layer 1. In that neuron, the weighted sum of the inputs is computed and then an activation function such as sigmoid is applied. The output of that activation, which I will call o11, is then multiplied by another weight, w21, and passed to the neuron in the next hidden layer; to keep it simple I will write that weighted sum as z. The same operation happens there, and similarly we get the value in the output layer, which is my ŷ (y hat).

Once I have ŷ, I pass it to a loss function. To reduce that loss value I use an optimizer, and the optimizer reduces the loss by updating the weights during backpropagation. Each and every weight gets updated until the loss value is reduced, and as the loss gets reduced you will see that ŷ and y, the actual value, become nearly equal.

So let us try to understand how the weight update happens, as I discussed in the previous class. Suppose I want to update w11 of the first layer. I can write the weight update formula as: w11(new) = w11(old) − learning rate × (derivative of the loss with respect to w11).

Now, a very important point: when does the exploding gradient problem happen? It happens in this derivative term. When I calculate the derivative of the loss with respect to w11, I use the chain rule. My output o31 depends on o21, o21 is affected by o11, and o11 is affected by w11. So to find this derivative I need the derivatives of all these intermediate values, and for that I use the chain rule:

dL/dw11 = (dL/do31) × (do31/do21) × (do21/do11) × (do11/dw11)

If you cancel the intermediate terms, this reduces back to dL/dw11, so this is simply the chain rule.
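Below is a minimal Python sketch of the weight-update rule and chain-rule product described above. All numbers (the old weight, learning rate, and per-term derivatives) are made up purely for illustration and are not from the video.

```python
# Weight update: w11_new = w11_old - learning_rate * dL/dw11
# where dL/dw11 is a product of chain-rule terms.
w11_old = 0.5          # hypothetical current weight
learning_rate = 0.01   # hypothetical learning rate

# dL/dw11 = (dL/do31) * (do31/do21) * (do21/do11) * (do11/dw11)
dL_do31   = 0.2        # illustrative values for each chain-rule term
do31_do21 = 0.3
do21_do11 = 0.25
do11_dw11 = 0.4
dL_dw11 = dL_do31 * do31_do21 * do21_do11 * do11_dw11

w11_new = w11_old - learning_rate * dL_dw11
print(w11_new)         # 0.49994 -- a small, well-behaved update
```

With small chain-rule terms the update barely moves the weight; the rest of the video shows what changes when the weights inside those terms are very large.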
As I said, let me consider one of these derivatives, because I will show you how the exploding gradient problem happens. It does not happen only because of the sigmoid function; the main reason the exploding gradient problem happens is because of the weights. You may be wondering how, so let me show you a good example. You know that whenever we apply a sigmoid activation function, it transforms values to the range 0 to 1, and the derivative of the sigmoid function is always between 0 and 0.25.

So let us take just one term and compute it: do21/do11. Before solving this, recall that I wrote z for the weighted sum going into this neuron: z = w21 × o11 + b2, where b2 is the bias of this neuron. After computing z, the neuron applies an activation function, which here is the sigmoid, σ(z) = 1 / (1 + e^(−z)). So I can write o21 as σ(z).

Now, how can I write do21/do11? Everything is happening through z, so by the chain rule I can write:

do21/do11 = (dσ(z)/dz) × (dz/do11)

Again, this is a simple chain rule: o21 is a function of σ(z), and o11 affects z, so I take the derivative of the activation with respect to z and multiply it by the derivative of z with respect to o11. And I know what dσ(z)/dz is: it is the derivative of the sigmoid activation, which always ranges between 0 and 0.25.
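Here is a small Python sketch of that single chain-rule factor. The weight value w21 = 500 mirrors the large weight used as an example in the video; the function names and the specific numbers are otherwise just illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)       # never exceeds 0.25 (maximum at z = 0)

# z = w21 * o11 + b2, so by the chain rule
# do21/do11 = sigmoid'(z) * w21
# The sigmoid part is at most 0.25, so the factor is bounded by 0.25 * w21.
w21 = 500.0                    # deliberately large weight, as in the video
max_factor = sigmoid_derivative(0.0) * w21
print(max_factor)              # 125.0 -- one layer's factor is already huge
```

So even though the sigmoid derivative itself is small, a large weight can make this single chain-rule factor far greater than 1.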
Because the sigmoid activation function transforms values between 0 and 1, and as I discussed in my previous session, the derivative of the sigmoid function always ranges between 0 and 0.25. So let me take the maximum value, 0.25, for dσ(z)/dz.

Now multiply that by dz/do11. Substituting z = w21 × o11 + b2 and differentiating with respect to o11, the bias term is a constant and goes to zero, the o11 factor becomes 1, and the whole thing reduces to just w21. So do21/do11 = σ'(z) × w21, which is at most 0.25 × w21.

Now understand: when will this derivative become large? I told you the reason the derivative becomes large is the weights. Suppose I initialize this weight as 500. Then 0.25 × 500 = 125. When this factor is 125, and suppose the next factor in the chain comes out to 100 because those weights are also large, and another comes out to 200, then multiplying them all together gives a very large value.

Now look at the weight update formula. Suppose my learning rate is 1. If I take the old weight minus this very large value, the new weight becomes a huge negative number, and the new weight and the old weight differ enormously. When the weights vary this much, gradient descent will never converge; it keeps jumping here and there. After each and every backpropagation step your weights change drastically compared to the old weights, and because of that you never reach the global minimum point.

That is why it is very important to understand how the weights are initialized. Here the issue is not just the sigmoid; whatever activation you use, if your weights are too large you may never converge to the global minimum. In my upcoming videos I will also show you how weights should actually be initialized. Just understand: whenever we compute a derivative, if the weights are large, the derivative comes out large. The derivative of the sigmoid is between 0 and 0.25, but because of the weights the overall derivative becomes much larger, and because of the chain rule, as I build a deeper artificial neural network, this derivative with respect to w11 becomes a very big number.
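The following toy Python snippet multiplies out per-layer factors in the style of the 125 × 100 × 200 example above and plugs the result into the update rule. Every number here is an assumption chosen only to mirror the video's illustration.

```python
# Per-layer chain-rule factors when the weights are large (illustrative).
layer_factors = [125.0, 100.0, 200.0]

gradient = 1.0
for f in layer_factors:
    gradient *= f              # 125 * 100 * 200 = 2,500,000

learning_rate = 1.0            # learning rate of 1, as in the video's example
w_old = 0.5
w_new = w_old - learning_rate * gradient
print(w_new)                   # about -2,499,999.5: the new weight jumps
                               # far away from the old one
```

A single update that moves the weight by millions cannot settle near any minimum, which is exactly the "jumping here and there" behaviour described above.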
When that derivative becomes a very big number and I apply it in the weight update formula, the old weight and the new weight will be completely different; there will be a huge gap between them. Because of that, after each and every backpropagation step the network will never reach the global minimum point. That is why weight initialization is very, very important.

That was all about the exploding gradient problem. I hope you liked this particular video. Guys, make sure you subscribe to the channel if you have not already subscribed, and please let me know if you have any questions by putting them in the comment box. I will see you all in the next video. God bless you all, have a great day ahead. Thank you.
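To round out the idea, here is a minimal sketch (not from the video) contrasting gradient descent with a normal-sized gradient against one scaled up as an exploded gradient would be, on a simple one-dimensional loss L(w) = w². The loss, scaling factors, and step count are all assumptions for illustration.

```python
def loss_grad(w):
    return 2.0 * w          # gradient of L(w) = w**2, minimum at w = 0

for scale, label in [(1.0, "normal gradient"), (150.0, "exploded gradient")]:
    w = 1.0
    for _ in range(10):
        w -= 0.1 * scale * loss_grad(w)
    print(label, "-> w after 10 steps:", w)

# With the normal gradient, w shrinks toward 0 (about 0.107 after 10 steps);
# with the exploded gradient, |w| grows without bound and never converges.
```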
Info
Channel: Krish Naik
Views: 67,250
Keywords: understanding the exploding gradient problem, what solves exploding gradient problem, exploding gradient problem sigmoid, exploding gradient problem early stopping, relu exploding gradient, exploding gradient wiki, exploding gradient sigmoid
Id: IJ9atfxFjOQ
Length: 11min 5sec (665 seconds)
Published: Tue Jul 23 2019