Tutorial 6 - Chain Rule of Differentiation with BackPropagation

Video Statistics and Information

Captions
Hello, my name is Krish Naik and welcome to my YouTube channel. Today we are going to discuss the chain rule in backpropagation. Guys, in deep learning the most important thing to understand is backpropagation; if you are able to understand backpropagation, then you will very easily understand the loss functions, the optimizers, and how they work. That is the reason I keep making different videos on backpropagation. In this video we will go a little more in depth: we will work with more derivative functions, find out more kinds of slopes, and discuss backpropagation further. Apart from that, we will also understand what exactly the chain rule is. This is the most fundamental thing in backpropagation for a multi-layered neural network, so let us begin.

Suppose these are my input features x1, x2, x3, x4, and these features are connected to two hidden layers. This is my hidden layer one and this is my hidden layer two. In hidden layer one I have three neurons, in hidden layer two I have two, and finally I have the output over here. The best way to define the weights is that for the first layer I write w11 and give it a superscript of 1, which indicates that it is a weight of the first hidden layer; similarly, for the second hidden layer you have w11 with a superscript of 2. Each function over here basically performs two steps, as I discussed in my previous videos: step one, it takes the summation of the weights multiplied by the input features and adds the bias; step two, it applies an activation function, one of the different types of activation functions I have already discussed, like sigmoid and ReLU. That whole operation I write as the function f11; we also have f12 and f13, because this is my first hidden layer and it has three neurons. The output of the first function I write as O11, the output of the second as O12, and so on; the output of the first neuron in hidden layer two is O21, and the output of the final function, which is my output layer, I write as O31, which is nothing but y hat, my predicted value. From this I generate my loss function. The loss function formula I write as the summation over i equal to 1 to n of (y minus y hat) squared; we have already discussed this particular loss function. We are trying to find out the difference between the actual value and the predicted value, and the main job of the optimizer in backpropagation is to reduce this loss. That work is basically done by the optimizer. In my next video I will be discussing different kinds of optimizers; in my previous video I have already discussed gradient descent, and I will also discuss stochastic gradient descent and other optimizers like the Adam optimizer. So just understand how this loss function gets reduced: the main thing is through backpropagation. In backpropagation, what we do is update these weights; we update the weights in each and every hidden layer, and we keep doing that until we find the suitable weights.
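To pin down the notation described in the captions above, here is a minimal LaTeX restatement; the bias term b and the generic activation symbol f are notational assumptions, while the layer/neuron indexing follows the video's convention (superscript for the layer, subscripts for the connection):

```latex
% Step 1 (weighted sum plus bias) and step 2 (activation) for neuron j in layer l
O_{lj} = f\!\left(\sum_{i} w_{ij}^{\,l}\, O_{(l-1)i} + b_{lj}\right),
\qquad O_{0i} = x_i

% Prediction and squared-error loss over n samples
\hat{y} = O_{31}, \qquad
L = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2
```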
Unless and until we get the weights for which the predicted value y hat matches the actual value y, we keep updating all these weights, and that is basically done through backpropagation.

Now let us understand how this updation actually happens; I am going into depth now on how the update happens and what the formula is. First of all, weight updation follows a particular formula which I also discussed in my previous session. Let me say I want to update this weight first. In backpropagation, we compute y hat, we get the loss value, and then we propagate backwards; while back propagating we have to update these weights. Let us see, starting from the backward side, how the weight of the last layer gets updated. I write a small equation: w11 new, where w11 carries a superscript of 3, which indicates that this weight belongs to the third layer, and I write it as w11 old minus the learning rate times the derivative of the loss with respect to w11 superscript 3. Now, about the learning rate: how do you select it? By applying some hyperparameter optimization; I will show that in a practical example in the future classes. What matters here is how we find this derivative and how it is related to the chain rule.

I am going to take this derivative of the loss with respect to this particular weight; it basically indicates the slope with respect to this weight value. I will focus on just this component, because I already know the learning rate; how do I determine this derivative? So I take the component dL/dw11 superscript 3. Now see, everybody, this weight impacts the output O31. Since it impacts O31, I can write the derivative of the loss with respect to w11 superscript 3 as the derivative of the loss with respect to the output O31, multiplied by the derivative of O31 with respect to this particular weight. And this is basically the chain rule, guys: the two expressions are equivalent, because if I cancel the dO31 terms I get back the original derivative. We have studied this chain rule in our school mathematics as well: if this is my derivative, I can also rewrite it in this particular way. So I write it as dL, that is my loss, with respect to dO31, times dO31 divided by dw11 superscript 3. With this simple chain rule, this becomes pretty much simpler.
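A short LaTeX restatement of the update rule and single-path chain rule just described (the learning-rate symbol η and the old/new labels are my notation; the rest follows the captions):

```latex
% Gradient-descent update for the output-layer weight
w_{11}^{3\,(\text{new})} = w_{11}^{3\,(\text{old})} - \eta\, \frac{\partial L}{\partial w_{11}^{3}}

% Chain rule: w_{11}^{3} affects the loss only through the output O_{31}
\frac{\partial L}{\partial w_{11}^{3}}
  = \frac{\partial L}{\partial O_{31}} \cdot \frac{\partial O_{31}}{\partial w_{11}^{3}}
```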
Now suppose I want to find out the derivative of the loss with respect to another weight; let me rub this out and write the derivative of the loss with respect to w21 with a superscript of 2. Oh sorry, this should be 3, because this weight also belongs to the last layer, so let me just write it as w21 superscript 3. How can we write it down? We know that w21 superscript 3 impacts O31, so I can write this equation as dL/dw21 superscript 3 equals dL/dO31 multiplied by dO31/dw21 superscript 3; again, with the help of the chain rule, we can write it this way. Why am I writing w21 superscript 3? Because this is my weight and it is impacting O31, so I am just building this particular chain. This is very simple, very simple for the first layer from the back.

Now just imagine I want to find out the derivative of a weight one layer further back, w11 with a superscript of 2. Again I rub all of this out and write: w11 superscript 2 new, because I am updating this weight, equals w11 superscript 2 old minus the learning rate times dL/dw11 superscript 2. Now I need to find out what exactly this derivative is. Let me take another pen so that it is very clear for you all. How do I find dL/dw11 superscript 2? Again I go backwards, guys: I know O31 gets impacted by O21, and O21 gets impacted by w11 superscript 2. So from the back end I write dL/dO31, then I multiply it by dO31/dO21, because this impacts this, and finally I multiply by dO21/dw11 superscript 2. So this is what the chain rule is: this impacts this, and this impacts this, so from the back end I find the derivative of the loss with respect to this output, then the derivative of this output with respect to the previous output, and finally the derivative of that output with respect to this particular weight. That is how it happens; this is basically the chain rule.

But still, understand one more thing. The output of this first neuron in hidden layer one, O11, does not impact the output in only this one way; we also provide a weight from it to the second node of the second hidden layer, let me call that weight w12 superscript 2 because it goes from the first neuron to the second neuron in the second layer, and that second node, whose output is O22, also impacts the output. So from this neuron I have two paths to the output layer: one going this way through O21 and one going this way through O22. How do I solve this? What I do is, after finding the derivative along the first path, I add one more derivative for the other path. So when I go one more layer back and want to update a first-layer weight such as w11 superscript 1, I write: dL/dO31 times dO31/dO21 times dO21/dO11, plus dL/dO31 times dO31/dO22 times dO22/dO11, and the whole sum multiplied by dO11/dw11 superscript 1. When we do this addition of the derivatives along all the paths, we basically get the derivative of the loss with respect to that weight.
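The same idea in LaTeX, for one path and for two paths; reading O_{11} as the output of the first neuron in hidden layer one is my interpretation of the drawing described in the captions:

```latex
% Single path: a layer-2 weight reaches the loss only through O_{21} and O_{31}
\frac{\partial L}{\partial w_{11}^{2}}
  = \frac{\partial L}{\partial O_{31}}
    \cdot \frac{\partial O_{31}}{\partial O_{21}}
    \cdot \frac{\partial O_{21}}{\partial w_{11}^{2}}

% Two paths: a layer-1 weight reaches the loss through both O_{21} and O_{22}
\frac{\partial L}{\partial w_{11}^{1}}
  = \left(
      \frac{\partial L}{\partial O_{31}} \frac{\partial O_{31}}{\partial O_{21}} \frac{\partial O_{21}}{\partial O_{11}}
    + \frac{\partial L}{\partial O_{31}} \frac{\partial O_{31}}{\partial O_{22}} \frac{\partial O_{22}}{\partial O_{11}}
    \right) \frac{\partial O_{11}}{\partial w_{11}^{1}}
```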
This is the process, guys: because there are two paths going to the output, those two paths impact the output in two different ways, so along one path I find the derivative and then, while back propagating, I also add the derivative along the other path. That is the pattern you should follow. Similarly, if you want to find the derivative of a weight deeper in the network, just work out how many paths go from it to the output. From here there are only two paths that reach the output, but from this node there may be two paths, from this node also two paths, and from a node further back there may be three or four paths. With respect to all of these paths we calculate the derivatives, and finally the full derivative gets computed. When that derivative is used in the update, it means the weights are getting updated, and when the weights are updated, your y hat is going to change. How does it change? Just remember, in the previous class we discussed gradient descent: unless and until we reach the global minimum, this process keeps going, and once we reach the global minimum, we can consider that we have the best weights for this particular problem statement. Those weights will then map any input we give to an output that closely matches the actual value, so when we compare them the loss will be very minimal. And always remember, guys, while we are training we will see the loss decreasing with each and every epoch; an epoch is one forward propagation and one backward propagation, and it may be done with respect to multiple inputs, batches of inputs, or all the inputs present in our data.

I hope you liked this particular video. This is what is called the chain rule; you need to understand it, because this is the math behind backpropagation, and if you understand this you will be able to understand backpropagation properly. Please do let me know if you have any queries. Please make sure you subscribe to the channel. I'll see you in the next video, and never give up, keep on learning. Thank you one and all.
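To tie the forward pass, the chain rule, and the weight update together, here is a minimal NumPy sketch of the 4-3-2-1 network described in the captions. This is not code from the video; the sigmoid activation, random initialisation, learning rate, and toy data are illustrative assumptions.

```python
# Minimal NumPy sketch of backpropagation via the chain rule for the
# 4-3-2-1 network drawn in the video. This is NOT the video's code;
# the sigmoid activation, random initialisation, learning rate and
# toy data are assumptions made purely for illustration.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One toy sample: 4 input features x1..x4 and a target y
x = rng.normal(size=(4, 1))
y = np.array([[1.0]])

# Weights and biases: layer 1 (4 -> 3), layer 2 (3 -> 2), layer 3 (2 -> 1)
W1, b1 = rng.normal(size=(3, 4)), np.zeros((3, 1))
W2, b2 = rng.normal(size=(2, 3)), np.zeros((2, 1))
W3, b3 = rng.normal(size=(1, 2)), np.zeros((1, 1))

lr = 0.5  # learning rate (in practice chosen via hyperparameter tuning)

for epoch in range(100):
    # Forward pass: step 1 (weighted sum + bias), step 2 (activation)
    O1 = sigmoid(W1 @ x + b1)      # hidden layer 1 outputs (O11, O12, O13)
    O2 = sigmoid(W2 @ O1 + b2)     # hidden layer 2 outputs (O21, O22)
    O3 = sigmoid(W3 @ O2 + b3)     # output layer: y_hat = O31
    loss = float(np.sum((y - O3) ** 2))
    if epoch % 20 == 0:
        print(f"epoch {epoch:3d}  loss {loss:.4f}")

    # Backward pass: apply the chain rule layer by layer
    dO3 = -2.0 * (y - O3)          # dL/dO31 for the squared-error loss
    dz3 = dO3 * O3 * (1 - O3)      # sigmoid'(z) = O3 * (1 - O3)
    dW3 = dz3 @ O2.T               # dL/dW3
    dO2 = W3.T @ dz3               # gradient flowing back to O21, O22
    dz2 = dO2 * O2 * (1 - O2)
    dW2 = dz2 @ O1.T               # dL/dW2
    dO1 = W2.T @ dz2               # sums the paths through O21 and O22
    dz1 = dO1 * O1 * (1 - O1)
    dW1 = dz1 @ x.T                # dL/dW1

    # Weight update: w_new = w_old - learning_rate * dL/dw
    W3 -= lr * dW3; b3 -= lr * dz3
    W2 -= lr * dW2; b2 -= lr * dz2
    W1 -= lr * dW1; b1 -= lr * dz1
```

Running this, the printed loss shrinks over the epochs, which is the behaviour the captions describe: the weights keep being updated through backpropagation until y hat moves close to y.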
Info
Channel: Krish Naik
Views: 85,898
Keywords: backpropagation example with numbers, softmax backpropagation, relu backpropagation, keras backpropagation, backpropagation questions, backpropagation paper, back propagation matrix multiplication, backpropagation visualization
Id: CRB266Eyjkg
Length: 13min 42sec (822 seconds)
Published: Fri Jul 19 2019