Tutorial 12- Stochastic Gradient Descent vs Gradient Descent

Video Statistics and Information

  • Original Title: Tutorial 12- Stochastic Gradient Descent vs Gradient Descent
  • Author: Krish Naik
  • Description: Below are the various playlists created on ML, Data Science and Deep Learning. Please subscribe and support the channel. Happy Learning! Deep Learning ...
  • Youtube URL: https://www.youtube.com/watch?v=FpDsDn-fBKA
Captions
Hello all, my name is Krish and welcome to my YouTube channel. Today we will be discussing stochastic gradient descent. In my previous video in the deep learning playlist I discussed gradient descent; now we will try to understand what exactly stochastic gradient descent is and what the exact difference is between SGD and gradient descent.

Let me take the backpropagation that I discussed in my previous video. There, the weight updation formula was given by

w_new = w_old − learning_rate × ∂L/∂w_old

Now remember, in my previous video about gradient descent I was saying that the loss is a function of the weight with a bowl-like shape, with the weight along the horizontal axis. Initially the weight gets initialized at some point on this curve. At that point we find the derivative, that is the slope, and from the slope we understand whether we have to increase or decrease the weight. If the point lies on one side of the minimum, the right-hand side of the tangent line points downwards, so the slope is negative; if it lies on the other side, the right-hand side of the tangent line points upwards, so the slope is positive. That is the shortcut I usually use for explaining it, but by default, when you compute the derivative at those points, one will be negative and the other positive. Now remember, I have to reach the lowest point of the curve, and that point is the global minimum. This part was about gradient descent; after this I will make you understand the difference between SGD and gradient descent.

During backpropagation, suppose the weight is initialized at some random point; then we make sure it is propagated, that it moves towards the global minimum, which basically means we keep updating the weight until we reach that point. Unless and until we reach the global minimum, we have not got the exact weight that is required to solve the problem. So this was gradient descent, and this is how the weights were getting updated.

Now let us go back to the weight updation formula. The most important thing in this formula is the derivative of the loss with respect to w_old. Suppose, in order to compute this derivative, I consider all the data points of my data set, all n points; say my data set has a thousand records, which may also be called data points.
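To make this concrete, here is a minimal NumPy sketch of the weight updation rule with full-batch gradient descent on a one-feature linear model with an MSE loss; the data, learning rate and epoch count are illustrative assumptions, not something shown in the video.

```python
import numpy as np

# Illustrative data for a one-feature linear model (not from the video).
rng = np.random.default_rng(0)
X = rng.normal(size=1000)                 # n = 1000 records / data points
y = 3.0 * X + rng.normal(scale=0.1, size=1000)

w_old = 0.0                               # initial weight
learning_rate = 0.1

for epoch in range(100):
    y_hat = w_old * X                     # predictions with the current weight
    # Derivative of the MSE loss with respect to w_old, computed over ALL n points
    dL_dw_old = (-2.0 / len(X)) * np.sum((y - y_hat) * X)
    # Weight updation formula: w_new = w_old - learning_rate * dL/dw_old
    w_old = w_old - learning_rate * dL_dw_old

print(w_old)                              # converges close to the true slope of 3.0
```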
If I consider all thousand data points whenever I solve for this derivative of the loss with respect to the weight, then the technique used for the convergence is called gradient descent.

Now suppose I try to find this derivative of the loss considering only one data point at a time. That basically means at every iteration I take just one record, find ŷ and, in the backpropagation, update the weight one point at a time. In that case I call it SGD, that is stochastic gradient descent.

The next case is when I consider k data points, where k is always less than n, the total number of data points. In that case I call it mini-batch stochastic gradient descent. That means, suppose my k value is 100: I send 100 data points, capture ŷ for each and every one of them, and if I want to find the loss, the formula looks like

Loss = Σ (i = 1 to k) (yᵢ − ŷᵢ)²

and this is the loss I try to reduce. Remember, in gradient descent this formula changes to Σ (i = 1 to n) (yᵢ − ŷᵢ)². In the case of stochastic gradient descent the summation goes away, because there I update the weights for each and every data point, so the formula just becomes (y − ŷ)². So that is the difference: the first technique is mini-batch SGD, the second is gradient descent and the third is SGD.

Nowadays in many neural networks, let it be a CNN, the technique that is used is mini-batch SGD. Stochastic gradient descent, if you remember the linear regression problem statement, is what is used in machine learning; in the library present in scikit-learn, each and every data point is considered in order to update the weight.

Now why do we require mini-batch SGD? Just imagine we are solving with gradient descent and the data set has a million records. Just think of the amount of computational power required to load the whole one million records. To prevent that, we use mini-batch stochastic gradient descent. In the practical application, when I show you how to solve this problem using Keras, you will see that we specify a batch size that says how many data points will be used to update the weights, and then you will find the exact difference. But always remember: with gradient descent, when there is a huge number of records, say a million, it becomes computationally heavy and requires a lot of resources.
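As a rough sketch of the difference, the loop below updates a single weight with pure SGD (one record per update) and with mini-batch SGD (k records per update); the data, learning rate and k = 100 are assumptions for illustration. In a Keras model the same choice shows up as the batch_size argument of model.fit.

```python
import numpy as np

# Illustrative data, as in the earlier sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=1000)
y = 3.0 * X + rng.normal(scale=0.1, size=1000)

def grad(w, xb, yb):
    """dL/dw for the MSE loss on a batch, averaged over the batch."""
    return (-2.0 / len(xb)) * np.sum((yb - w * xb) * xb)

lr, k = 0.01, 100
w_sgd, w_mini = 0.0, 0.0

for epoch in range(10):
    order = rng.permutation(len(X))
    # SGD: one data point at a time -> the loss term is just (y - y_hat)**2
    for i in order:
        w_sgd -= lr * grad(w_sgd, X[i:i + 1], y[i:i + 1])
    # Mini-batch SGD: k points at a time -> the loss is the sum over i = 1..k
    for start in range(0, len(X), k):
        idx = order[start:start + k]
        w_mini -= lr * grad(w_mini, X[idx], y[idx])

print(w_sgd, w_mini)   # both approach the true slope of 3.0
```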
In the case of mini-batch, the updation happens based on that k value, and it requires fewer resources compared to gradient descent.

Now let us try to understand what small problems we may face if I am using mini-batch SGD. I am going to draw a very simple 2D diagram of the global minimum, which is available on Wikipedia, because I am also drawing this diagram from Wikipedia. Consider a two-dimensional contour plot of the loss with two weights, w1 and w2, and suppose the central point is the global minimum that I have to reach. If I use gradient descent, the convergence happens directly: the path goes more or less straight and finally reaches that point. But in the case of mini-batch stochastic gradient descent we are taking a sample of the data points, so the convergence will not happen as smoothly; the path will move in a zigzag manner, it will take time, and only finally will it reach convergence.

So what we can say is that the derivative of the loss with respect to w_old computed with mini-batch stochastic gradient descent will be approximately equal to the derivative of the loss with respect to w_old computed with gradient descent. Whenever I use mini-batch SGD to find the derivative of the loss with respect to a specific weight and update that weight, the result is approximately equal, never exactly equal. Here you can see that we have a zigzag movement with mini-batch SGD, whereas with gradient descent we just have a straight movement towards the global minimum. But always remember there is one problem with gradient descent: you need high computational memory. Just imagine loading one million, two million, ten million records into RAM directly while executing; imagine how many resources that may require. So nowadays people mostly use mini-batch SGD and specify some batch size, let it be 100 or 200, depending on the computational power.

To put it in a statistical term, gradient descent, where I am using all the data points, is just like a population, and mini-batch is just like a sample, and you know that the mean of a sample is approximately equal to the mean of the population, not exactly equal. This zigzag movement is basically called noise; the path has some noise in it. In order to reduce this noise there is a concept called stochastic gradient descent with momentum, which I am going to discuss in my next video.
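The "approximately equal" relationship, and the noise that comes with it, can be checked numerically; below is a small sketch, again with made-up data, comparing the gradient of the loss computed over the full data set (the "population") with the gradient computed over a random mini-batch of 100 points (the "sample").

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=1_000_000)            # imagine a data set with a million records
y = 3.0 * X + rng.normal(scale=0.5, size=X.shape)
w = 1.0                                   # some current weight value

def grad(w, xb, yb):
    """dL/dw for the MSE loss on a batch, averaged over the batch."""
    return (-2.0 / len(xb)) * np.sum((yb - w * xb) * xb)

full_grad = grad(w, X, y)                 # "population" gradient (gradient descent)
idx = rng.choice(len(X), size=100, replace=False)
mini_grad = grad(w, X[idx], y[idx])       # "sample" gradient (mini-batch SGD)

# The two values are close but not identical; the gap is the noise
# that produces the zigzag path towards the global minimum.
print(full_grad, mini_grad)
```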
I hope you liked this explanation of stochastic gradient descent and that you understood when we use it. The main thing to remember is that whenever we are using k data points as a batch, that is, a mini-batch, we are using mini-batch stochastic gradient descent as the technique for the convergence. So I hope you liked this video. Make sure you subscribe to the channel; I keep on making many videos like this. Share it with all your friends, whoever requires this kind of help. I'll see you in the next video, have a great day, thank you one and all.
Info
Channel: Krish Naik
Views: 93,873
Keywords: stochastic gradient descent explained, stochastic gradient descent python, stochastic gradient descent, stochastic gradient descent medium, stochastic gradient descent matlab, stochastic gradient descent convergence, stochastic gradient descent linear regression, stochastic gradient descent classifier
Id: FpDsDn-fBKA
Length: 12min 16sec (736 seconds)
Published: Sun Jul 28 2019