The Stochastic Gradient Descent Algorithm

Captions
Okay, now we're going to change gears and talk about optimization. At the cornerstone of all neural network training, and really of all of machine learning, is this idea of optimizing over some data with some kind of model assumed inside of it. So we're going to talk a bit about what's happening on the back end, the real nuts and bolts of the optimization engines, or at least one of the optimization engines that's commonly used to do this. This is out of the book by Brunton and Kutz, Section 6.4. You can find details on the book, as well as more lectures and more code, at databookuw.com; in fact, all the code I'll present here is on that website, and even though I'll be lecturing in MATLAB, there are Python equivalents there as well.

I want to talk about optimization, and in particular I want to talk about gradient descent. Gradient descent is the workhorse algorithm of optimization, especially for high-dimensional systems, but really in general: it's a generic tool for locating maxima or minima of functions, of whatever landscape you want to optimize over. It's ubiquitous across the sciences precisely because it's so generic, and one way to think about it is simply this: in a high-dimensional landscape, how do you walk downhill? What you want is to get to the global minimum, and of course gradient descent can't guarantee you'll get there, because you might get stuck in a local minimum, but we'll talk about that as we go along. I'm going to walk through gradient descent, but I also want to talk about one of the more important versions of it, commonly used in neural network training for very high-dimensional systems: stochastic gradient descent. I want to show how stochastic gradient descent connects to gradient descent, and then walk through a little MATLAB code to show you its implementation in practice.

As the name suggests, it's a version of gradient descent, and the part we really want to understand is what it means to have a stochastic component. So first let's back up and talk about gradient descent. Typically in optimization you're trying to find a set of parameters that minimizes some error. Here's a generic representation of that: I want the argmin of some error E, which depends on a set of weight matrices A_1 through A_M. The A_j here represent the weights from one layer of a network to the next: the idea is that there are M layers, and each layer's weights are collected into A_1, A_2, and so on up to A_M. As I showed you in the backpropagation lecture (the previous lecture, on Section 6.3), you can frame this as an optimization problem: given a correctly labeled training set mapping inputs to outputs, find all the weights A_j so that the error is minimized. That's the whole game. The weights are found by formulating this objective function; I'll refer you to the backpropagation lecture for how to formulate it practically, given nonlinear activation functions going from one layer to the next. But let's just say this is what I want to optimize over, so I have to pick all these weights, and of course it's a high-dimensional space; there are a lot of weights to be picked.

In modern neural network training these spaces are extremely high-dimensional, and you have to figure out how to update all those weight values and walk downhill to some minimum, preferably a global minimum, though you can always get stuck in local minima. Stochastic gradient descent is going to help us with this problem quite a bit, both by making the optimization much more computationally tractable and by helping us get out of local minima, because of the stochastic aspect of the search. Here's what we really want to do: you give me a set of weights, and I want a function f mapping an input x_k to an output y_k. This is a generic representation; remember that from a neural network point of view, f would ultimately represent a compositional structure, mapping from one layer to another through nonlinear functions, and backpropagation tells us how to compute the derivatives needed to minimize the error. So I want to optimize over all the parameters, all the weights, to minimize this error, and this is the computation that's so costly in training: training a neural network is computationally expensive because there are so many weights to train over. Normally I have n data points, so I have to walk over all the data points, through all the layers, over all the weights. Suppose I give you a training set with a million samples and a million labels; then you're going to optimize this over the entire million-sample data set, for all the weights. That's a very large computation to do.

Typically the way this is done is with gradient descent, which we talked about previously. I take my current vector of weights, call it x_j, and walk it to the next set of values x_{j+1} by updating x_j with a gradient computation. If f is the function that takes me from an input to an output, I take the gradient of f at x_j and update with some parameter delta:

x_{j+1} = x_j - delta * grad f(x_j)

Delta is what's often called the learning rate: how big a step in the direction of the gradient you take to update your solution. Gradient descent methods are basically trying to walk downhill: at any point in the space, you compute your local gradient and you walk downhill, and the step you take is determined by delta. If you take a very large step, then in a high-dimensional space you can overstep and walk yourself into a whole different regime; if you take small steps, you're headed in the right direction, but you could be optimizing forever, making only marginal progress toward the minimum. In either case, that's the generic framework of gradient descent: take your function, compute the gradient, walk downhill. I showed you how to do this with a neural net when I talked about the backpropagation algorithm: given nonlinear activation functions between layers, backpropagation gives you a way to compute this very easily. It's just the chain rule. And typically the activation functions are chosen so that they're very easy to differentiate; again, I want to highlight something like the ReLU, which is piecewise linear.
Computing the derivative of a ReLU is trivial: it's just the slope of the ReLU, and it's zero on the left side. So you typically pick activation functions so that, when you do this high-dimensional optimization, the gradient is actually pretty easy to compute, even though you have to do a lot of computations.

Now, typically you evaluate your progress by monitoring your error over all the data points. In other words, my error, which depends on the weights I've picked, is basically the sum of the errors over all the data:

E = sum over k = 1 to n of E_k, where E_k = || f(x_k) - y_k ||^2

Remember, E_k is built from this function f, which is a composition taking an input x_k to an output y_k, and I just take the square of that error. So if I have a million data points, it's not just how well I did on one data point: I want to minimize this error over all the data I have. I'm trying to find the weights so that, given a large data set, a million, ten million, a hundred thousand points, whatever it happens to be, I find the best weights over all the data given to me. That's very costly, but it does give you a global way to evaluate the error over all of the data. It doesn't mean you'll get around local minima, but it does give you at least an evaluation metric over the whole data set.

The difference between gradient descent and stochastic gradient descent is now going to become very clear. What stochastic gradient descent says is this: instead of computing my error and my gradients over all of my data, I'll compute them over a single data point. Out of my million data points, I select one at random, say the k-th, and compute the gradient there. You can see how much more cost-effective this is going to be, because I don't have to walk over all a million points; I just pick one point, compute the gradient, and make an update step. Here's the update of the weights: I take the current weights, compute this single-point gradient, and get my next set of weights,

w_{j+1} = w_j - delta * grad E_k(w_j)

So with one data point, computing the gradient with that single data point, I've given you a way to update and find a new set of weights w_{j+1}. Now, clearly this isn't as good as using all your data. But notice the advantages. First, computational cost: I'm only using a single data point. Of course there's also something bad about that, namely that I'm only using a single data point, but computationally a single point is going to be a lot faster than all my data. Second, every time I update, I randomly choose a new data point. So maybe out of my million data points I pick the 500th point, and the next time I pick the 130,000th, and I keep switching which data I use as I update. What this allows me to do is randomly sample the space of the data, the hope being that it becomes much harder to get stuck in a local minimum, because I'm randomly walking around: something that's a local minimum for one data point is, with very low probability, also a local minimum for another data point. So by randomly subselecting data points I can more effectively walk around local minima and try to get to the global minimum. Those are the advantages; the disadvantage is that you're using a very small subset of the data, with the hope that over many iterations you can walk around local minima and down to some kind of global minimum.
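A minimal sketch of that single-point update in Python, under the assumption of a toy one-parameter least-squares model (the data, model, and learning rate are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = 2x + small noise; fit a single weight w so that E_k = (w*x_k - y_k)^2
x = rng.uniform(-1, 1, 1000)
y = 2.0 * x + 0.01 * rng.normal(size=1000)

w, delta = 0.0, 0.1
for _ in range(2000):
    k = rng.integers(len(x))                  # pick one data point at random
    grad_k = 2 * (w * x[k] - y[k]) * x[k]     # gradient of E_k alone, w.r.t. w
    w -= delta * grad_k                       # w_{j+1} = w_j - delta * grad E_k(w_j)
```

Each iteration touches one of the thousand data points, yet w still wanders down to roughly the least-squares answer of 2.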
Now, oftentimes you don't actually use a single data point. Instead of taking just the k-th data point out of all the data, what people often do is take batches. This is where the common term "batch size" in neural network training comes from: it simply refers to how many data points you take. I can take a randomly chosen subselection of points k_1 through k_p, so this k represents a selection of p data points, and I can compute the gradient using those p points instead of one point, or all the points. This frames things in a much better way, because instead of one data point carrying all the weight of deciding which direction to step next, you take a small selection of points, have them vote on which way is downhill, and take a step in that direction. Once you've taken a step, updating your weights based on the gradient computed from that randomly chosen batch, you take another random batch and another step, then another random batch, and so on. Notice the advantages you get in the stochastic gradient descent method. First, you've reduced this to a much smaller optimization problem: you're using p data points versus n, with the idea that p is quite a bit smaller than n, so computationally it's much faster. Second, by picking a new batch at every single step, your ability to walk around local minima increases significantly. So you're in a situation where you get cost savings, because you're solving a much smaller problem, but you also avoid one of the pitfalls of gradient descent, getting stuck in local minima, because you're continually picking different random batches, and it's with very low probability that those random batches share the same local minimum. So oftentimes you just walk right around the local minimum, down into something that's closer to the global minimum.
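The batch version changes only how the gradient is estimated: average the single-point gradients over p randomly chosen points. A sketch under the same toy least-squares assumptions as before (model, batch size, and learning rate are my own choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Same toy setup: fit a single weight w to noiseless data y = 2x
x = rng.uniform(-1, 1, 1000)
y = 2.0 * x

w, delta, p = 0.0, 0.1, 10                    # p is the batch size
for _ in range(500):
    batch = rng.choice(len(x), size=p, replace=False)         # points k_1 ... k_p
    grad = np.mean(2 * (w * x[batch] - y[batch]) * x[batch])  # the batch "votes"
    w -= delta * grad                         # step in the averaged downhill direction
```

Averaging over ten points smooths out the noise of the single-point estimate while still costing far less than a full pass over all n points.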
This is the advantage, and stochastic gradient descent is, for these reasons, your workhorse architecture for the large-scale optimization required by neural networks. It's a very powerful technique, and it's based on standard gradient descent, computing the gradient to figure out how to go downhill and walking that way, but now you're doing it with small, randomly chosen selections of the data.

What I want to do now is give you an example of this in code. Here's the code: I wrote a little stochastic gradient descent algorithm for a very simple problem, and I'm going to show you the results; you can play around with it, and it's very simple to watch how it's actually working. The beginning is very simple. I define a domain, an x and y domain, so it's going to be a two-dimensional surface; my step size between points is 0.1, and I make a meshgrid, so my x and y variables form a two-dimensional domain going from -6 to 6 in both the x and y directions. I define a function F1, which is 1.5 minus 1.6 times some exponential, and my actual function F is F1 plus some other exponential, so it's basically two exponentials glued together. You'll see a picture of it in a moment. The main thing I wanted is that it has a local minimum and a global minimum: two minima, one local and one global. What we're going to do with this surface is pick a point and see whether gradient descent, or stochastic gradient descent, can find the minimum of this function. By the way, on this surface I can compute the gradient over the whole thing with the gradient(F) command, which gives me the gradient over the entire spatial domain. But what I'm going to do instead is randomly select a small set of values and compute the gradient there, like you would in stochastic gradient descent: you draw a small random set of points, compute the gradient, update, and keep walking downhill.

Here's how it's done. For my random selection of points I use the randperm command: randperm(n), where n is the number of values, or grid points, of my two-dimensional function space, randomly permutes the indices, and I take the first ten. So my batch size is ten: I take ten points randomly from the surface and compute the gradient. That's what I1 and I2 are, a random subselection of these points. Then, in the loop, I compute the gradient at these randomly selected points: I throw the indices I1 and I2 right into the gradient arrays, so out of all the points available I take ten at a time, compute the gradient, go through the loop again, take ten new random points, go downhill again, and keep selecting out ten random points. The variable tau is going to be my learning rate, my step size. Remember, gradient descent says that my future weights, or whatever I'm trying to optimize, are equal to what I have now plus a correction, which is the gradient times the learning rate. The gradient is computed here, and my learning rate is tau.
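A rough NumPy analogue of that randperm-based batch selection (the 121-by-121 grid size assumes the -6:0.1:6 domain just described; the names I1 and I2 simply mirror the lecture's MATLAB variables):

```python
import numpy as np

rng = np.random.default_rng(2)

n = 121 * 121                        # number of grid points on the 121-by-121 surface
perm = rng.permutation(n)            # analogue of MATLAB's randperm(n)
batch = perm[:10]                    # batch size of ten: ten random grid points
I1, I2 = np.unravel_index(batch, (121, 121))   # row/column indices into the grid
```

Because batch comes from a permutation, the ten indices are guaranteed distinct, and redrawing perm each pass through the loop gives a fresh random batch every step.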
Then I just update: at the end, once I find a new point, I overwrite the old one and continue to update the solution, again and again. That's what this part does: the computation of the update of the solution and the update of the derivatives for the gradient descent. You can walk through this code in more detail at your leisure, but the idea is very simple: you have this simple function, you randomly downselect ten points, and you do the gradient descent. And if your weights are not updating by more than 10^-6, in other words if you've converged to a set of weights, you break out of the loop; otherwise you update your weights and keep going through the loop.

So let me just run it, and you'll see what the figures look like. Here is an example of this having run, and these are both the top view and the bottom view. This is the landscape that I built, and notice that I started here, at this red point, and it walked downhill, but of course it got into this local minimum. Even though I did stochastic gradient descent, this local minimum is quite large, so to get out of it you'd potentially have to take a much bigger learning rate, so that a single move could carry you over the little hill separating this basin from the global minimum, which is the blue one. By the way, this is very common in gradient descent: it's very easy to get stuck in local minima, and when you get stalled there you might want to search around, or perturb away from the point, to see if you can continue downhill. There are other methods that do that.
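A rough Python sketch of the demo just walked through. The two-exponential surface below is my own guess in the spirit of the lecture, not the exact function from the book, and for brevity the descent reads the precomputed gradient at the nearest grid point rather than reproducing the random ten-point draw:

```python
import numpy as np

# Two-exponential landscape with a local minimum near (3, 3) and a deeper,
# global minimum near (-3, -3) (an illustrative stand-in, not the book's function).
xv = np.linspace(-6, 6, 121)
yv = np.linspace(-6, 6, 121)
h = xv[1] - xv[0]                      # grid step of 0.1
X, Y = np.meshgrid(xv, yv)
F = (1.5 - 1.6 * np.exp(-0.05 * (3 * (X + 3) ** 2 + (Y + 3) ** 2))
     + 0.5 - np.exp(-0.1 * (3 * (X - 3) ** 2 + (Y - 3) ** 2)))

dFdy, dFdx = np.gradient(F, h, h)      # gradient over the entire spatial domain

def descend(x, y, tau=2.0, tol=1e-6, max_iter=10_000):
    """Walk downhill, reading the precomputed gradient at the nearest grid point."""
    for _ in range(max_iter):
        i = np.argmin(np.abs(yv - y))  # nearest row (y) index
        j = np.argmin(np.abs(xv - x))  # nearest column (x) index
        dx, dy = tau * dFdx[i, j], tau * dFdy[i, j]
        x, y = x - dx, y - dy
        if np.hypot(dx, dy) < tol:     # converged: the step is negligible
            break
    return x, y

# A start inside the global minimum's basin walks down to roughly (-3, -3)
x_end, y_end = descend(0.0, -1.0)
```

Starting instead near (5, 5) would leave the walk trapped in the shallow basin around (3, 3), which is the stuck-in-a-local-minimum behavior the run above demonstrates.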
But right now this is just a very simple example of a stochastic gradient method search. However, if I start over here, notice that I converge right down to this minimum with the stochastic gradient method, and there you go; this is the top view showing the convergence. And you can play with this very easily: we can change the learning rate to take bigger or smaller steps and see what happens. So let's do that: I doubled the step size and also moved my initial condition, and you can see that it converges much more quickly, because I'm taking much bigger steps; look how big these steps are walking down that hill. Same thing here, but with the random draws I've chosen I still haven't been able to get out of this basin; it's a pretty hard hill to get over from that side to this side. But at least this illustrates a lot of the common features you'd see in gradient descent: this kind of structure, getting yourself to a global minimum either through here or through here. And remember the advantage of what I just showed you: I'm not computing the gradient over this entire surface every time I take an update. You pick ten points, and ten points only, and with those ten points you can walk down that hill and get yourself into one of the minima.

That's the main idea behind stochastic gradient descent. It's a really powerful tool, and one of the workhorses of modern neural networks: take this optimization process, which is extremely expensive, and instead of computing the gradient over all points, pick a batch randomly, compute your gradient, take a step; pick another random batch, compute the gradient, take a step; and walk yourself downhill into some kind of global minimum. I thought it would be important for you to recognize that people have been very clever about thinking of ways to do this kind of optimization at scale. And again, for a neural network you really have to think about how to go to scale, because the number of weights in your system is extremely high. Normally when we learn gradient descent in a class, maybe you're doing it over two or three parameters; here we're talking millions, billions. It's a totally different ballgame, with lots of local minima, and it's very expensive. So the stochastic gradient method really has these very nice features to be exploited, which help you train a neural network in a better way, faster, and with some randomness in it so that maybe you'll get out of some of your local minima.

That's going to be it for this lecture. Now you have some idea of how the stochastic gradient method works, and this is basically put on top of the backpropagation architecture; those two together form, let's call it, the backbone, the dominant paradigm for optimization over neural networks. Certainly the rise and success of neural networks was built on the backs of those two important ingredients, stochastic gradient descent coupled with the backpropagation algorithm, so that you could actually train large-scale networks at scale. This is Section 6.4 of the book by Brunton and Kutz. You can find more information at databookuw.com: lots of extra lectures you can follow along with, all kinds of interesting things, as well as a code base. If you want to look at the notes for the book, they're in the databook PDF off of that website, and you can download a PDF file with all the information.
Info
Channel: Nathan Kutz
Views: 4,959
Keywords: neural networks, deep learning, stochastic gradient descent, gradient descent, machine learning, optimization, kutz, brunton
Id: _adhFSH66jc
Length: 27min 48sec (1668 seconds)
Published: Thu May 07 2020