Lecture 17.2 — Large Scale Machine Learning | Stochastic Gradient Descent — [ Andrew Ng ]

Video Statistics and Information

Captions
For many learning algorithms, among them linear regression, logistic regression, and neural networks, the way we derived the algorithm was by coming up with a cost function, or coming up with an optimization objective, and then using an algorithm like gradient descent to minimize that cost function. When you have a very large training set, gradient descent becomes a computationally very expensive procedure. In this video we'll talk about a modification to the basic gradient descent algorithm called stochastic gradient descent, which will allow us to scale these algorithms to much bigger training sets.

Suppose you are training a linear regression model using gradient descent. As a quick recap, the hypothesis will look like this, h_theta(x) = sum over j of theta_j x_j, and the cost function will look like this, J_train(theta) = 1/(2m) * sum from i = 1 to m of (h_theta(x^(i)) - y^(i))^2, which is one half of the average squared error of your hypothesis on your m training examples. Plotted as a function of the parameters theta_0 and theta_1, the cost function J is a sort of bowl-shaped function, and gradient descent looks like this: in the inner loop of gradient descent you repeatedly update the parameters theta using the expression theta_j := theta_j - alpha * (1/m) * sum from i = 1 to m of (h_theta(x^(i)) - y^(i)) * x_j^(i). Now, in the rest of this video I'm going to keep using linear regression as the running example, but the idea here, the idea of stochastic gradient descent, is fully general and also applies to other learning algorithms like logistic regression, neural networks, and other algorithms that are based on running gradient descent on a specific training set.

So here's a picture of what gradient descent does: if the parameters are initialized at a point there, then as you run gradient descent, different iterations will take the parameters to the global minimum, so it takes a trajectory that looks like that and heads pretty directly to the global minimum. Now, the problem with gradient descent is that if m is large, then computing this derivative term can be very expensive, because it requires summing over all m examples. If m is 300 million (in the United States there are about 300 million people, so U.S. census data would have on the order of that many records), then if you want to fit a linear regression model to that data, you need to sum over 300 million records, and that's very expensive.

To give the algorithm a name, this particular version of gradient descent is also called batch gradient descent, and the term refers to the fact that we're looking at all of the training examples at a time; we call them a batch of all of the training examples. It really isn't maybe the best name, but this is what machine learning people call this particular version of gradient descent. And if you imagine that you really do have 300 million census records stored away on disk, the way this algorithm works is that you need to read all 300 million records in order to compute this derivative term. You need to stream all of these records through your computer, because you can't store them all in computer memory, and slowly accumulate the sum in order to compute the derivative. Having done all that work, that allows you to take one step of gradient descent. Then you need to do the whole thing again, scan through all 300 million records and accumulate those sums, and having done all that work you can take another little step of gradient descent, and then do it all again to take a third step, and so on. So it's going to take a long time for the algorithm to converge.
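To make the expense concrete, here is a minimal sketch of one batch gradient descent step for linear regression in Python with NumPy; the function name and the assumption that X is a feature matrix with an intercept column already added are illustrative, not from the lecture.

```python
import numpy as np

def batch_gradient_descent_step(theta, X, y, alpha):
    """One step of batch gradient descent for linear regression.

    Every single step sums the error over ALL m training examples,
    which is what makes it expensive when m is very large.
    """
    m = X.shape[0]                       # number of training examples
    errors = X @ theta - y               # h_theta(x^(i)) - y^(i) for every example i
    gradient = (X.T @ errors) / m        # (1/m) * sum_i (h_theta(x^(i)) - y^(i)) * x^(i)
    return theta - alpha * gradient      # simultaneous update of all parameters theta_j
```

Every call touches all m rows of X, so with 300 million records even a single parameter update requires streaming the entire data set through the machine.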
In contrast to batch gradient descent, what we're going to do is come up with a different algorithm that doesn't need to look at all of the training examples in every single iteration, but that needs to look at only a single training example in one iteration. Before moving on to the new algorithm, here's the batch gradient descent algorithm written out again, with that being the cost function and that being the update, and of course this term here, the one used in the gradient descent rule, is the partial derivative with respect to the parameter theta_j of our optimization objective J_train(theta).

Now let's look at a more efficient algorithm that scales better to large data sets. In order to work out the algorithm called stochastic gradient descent, let's write out the cost function in a slightly different way. I'm going to define the cost of the parameters theta with respect to a training example (x^(i), y^(i)) to be equal to one half times the squared error that my hypothesis incurs on that example, cost(theta, (x^(i), y^(i))) = 1/2 * (h_theta(x^(i)) - y^(i))^2. So this cost function term really measures how well my hypothesis is doing on a single example (x^(i), y^(i)). You'll notice that the overall cost function J_train can now be written in this equivalent form: J_train(theta) is just the average, over my m training examples, of the cost of my hypothesis on each example (x^(i), y^(i)).

Armed with this view of the cost function for linear regression, let me now write out what stochastic gradient descent does. The first step of stochastic gradient descent is to randomly shuffle the data set; by that I just mean randomly reorder your m training examples. This is a sort of standard pre-processing step, and I'll come back to it in a minute. The main work of stochastic gradient descent is then done in the following: we're going to repeat, for i = 1 through m, so repeatedly scan through the training examples, and perform the following update to the parameter theta_j: theta_j := theta_j - alpha * (h_theta(x^(i)) - y^(i)) * x_j^(i), and as usual we do this update for all values of j. You'll notice that this term is exactly what we had inside the summation for batch gradient descent. In fact, for those of you who are familiar with calculus, it's possible to show that this term is equal to the partial derivative, with respect to the parameter theta_j, of the cost of the parameters theta on (x^(i), y^(i)), where cost is of course the quantity defined previously. And just to wrap up the algorithm, let me close my curly braces over there.

So what stochastic gradient descent is doing is actually scanning through the training examples. First it's going to look at my first training example (x^(1), y^(1)), and then, looking at only this first example, it's going to take a little gradient descent step with respect to the cost of just this first training example. In other words, it's going to look at the first example and modify the parameters a little bit to fit just the first training example a little bit better. Having done this, inside the for loop it's then going to go on to the second training example, and there it's going to take another little step in parameter space, that is, modify the parameters just a little bit to try to fit just the second training example a little bit better. Having done that, it's then going to go on to my third training example and modify the parameters to try to fit just the third training example a little bit better, and so on, until it gets through the entire training set.
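The calculus step the lecture waves at is short; a sketch of the derivation in LaTeX, using the definitions above and the chain rule:

```latex
% Per-example cost and its gradient, with h_theta(x) = sum_j theta_j x_j
\[
\mathrm{cost}\big(\theta,(x^{(i)},y^{(i)})\big) = \tfrac{1}{2}\big(h_\theta(x^{(i)}) - y^{(i)}\big)^2
\]
\[
\frac{\partial}{\partial \theta_j}\,\mathrm{cost}\big(\theta,(x^{(i)},y^{(i)})\big)
  = \big(h_\theta(x^{(i)}) - y^{(i)}\big)\,\frac{\partial h_\theta(x^{(i)})}{\partial \theta_j}
  = \big(h_\theta(x^{(i)}) - y^{(i)}\big)\,x_j^{(i)}
\]
```

This is exactly the term inside the stochastic gradient descent update, and also the term inside the batch gradient descent sum before averaging over m.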
The outer repeat loop may then cause the algorithm to take multiple passes over the entire training set. This view of stochastic gradient descent also motivates why we wanted to start by randomly shuffling the data set: it just ensures that when we scan through the training set, we end up visiting the training examples in some randomly sorted order, regardless of whether your data already came randomly sorted or whether it came originally sorted in some strange order. In practice, this tends to speed up the convergence of stochastic gradient descent a little bit, so in the interest of safety it's usually better to randomly shuffle the data set if you aren't sure whether it came to you in randomly sorted order or not.

But more importantly, another view of stochastic gradient descent is that it's a lot like batch gradient descent, except that rather than waiting to sum up these gradient terms over all m training examples, we take the gradient term using just one single training example and start making progress in improving the parameters already. So rather than waiting until we've scanned through all 300 million United States census records, say, before we can modify the parameters a little bit and make progress towards the global minimum, with stochastic gradient descent we just need to look at a single training example, and we're already starting to move the parameters towards the global minimum.

So here's the algorithm written out again, where the first step is to randomly shuffle the data and the second step is where the real work is done: the update with respect to a single training example (x^(i), y^(i)). Let's see what this algorithm does to the parameters. Previously we saw that when we're using batch gradient descent, that is, the algorithm that looks at all the training examples at a time, batch gradient descent will tend to take a reasonably straight-line trajectory to the global minimum, like that. In contrast, with stochastic gradient descent every iteration is going to be much faster, because we don't need to sum up over all the training examples; instead, every iteration is just trying to fit a single training example better. So if we were to start stochastic gradient descent at a point like that, the first iteration may take the parameters in that direction; in the second iteration, looking at just the second example, maybe just by chance we get a little unlucky and actually head in a bad direction; in the third iteration, where we try to modify the parameters to fit just the third training example better, maybe we'll end up heading in that direction; and then we look at the fourth training example, the fifth, the sixth, the seventh, and so on. As you run stochastic gradient descent, what you find is that it will generally move the parameters in the direction of the global minimum, but not always, and so it takes a somewhat more random-looking, circuitous path towards the global minimum. In fact, as you run stochastic gradient descent, it doesn't actually converge in the same sense as batch gradient descent does; what it ends up doing is wandering around continuously in some region close to the global minimum, rather than actually reaching the global minimum and staying there.
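Here is a minimal sketch of the algorithm just described, in the same hypothetical Python/NumPy setting as the earlier batch sketch; the function name and the n_epochs parameter for the outer repeat loop are illustrative.

```python
import numpy as np

def stochastic_gradient_descent(theta, X, y, alpha, n_epochs=1, seed=0):
    """Stochastic gradient descent for linear regression.

    Each update uses a single training example, so the parameters start
    improving immediately instead of only after a full pass over the data.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    for _ in range(n_epochs):                     # outer "repeat" loop: 1 to 10 passes is typical
        order = rng.permutation(m)                # step 1: randomly shuffle the data set
        for i in order:                           # step 2: scan through the examples one at a time
            error = X[i] @ theta - y[i]           # h_theta(x^(i)) - y^(i) for this one example
            theta = theta - alpha * error * X[i]  # update every theta_j using only example i
    return theta
```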
In practice this isn't a problem, because as long as the parameters end up in some region that is pretty close to the global minimum, they will give a pretty good hypothesis. So usually, running stochastic gradient descent gets the parameters near the global minimum, and that's good enough for essentially all practical purposes.

Just one final detail: in stochastic gradient descent we had this outer loop, repeat, which says to run the inner loop multiple times. So how many times do we repeat this outer loop? Depending on the size of the training set, doing it just a single time may be enough, and up to maybe 10 times may be typical, so you may end up repeating the inner loop anywhere from one to 10 times. If you have a truly massive data set, like the U.S. census data set example that I've been talking about with 300 million examples, it is possible that by the time you've taken just a single pass through your training set, for i = 1 through 300 million, you might already have a perfectly good hypothesis, in which case you might need to run the inner loop only once if m is very, very large. In general, taking anywhere from 1 through 10 passes through your data set may be fairly common, but it really depends on the size of your training set. And if you contrast this with batch gradient descent: with batch gradient descent, after taking a pass through your entire training set, you would have taken just one single gradient descent step, one of those little baby steps, and this is why stochastic gradient descent can be much faster.

So that was the stochastic gradient descent algorithm, and if you implement it, hopefully it will allow you to scale many of your learning algorithms to much bigger data sets and get much better performance that way.
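As a rough usage sketch under the same assumptions (toy synthetic data standing in for a much larger training set, and the stochastic_gradient_descent function from the sketch above):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1_000                                                    # toy stand-in for a huge training set
X = np.hstack([np.ones((m, 1)), rng.normal(size=(m, 2))])    # intercept column plus 2 features
true_theta = np.array([4.0, -2.0, 0.5])
y = X @ true_theta + rng.normal(scale=0.1, size=m)

theta = stochastic_gradient_descent(np.zeros(3), X, y, alpha=0.01, n_epochs=3)
# A handful of passes is typically enough: theta ends up wandering near the
# least-squares solution, while a single batch gradient descent step would
# already have had to sum over all m rows.
```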
Info
Channel: Artificial Intelligence - All in One
Views: 61,648
Keywords: Machine Learning, Machine Learning Video Lecture, Computer Science, Video Tutorial, Video Course, Stanford Video Course Machine Learning, Stanford University, University of Stanford, Stanford, Online Machine Learning, Best Machine Learning video course, Andrew Ng, Andrew Ng ML, Andrew Ng Machine Learning, Andrew Ng Course, Andrew Ng Machine Learning Course
Id: W9iWNJNFzQI
Length: 13min 19sec (799 seconds)
Published: Thu Feb 09 2017