Lecture 7 | Acceleration, Regularization, and Normalization

Captions
I'm not going to be able to get through everything today. I was originally planning to record something over the weekend, but I realized it's much more effective to cover the early portions of this lecture in class, because they're important; the trailing bits I will record as a separate lecture and put up by tomorrow. It probably won't be very long, maybe half an hour or forty minutes at best.

So here's a quick recap of everything we've done. We went through gradient descent and backpropagation. Here is the basic formalism we use for training the network: we have a loss function, which is a function of the parameters of the network, defined as the average divergence over a whole collection of training samples, and we try to find the parameters that minimize this loss. The loss has several components: the total loss is the term L; the summation and the division in front show that you're performing averaging; div is the divergence function that quantifies the difference between the desired output and the actual output of the network. The divergence is a function of two terms: the first is the output of the network itself, f, which is where the parameters come in (so the parameters are nested inside the div), and d(X) is the desired output of the network in response to the input X. To train the network, we compute the gradient of the loss with respect to the parameters, which is simply the average of the gradients of the divergences for the individual training instances, and the actual optimization, minimizing L(W), is gradient descent: we iteratively walk against the gradient to learn the parameters of the network (a minimal sketch of this loop appears below).

Before I continue, there is one thing I would like you to keep in mind: what we are actually trying to minimize is not this loss. There is some true function that you want the network to match; based on your current setting of the weights you have some other function f, and the area between the two quantifies the error. You're trying to learn the parameters that minimize this area, where the error at any given point is quantified by the divergence. So really, the term you want to minimize is, ideally, the argmin over W of the expected value of the divergence between f(X; W) and the true function g(X). Keep that in mind.

Now, the gradient of the total loss is the average of the gradients for the individual instances, and those can be computed using backpropagation — everything we've seen so far. Backpropagation itself is not the problem; it computes derivatives for individual instances. The problem with the gradient descent approach is, first, that we are minimizing a loss that relates to classification accuracy but is not actually classification accuracy: the divergence isn't counting the number of errors you make; it's a continuous-valued proxy.
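For reference, here is a minimal sketch of the batch gradient descent loop just recapped. This is not the lecture's own pseudocode; `grad_div`, `X`, `D`, and `eta` are placeholder names standing for the per-instance gradient (what backpropagation computes), the training data, and the step size.

```python
import numpy as np

def batch_gradient_descent(W, X, D, grad_div, eta=0.1, epochs=100):
    """Plain batch gradient descent on the empirical loss
    Loss(W) = (1/N) * sum_i div(f(X_i; W), d_i).

    W        : parameter vector (numpy array)
    X, D     : training inputs and desired outputs, one row per instance
    grad_div : function (W, x, d) -> gradient of the divergence w.r.t. W
               for a single instance (backpropagation provides this)
    eta      : fixed step size
    """
    N = len(X)
    for _ in range(epochs):
        # Average the per-instance gradients over the WHOLE training set...
        g = np.zeros_like(W)
        for x, d in zip(X, D):
            g += grad_div(W, x, d)
        g /= N
        # ...and only then take a single step against the gradient.
        W = W - eta * g
    return W
```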
And, as we've mentioned in several previous classes, there is an ideal function that we're actually trying to approximate using our network. Our problem is that we don't know what this function is. In the ideal case, if we actually had the function, we would minimize the total divergence between what the network computes and the function we want it to compute — and that function could be anything; we just don't know what it is. (You're right — absolutely correct, my mistake on the slide; thanks for catching it.)

So, again, minimizing the loss is expected to, but not guaranteed to, minimize the classification error. And even simply minimizing the loss is hard enough. We found that with gradient descent, the step size affects whether the algorithm converges at all. For example, in the top row: a step size that makes the solution converge cleanly for one eccentricity of the loss function (the figure to the left) can make things blow up for a steeper eccentricity; on the other hand, a step size that makes the solution converge in the second case results in extremely slow convergence in the first. The loss is a function of many weights and biases and has different eccentricities with respect to different weights, so a fixed step size for all the parameters can result in convergence with respect to one weight while causing divergence with respect to another. Having a single fixed step size is a problem.

Solutions — again, this is a recap of everything we've done so far. You can try to normalize the curvature in all directions; we did this using second-order methods like Newton's method, but they require inverting giant Hessian matrices, which is infeasible. You can treat each dimension independently, as in Rprop or Quickprop; that works, but it ignores dependences between dimensions, can result in unexpected behavior, and can sometimes be too slow. So we came up with momentum methods, which is roughly where we left off. The principle of momentum methods: if you see behavior like the left case, where the function is converging smoothly, you know you're not oscillating or diverging, so you can increase the step size; whereas if you're oscillating across the bowl, you can detect it immediately because the sign of the derivative keeps flipping, and then you want to shrink the step size. If you do something of that kind in the loss-minimization picture, the effect is to stretch the step size along the direction of consistent progress and shrink it along the direction of oscillation, which results in better convergence.

We saw two variants of this. The first was the momentum method.
In the momentum method, you compute a step along the gradient at the current location, but then extend it in the direction of the previous step: you maintain a running average of all the steps you have taken so far and update based on that running average rather than just the current gradient. The advantage is that you're basically taking an expectation — because of the running average you're averaging all the past steps, so you update based on the expected value of the step in each direction. If the steps are oscillating in some direction, you'd expect the expected value to cancel toward zero; if they're consistently continuing in some direction, you'd expect the expected value to stay consistent, and you can increase the step size. So keep in mind that when you use momentum methods, you base the update on the expected value of the step: if the steps alternate — positive by some amount here, negative by the same amount there — those pairs cancel out, whereas consistent steps average to something substantial. We are only looking at expectations.

Nesterov's method was a somewhat more optimal version of standard momentum, where you first take the extended step and then compute the gradient, rather than the other way around. It's easy to see why it's more optimal: you want to compute the gradient with respect to where you will end up, rather than where the current gradient tells you to go. All of this is from past lectures.

Today we're going to cover a bunch of new topics: incremental updates; revisiting training algorithms; generalization; and some tricks of the trade, including divergences, activations, and normalizations.

First, this business of incremental updates. Let's go back to the function we had (thank you again — the same mistake persisted here). Here was the function we actually wanted to estimate, but we don't have the value of the function everywhere, so what we did was take samples of input-output pairs at various locations, and we try to compute a function that fits these samples. Now, if the current estimate of the function looks like so, you want to minimize this error — this shaded area — but you cannot actually compute the shaded area, so instead you compute the average length of the error at these samples and use that as a proxy for the shaded area. Your initial estimate is almost certainly going to be wrong everywhere, so every one of those lengths is going to be nonzero, and you're trying to minimize their average. What gradient descent does is consider the divergences at every single point: it tries to correct the error at every single point — it may not succeed, but it will try — increasing the function value at the first two locations, decreasing it at the next two, increasing it again at the fifth. Basically, it tries to adjust the function at every single training point to reduce the error between what you have and what you want, and that gives you a modified function.
Then you repeat the whole thing: you compute the gradient across the entire data set, all of your instances, which means you have errors at every single location, you make adjustments at every single training point to correct them, and that gives you some new function — and you keep doing this. What we are really trying to do here is simultaneously minimize the error at every single training point, and that means you have to process every single training instance — go over the entire collection — so you have to wait until all of your data points have been seen before you make a single adjustment.

Here's an alternative: you can begin making adjustments at one location at a time. Think of it this way: I have some surface and a handkerchief that I'm trying to fit to it; the handkerchief starts off with some arbitrary shape. One way is to push or pull the entire handkerchief at once to fit the surface. The other is to keep poking and prodding at individual locations — but then I can't pull a whole lot at any one spot and force a fit there, because that would change the entire shape; I have to push and prod by tiny, tiny bits and adjust locally to make the shapes fit.

So you could adjust the function at one training point at a time. Here, for instance, you find that at this location the function is off by the yellow bar, so you push it up — that pushes the dotted line up to the red line. Then you pick some other training point; there it's still too low, so you push it up at that point. Then a third training point, where the function is too high, so you push it down there. You do it one point at a time, and if you keep doing this in an incremental manner — so long as you keep the individual adjustments small, and this is important, because, again, think of the handkerchief: if you try to make the whole thing fit exactly at one point, you mess up the shape everywhere else — you just make local adjustments at each point. If you do this carefully, one point at a time (you've done this kind of thing when shaping threads to curves and such: you don't push the entire thread at once, you keep pushing at local spots, slowly enough that it eventually fits), eventually you will have gone over all of your training points and made adjustments everywhere. And you can potentially make a greater overall adjustment this way than if you tried to adjust for all of your training points at once — we'll see why this could be the case, although there is going to be a little bit of divergence between the theory and your intuition.
This incremental update mechanism is what we call stochastic gradient descent: you adjust the parameters one training point at a time. The pseudocode says as much: observe that I'm looping over all of the training instances and making updates to the parameters after each training instance. When you do this, the iterations can make multiple passes over the entire data, so we get some new nomenclature: a single pass over the entire data is called an epoch. Within an epoch you have many observations and you make an adjustment for every observation, so if you have T training instances you will make T updates within an epoch. Going back to the pseudocode, the outer loop runs over epochs; the highlighted block is a single epoch, one pass over the entire training data; and within each epoch you make T updates to the parameters.

There are caveats: you can't just do it any old way; you have to be careful. Consider a situation where my training data look like this: one clump at the extreme left and another at the extreme right. If I go through my data left to right, in sequential order, and keep making my adjustments, then at the beginning the function sees a bunch of over-estimates, so it pushes all of those down; but if my function is naturally restricted to being a curve, pushing the left side down makes the right side swing up. Then my sequential processing reaches the data points on the right, pushes those down, and the left side swings back up. If I go back through the data in the same order, the whole thing goes back up and then back down again. So if you move through the samples in a deterministic, cyclic order, you can end up with cyclic behavior of this kind, where the function just keeps oscillating. To avoid this and get more convergent behavior, we must go through the data randomly: randomly pick data points, so that maybe you push something on the left down a bit, the next random choice is possibly on the right and you push that up a bit, and so you begin making corrections all over the place; the randomness ensures you don't get cyclic behavior. This random selection of data points is extremely critical — if you don't do it, cyclic behavior will almost certainly emerge in most settings. So, updating our pseudocode, we randomly permute the training data between epochs, to make sure that each pass goes through the data in a different randomized order.

Story so far: in any gradient-descent optimization problem, presenting training data instances incrementally can be more effective than presenting them all at once, provided the training data are presented in random order. This is stochastic gradient descent; the term "stochastic" comes from the fact that you are randomly selecting your training samples (a minimal sketch of the loop is below).
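A minimal sketch of this loop, with the same placeholder names as before; the fresh random permutation at the start of every epoch is the critical part.

```python
import numpy as np

def sgd(W, X, D, grad_div, eta=0.1, epochs=10):
    """Stochastic gradient descent: one parameter update per training instance,
    with the training data randomly permuted at the start of every epoch."""
    N = len(X)
    for _ in range(epochs):                 # one pass over the data = one epoch
        order = np.random.permutation(N)    # random order, to avoid cyclic behavior
        for t in order:                     # T updates within the epoch
            g = grad_div(W, X[t], D[t])     # gradient of the divergence for ONE instance
            W = W - eta * g                 # small, immediate adjustment
    return W
```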
This is nothing special to neural networks — it applies to any kind of function optimization — but in our context it works really well for neural networks.

So why would you expect this incremental update to work better, and under what conditions? Intuitions apart — you're still going to work with the intuitions, but let's get a somewhat stronger handle on the problem. There are two questions: why will it work, and under what conditions. Consider the "why" first; I'm going to give you the simplistic explanation that is often given, using an extreme example. Remember that when I compute the average gradient of the divergences over all of my training instances (that should have been an L on the slide), I am averaging the gradient of the divergence for each training instance. All of my training instances are different, so their gradients point in different directions. If you have a function of this kind, the gradients likely look like what you see at the top right: each training instance has a gradient of a different length pointing in a different direction; you average the lot, and that gives you the resulting direction shown by the red line, which is the direction in which you adjust your function.

Now the extreme case: all of my training points are the same, for some bizarre reason. If I have T training points and every one of them is identical, then the gradients of all the divergences point in exactly the same direction, and the average is exactly the same as the gradient for any one training instance. As a result, if you performed a batch update after processing the entire training data, you got exactly as much information as you would have gotten from processing only one training instance. Whereas if you updated your parameters after each training instance, then in one pass through the training data you would have made T updates — T times the bang for the buck from making incremental updates over your training instances.

This obviously holds if all of them are exactly the same training instance, but you can imagine it's also going to be the case if all of the training instances are very close: averaging them effectively gives you the equivalent of having seen just one of them, or perhaps the average training instance, whereas if you make updates after seeing every single instance, you get T times the corrections. As the data become increasingly diverse, the benefits of incremental updates decrease, but they do not entirely vanish; even in the limiting case you get some benefit — exactly what benefit, we will see in a bit. Questions, anyone? (On batching: we haven't yet gotten to mini-batches. In practice you don't want to sample that way, because within a single epoch you would basically be wasting computation; but in theory, yes — strictly speaking it is literally random sampling, which is what makes it stochastic.)
So now we have some intuition of how and why this works, and any intuition you formed from that picture is going to be a fairly good one. But what are the considerations, and how well does it work?

One significant consideration is the learning rate. Suppose I have a whole bunch of training points and I'm incrementally making an adjustment after each one. Consider this collection of training points following a linear trend: if I'm trying to fit a line, the red line over there is the optimal linear regression for the entire set of training points — on average it's the best case. But what happens for any individual training instance? For any individual training instance it is wrong. So when you consider individual training instances one at a time, your current function is almost certainly always going to be wrong for that instance, and if you're making corrections for individual instances you will always find something to correct, even if you have already found the optimal solution. How does this interact with the claim that making incremental updates can actually get you to an answer? It has to do with your step sizes. If the step size you use to make each correction remains constant, then clearly you are always going to be chasing your tail, because for the latest training instance you're almost certainly wrong. As time passes you want the step sizes to shrink: you must shrink the learning rate with iterations, so that as you keep going through the data the corrections eventually become negligible.

Here is the gradient descent pseudocode again, changed a little: there is the random permutation as before, but observe that I'm now keeping track of how many updates have been made so far, and the step size is adjusted after each update. How exactly should the step size be adjusted? I made the statement earlier that in the generic case, with non-convex functions, you want your step size to shrink (for convex functions a fixed step size works just fine), but I didn't really give you much of a rationale; the rationale becomes much more apparent when you look at something like stochastic gradient descent. Stochastic gradient descent can give you the correct, optimal solution, but under a specific condition on the step sizes: they must allow you to explore the entire space, which means the sum of the step sizes must be infinite; but at the same time the step sizes must shrink, so that in the limit they become small enough that you are no longer making adjustments and chasing your tail, which means the sum of the squared step sizes must be finite (the two conditions are written out below).
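Written out, the two conditions on the step sizes $\eta_k$ just stated are:

$$\sum_{k=1}^{\infty} \eta_k = \infty \qquad\text{and}\qquad \sum_{k=1}^{\infty} \eta_k^{2} < \infty ,$$

and a schedule of the form $\eta_k = \eta_0 / k$ satisfies both, which is the 1/k form discussed next.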
You can see the logic behind the first requirement. Suppose I have an upper bound, call it α, on the magnitude of the derivative. Then the size of the k-th step is at most α times the step size η_k, so the total correction over all steps is at most the sum over k of α·η_k, with α pulled out of the sum. If the sum of the step sizes is a finite number, say C, then the maximum correction I can ever make from my starting point is α·C, even after infinitely many steps — which means that if my initial estimate is over here, I am never going to be able to find parameters outside that ball. To avoid this, you want the step sizes to sum to infinity; at the same time you want the steps to shrink, so the sum of the squared step sizes must be finite. The fastest-converging series that satisfies both requirements is 1/k, and pretty much every learning schedule you will see — in your homeworks and in practical implementations — is some variant of 1/k. It isn't exactly 1/k: you might keep the same step size for a while, then reduce it, keep that for a while, and so on, but it has this general form, and the step sizes eventually shrink to zero.

You can characterize the convergence of all of these gradient descent rules (I'm going to need some board space). Convergence can be defined in terms of how far the value of the loss function is from the optimal, minimal loss you can get for this particular network. Using the optimal learning rate, the theory says that for strongly convex functions, stochastic gradient descent converges as 1/k: after k iterations you are about 1/k away from the optimal solution, meaning that after infinitely many iterations you reach it — but if you think about it, this is actually quite slow convergence. That's for strongly convex functions; for generically convex functions that are not strongly convex, it turns out you need a learning rate of 1/√k, and the convergence rate is only 1/√k. But remember that in a single pass through the data you make T updates, so here k is the number of updates, not the number of passes through the data.

Compare this to batch gradient descent. In batch gradient descent, for a strongly convex function, the rate of convergence is c^k with c less than 1, so it gets to the correct solution exponentially fast (although we call it linear convergence, because we speak in terms of logs). To get within ε of the optimum, the number of iterations you need is on the order of log(1/ε), whereas for stochastic gradient descent you need on the order of 1/ε — the log greatly reduces the number of iterations required. For generically convex functions, batch descent needs on the order of 1/ε iterations. So the kind of convergence you get with stochastic gradient descent on strongly convex functions is comparable to what batch descent gives you on functions that are not strongly convex, and you would expect batch descent to be much, much faster than stochastic gradient descent in terms of iterations (the rates are summarized below).
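Summarizing the rates quoted above (k counts parameter updates, W* is the optimum):

$$\text{SGD, strongly convex: } L(W_k) - L(W^*) \in O\!\left(\tfrac{1}{k}\right); \qquad \text{SGD, convex: } O\!\left(\tfrac{1}{\sqrt{k}}\right); \qquad \text{batch, strongly convex: } O\!\left(c^{k}\right),\ c < 1.$$

Equivalently, the number of iterations needed to get within $\varepsilon$ of the optimum is $O(\log(1/\varepsilon))$ for batch descent on strongly convex functions, versus $O(1/\varepsilon)$ for SGD on strongly convex functions and for batch descent on generically convex functions.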
(What does strongly convex mean? It's in the slides from a couple of lectures ago: a strongly convex function is one that sits within a quadratic bowl. A convex function is basically a bowl; there are convex functions inside which you can fit a quadratic bowl, which means they are not as sharp as a quadratic, and other functions which themselves sit inside a quadratic bowl, meaning they are at least as convex as a quadratic. Take a look at the slides; there are nice figures in there.)

Anyway, here is a simple example — this one is for k-means, not for a neural network. The red line is SGD and the green line is what you get with batch updates. First, observe that SGD converges much faster than batch updates: the training time it needed here was about one CPU second, whereas it took a hundred seconds for the loss to taper off with batch updates. That's the good part. The flip side is that the error is much higher — the actual solution found is much worse. And look at the error bars: if I rerun with different initializations I end up at different points, and the error bars are much, much bigger for SGD than for batch updates. So even when SGD converges, you can expect solutions of varying quality, whereas with batch updates the solutions are much more consistent.

Why is this the case? Remember — this is just a recap of what I have up here — the function you're really trying to minimize is the expected value of the divergence between the output of the network and the true desired output at each point, expected over the entire domain of the input. In the generic case, g(X) is the function you are trying to model and f is your network, and you're trying to minimize the expected divergence between the two; this is a statistical expectation. What we actually minimize is the loss, which is the average divergence over a bunch of training points: the loss is (1/N) times the sum over i of the divergence at f(X_i) (I'm using shorthand notation). But if I take the expected value of this loss, it is (1/N) times the sum of the expected values of the divergences; the expected value of the divergence doesn't depend on the specific instance if I think of my collection of N training samples as randomly chosen, so it is simply (1/N) times N times the expected divergence, and the N's cancel out. What this means is that the expected value of my loss function is exactly the actual objective that I would like to minimize. So why does this empirical loss make sense? It makes sense because, in expectation, it really is the function we want to minimize, and we can hope that if you minimize this loss, the true objective too is reduced, if not minimized. This is a hope — there are no real guarantees (it turns out there are all kinds of bounds) — but the expectation is that the function you actually minimize must, at least in expectation, be the same as the function you would like to minimize. In other words, the empirical risk is an unbiased estimate of the expected loss.
What do I mean by unbiased? That the expected value of my loss is exactly the quantity I would like to minimize — there is no difference between the two in expectation — which means that minimizing one could potentially also minimize the other. (Question: what is the left-hand-side expectation over? It is over all of the training points, i.e., over P(X).) There is a little gap between practice and theory here, in the sense that we are also trying to make maximum utilization of all of our training data, and that is basically why we do what we do; in fact the difference between sampling with and without replacement is quite small when the training data are sufficiently large. There are the theorems, and then there is what we actually do; the bounds for sampling without replacement are somewhat different and the proofs somewhat more involved.

But here is the more important thing. Look at this quantity: it is the average of the divergences over many training instances. What is the variance of this loss? Every time I draw N training samples, I get a different set, and the average loss computed over them is going to be different depending on the training set — so there is a variance across different sample sets. What is that variance? If you assume that all of your training instances are independent, this is the variance of the average of N samples. Working it out on the board: if X̄ = (1/N)(X_1 + X_2 + ... + X_N), where the X_i are identically distributed and independent (i.i.d.), then the variance of X̄ is the sum of the variances of the X_i/N terms, which is (1/N²) times the sum of the individual variances; since those variances are all identical and there are N such terms, this is (1/N²)·N·Var(X), which cancels down to Var(X)/N. I'm writing X on the board, but this holds for any random variable; in our case the random variable of interest is the divergence. So the variance of this loss is (1/N) times the variance of the divergence for an individual training instance. What this means is that over different sample sets you will compute different losses, but the variance across them shrinks with the number of samples.

Now, if I'm performing stochastic gradient descent, then at any individual update I am trying to minimize the divergence for just one training instance: the loss I minimize at any single step is simply the divergence for a single training instance. Its expected value is the expected value of that divergence, which is exactly the same objective we were trying to minimize before. So even when I'm performing stochastic gradient descent, the loss that I'm trying to minimize at each step is an unbiased estimate of my true objective function. The real problem lies with the variance.
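In symbols, with $\sigma^2$ denoting the variance of the per-instance divergence:

$$E\big[\mathrm{Loss}_N\big] = E[\mathrm{div}], \quad Var\big[\mathrm{Loss}_N\big] = \frac{\sigma^2}{N}; \qquad E\big[\mathrm{Loss}_{SGD}\big] = E[\mathrm{div}], \quad Var\big[\mathrm{Loss}_{SGD}\big] = \sigma^2 ,$$

where $\mathrm{Loss}_N$ is the N-sample empirical loss and $\mathrm{Loss}_{SGD}$ is the single-instance loss. Both are unbiased estimates of the true objective; they differ only in variance.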
How so? The variance of this single-instance loss is simply the variance of the divergence itself — one times the variance of the divergence. Compare this with the earlier case, where we had (1/N) times the variance of the divergence: in the batch-update case, the variance of the loss function you are minimizing is one N-th of the variance of the loss function you would be minimizing if you were performing stochastic gradient descent.

All of this may not mean much as abstract math, so let's look at some figures. The blue curve is the function you're trying to approximate; the red curve is the approximation you get at the current estimate of W; and you want to minimize the shaded area. The heights of the shaded region at each point represent the point-by-point error — basically the divergence — and we want to find the W that minimizes the shaded area, except that we don't have the entire function, so we collect a bunch of samples and compute the average height at those samples, using that as a proxy for the shaded area.

Given this, the average length you compute depends on the positions of the training samples. With the first collection of samples — some from the red regions, some from the blue regions — you get one estimate of the error. In a different experiment, with a completely different set of samples — say, entirely from the blue regions — I would get the idea that I should increase (or decrease) the value of the function everywhere, which would be wrong for the red regions. Just by getting a different set of training samples, I get an entirely different idea of how to correct the function; with yet another set, a different idea again of how to adjust the function to make it a better approximation. So different sets of training samples give me different ideas of what the actual error between the true function and the current approximation is, and our notion of how to correct the function depends on the locations of the samples.

How can we make this more robust to the locations of the samples? Anybody? Normalization? No — normalization is not going to change the positions of the samples; you still have the same error. The only real way is to have more samples: the more samples I have, the better my approximation of the error is going to be, and the better the idea it gives me of how to adjust the function. That is basically what we are saying when we say the variance of the loss is inversely proportional to the number of samples: with more samples, the difference in the loss I compute across different sample sets is going to be smaller.
Now take the extreme case: suppose I have only one sample — this is what happens with SGD. If my one sample happens to fall at that one point, it tells me everything is fine, the world is fine, don't bother. On the other hand, if my sample happens to fall here, it tells me the value of the function must really increase; if it falls here, that it must increase a lot; here, that it must decrease a lot. So when you sample the function at only one point, your idea of the actual error can swing extremely wildly — the variance is very large. That explains what was happening in the earlier plot: when you perform SGD, different runs give you different solutions, and those solutions can vary quite widely, whereas with batch gradient descent the variation is much smaller.

So SGD uses the gradient from only one sample at a time; it consequently has high variance, but it also provides significantly quicker updates — and if your data are not randomly scattered but clumped, SGD gives you much more per data instance. Is there a good medium? That's where the mini-batch update comes in. In the mini-batch update you adjust the function over small, randomly chosen subsets of training points, keeping the adjustments small exactly as before; and if the subsets cover the training set, you eventually see all of your training points. For instance, instead of looking at all six points at once, you might take these three points and make corrections for them, which gives you some correction; then in the next step you randomly select three other points and make corrections for those; and so on. You can see how this works out in the pseudocode: it is exactly the same as SGD, except that now you wait for an entire batch, compute the average derivative over the batch, and use that in your update — and you still have to make sure your step size shrinks (a sketch of the loop is below).

Why would you expect mini-batches to work better than SGD? The batch loss for a mini-batch is the average of the divergences across the mini-batch. What is its expected value? The b's cancel out, and it is still the expected value of the divergence itself, so the mini-batch loss is an unbiased estimate of the actual objective you're trying to minimize. That much holds for stochastic gradient descent, for mini-batches, and for full-batch updates alike; the kicker is in the variance. If you look at the variance of this loss, it is the variance of the divergence divided by b, the mini-batch size.
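A minimal sketch of the mini-batch loop just described, with the same placeholder names as before and a simple 1/k-style decay standing in for whatever schedule you actually use.

```python
import numpy as np

def minibatch_sgd(W, X, D, grad_div, eta0=0.1, batch_size=32, epochs=10):
    """Mini-batch SGD: average the gradient over b randomly chosen instances,
    update once per batch, and shrink the step size as updates accumulate."""
    N = len(X)
    k = 0                                        # number of updates made so far
    for _ in range(epochs):
        order = np.random.permutation(N)         # random order every epoch
        for start in range(0, N, batch_size):
            batch = order[start:start + batch_size]
            g = np.zeros_like(W)
            for t in batch:                      # average gradient over the mini-batch
                g += grad_div(W, X[t], D[t])
            g /= len(batch)
            k += 1
            eta = eta0 / k                       # one possible shrinking schedule
            W = W - eta * g
    return W
```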
So the scale factor on the variance is 1 for SGD, 1/N for a full batch, and 1/b for a mini-batch. If you plot this as a function of the batch size, it falls off very fast: the difference in variance between a mini-batch update and a full-batch update is not very much, while at the same time the mini-batch gives you many updates in a single pass over the training data.

There is a little bit of hand-waving here. If you look at the convergence rate for generically convex (not strongly convex) functions, the rate for SGD is on the order of 1/√k, while for mini-batches it is on the order of 1/√(bk), plus a 1/k term that doesn't really matter. Because the denominator is larger, it looks like mini-batches converge much faster per iteration. The catch is that each mini-batch iteration processes b instances, so in terms of actual CPU cycles mini-batches can end up slower than stochastic gradient descent. This is where vector processing comes in: once you throw GPUs and the like at the computation, the factor of b essentially goes away, and then the convergence really is much faster.

If you look at the real experiments, here is what you get: the green line is batch descent, red is stochastic gradient descent, and blue is the mini-batch. The mini-batch is not as quick as SGD, but it actually gets to the real solution, it has very low variance, and it gives you essentially the same solution as the full batch, but orders of magnitude faster.

All of these plots show the overall training loss, but in reality, when you're actually training, you never compute the loss over the entire training set after each update — only over the current instance or the current mini-batch. Making plots like these would be quite impossible unless you are willing to burn a lot of CPU just for the sake of educating yourself. More typically, we estimate the loss as the divergence or classification error on some held-out set, or by maintaining some kind of running average over the past several batches. But you get the idea: the distinction between SGD, full batch, and mini-batch, and the trade-offs.

In practice, training is generally performed using mini-batches. The mini-batch size itself is a hyperparameter that must be optimized: we've seen that the variance of the loss depends on the mini-batch size as 1/b, and depending on the specific problem (and your initial estimate) there is going to be greater or lesser sensitivity to that variance. Convergence also depends on the learning rate; the standard technique is to fix a learning rate until the error plateaus, then reduce it by a fixed factor, and so on. Observe that this roughly satisfies the requirement that the sum of the squared step sizes is finite.
The requirement that the sum of the step sizes be infinite is never really a practical concern, because you are never going to explore an infinite space — your data tend to be restricted — so that is something we tend to be a little more cavalier about. You can also use adaptive updates, where the learning rate itself is determined as part of the estimation.

Story so far: SGD presents training instances one at a time and can be more effective than full-batch training; for SGD to converge, the learning rate must shrink sufficiently rapidly with iterations; SGD estimates have higher variance than batch estimates. Mini-batch updates operate on batches of instances at a time: the estimates have lower variance than SGD, but the convergence rate is theoretically worse than SGD in terms of CPU time (not in terms of iterations), and we can exploit this gap between CPU time and iteration count by performing vector operations and using the GPU. Convergence depends on the learning rate; the simple technique is to fix a learning rate until the error plateaus and then reduce it, while more advanced methods use adaptive updates where the learning rate itself is determined as part of the estimation.

Now let's revisit the training algorithms. We've seen the momentum method: updates are obtained using a running average of the gradient, so at each point the actual step is a linear combination of the previous step and the current gradient. You can also use this with SGD: with SGD you are making a lot more corrections, but the momentum method still works. In fact, something like momentum turns out to be somewhat more important when you're doing SGD than batch descent, because SGD and mini-batch gradients tend to have higher variance — the losses have higher variance and the gradients have higher variance, partly because you are sampling the data — which means that something that smooths out these variations becomes important. Here is the pseudocode (let's not get it wrong). The same goes for Nesterov's accelerated gradient: in Nesterov's method you first extend the current step and then compute the gradient, and you can do the same thing with SGD; nothing really changes, it is still, in principle, more optimal than the basic momentum method, and once again it smooths out the oscillations that occur particularly with these stochastic methods (both updates are sketched below).
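A sketch of both update rules written as single per-instance (SGD) steps; `beta` and `eta` are placeholder names for the running-average constant and the learning rate, not the lecture's notation, and W, v, g are assumed to be numpy arrays of the same shape.

```python
def momentum_sgd_step(W, v, g, eta=0.01, beta=0.9):
    """Classical momentum: keep a running average of past steps and move along it.
    g is the gradient of the divergence at the CURRENT parameters."""
    v = beta * v - eta * g       # running average of steps (the first moment)
    W = W + v
    return W, v

def nesterov_sgd_step(W, v, grad_at, eta=0.01, beta=0.9):
    """Nesterov's method: first extend along the previous step, THEN take the
    gradient at that look-ahead point. grad_at(W) returns the gradient there."""
    g = grad_at(W + beta * v)    # gradient at where you would end up
    v = beta * v - eta * g
    W = W + v
    return W, v
```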
So here is Nesterov's method, and we are moving on. Now, in all of this, what were we really doing when we used momentum methods or Nesterov's method? If the steps in some direction were sequentially converging — consistently pointing the same way — the running average made them longer; on the other hand, if the steps were oscillating, the average became very small, close to zero. You are always looking at the expected value of the steps, averaging the steps in each direction to decide what happens. That is the first moment of your steps — just the average. Can we do better than that?

Remember that when you're doing SGD or mini-batch updates, the variance of the objective — the variance of the divergences — is larger, so you expect the whole algorithm to swing around a lot more than with simple batch descent, and having something that smooths out the steps matters more. Maybe looking at only the first moment is not sufficient; maybe we can take one step further and look at the second moment of the steps. What would the second moment be here? Look at the variance, or the mean squared value, of the steps. In the first case, where they all point in the same direction, there is some RMS value. Now take the second case, where they are oscillating, and compare a mild oscillation with a wild one: if I look only at expectations, they are the same as far as I'm concerned — the positives and negatives cancel out. But is that enough? Clearly, in the second case something is off: the step sizes should be shrunk more than what I would do if I looked only at the average.

So we have a number of methods that also look at the second-moment terms of the gradients when deciding how to correct the current estimate: RMSprop, Adagrad, AdaDelta, Adam — all of them look at the second moment instead of, or in addition to, just the first moment. The basic idea is still the same (this slide just explains what I drew on the board): simple gradient and acceleration methods can still show oscillatory behavior. Let me skip a slide. Here is what we really want to do: we don't want to look merely at the average value; we also want to look at the second moment — the variance or mean squared value. Even when the average cancels out in both cases, in one case the oscillation is small: a moderately large step size isn't killing me, I'm managing to keep it under control. In the other case the oscillation is wild: large variance, large second moment, so I want the step size itself to shrink in that case.

We're going to incorporate that into these variants as normalized step sizes. Note that we are now modifying step sizes, not the gradients — the gradient is what it is; the thing you have control over is the step size. Here, for example, the total movement in the y-component of the updates is high, while the total movement in the x-component is lower, even though x is progressing consistently: the total movement in y is the sum of all these up-and-down variations, which is very large. When things oscillate, you expect the total movement along the oscillating direction to be fairly large. So what we do is use the same gradient-based updates as before, but shrink the step sizes in the directions where the oscillation — the total movement — is large.
And you expand the step sizes, or shrink them to a lesser degree, in the directions where the movement is much smaller — scaling according to the variation. That is the basic notion behind these second-moment methods.

The simplest one is RMSprop. RMSprop considers only the second-order term; it does not consider the expectation. Remember that we have two things going on: the expected value of these oscillations, and the size of the variation itself; RMSprop considers only the second moment. What you do is keep a running average of the second moment: you compute the mean squared value of the derivatives in each direction. I'm going to write the squared derivative at any time as (∂D)², with D for the divergence — this is not the second derivative, it is the square of the derivative, just shorthand notation. The mean squared value of the derivative is kept as a running estimate, and if the mean squared derivative in some direction is large, I shrink the step size in that direction; if it is small, I expand it. That is RMSprop. The entire procedure: you maintain a running estimate of the mean squared value of the derivative of each parameter — exactly a standard running-average estimate — and then the update is standard gradient descent (observe that there is no momentum term), except that the step size is inversely proportional to the root mean squared value. That naturally makes the step sizes for a wildly oscillating direction much smaller than for a consistent one. A very simple method.

(Question: yes, it just increases or decreases the step size in every direction — it's fairly agnostic to direction. But the point is that in something like a bowl, a convex region, you expect the derivatives to keep getting smaller and smaller, with no wild swinging, so reducing the step size there is not a big deal, whereas in the transverse direction it matters a whole lot. All that happens is that in directions of great swing, the step sizes are shrunk more than in directions of consistent progress. You will find in your homeworks and practical implementations that this genuinely affects convergence.)

The problem is that RMSprop is not considering the expected value, only the second-order moment. One reason RMSprop is interesting: if you've gone through the earlier slides, it is very similar to Rprop. Rprop uses a step size that doesn't actually depend on the magnitude of the derivative at all — just a step size that keeps getting scaled up and down. That is pretty much what RMSprop does: I have the value of the derivative divided by its root mean squared value, so in terms of magnitude essentially all that remains is the sign of the derivative. So RMSprop looks a whole lot like Rprop. If you plug it into the pseudocode, that is basically what it looks like (a sketch is below).
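A per-parameter sketch of the RMSprop update just described; `gamma` is the running-average constant and `eps` a small constant added for numerical safety — both names and default values are assumptions, not from the slides.

```python
import numpy as np

def rmsprop_step(W, mean_sq, g, eta=0.001, gamma=0.9, eps=1e-8):
    """RMSprop: running estimate of the mean SQUARED derivative per parameter,
    with the step size inversely proportional to its square root. No momentum term."""
    mean_sq = gamma * mean_sq + (1 - gamma) * g**2   # running estimate of E[(dD/dw)^2]
    W = W - eta * g / (np.sqrt(mean_sq) + eps)       # shrink steps where the swing is large
    return W, mean_sq
```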
The obvious next step is: why am I ignoring the expected value? I should be considering both the expected value and the second moment, should I not? It turns out that Adam is the algorithm that does this. In Adam's updates you maintain a running estimate of the mean derivative for each parameter, you maintain a running estimate of the mean squared derivative, and then you have a step size that looks at the mean but is normalized by the RMS value. So you're considering both the expected derivative and the variance; this is basically momentum with normalization. This is Adam. Now Adam has an extra term over here: it turns out that if you just naively implement Adam you can get some crazy behavior, so you have this extra term, which makes sure that things work properly in the earlier iterations; I'll leave you to figure out why that works, because I'm kind of running out of time. There are other variants on the same theme, like Adagrad, Adadelta, and AdaMax. All of these vary in how they treat the statistics of these derivatives, but the basic idea is very nicely captured between RMSprop and Adam: when your gradients are swinging around, you want to capture the average trend, but you also want to scale down the steps for things that swing a lot, so you want to account for both first and second moments, or any other moments you might think of. So you have all of these other solutions.
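Here, for concreteness, is a minimal sketch of the Adam step just described, written in the standard published form; the default hyperparameter values are the commonly used ones and are not something specified in this lecture.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # running mean of the derivative
    v = beta2 * v + (1 - beta2) * grad ** 2   # running mean squared derivative
    # The "extra term": both running averages start at zero, so early on they
    # underestimate the true moments; dividing by (1 - beta^t) corrects this
    # bias and keeps the first few iterations well behaved.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Momentum with normalization: the mean trend, scaled down by the RMS.
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy usage on the same elongated quadratic as in the RMSprop sketch above.
A = np.diag([1.0, 50.0])
w, m, v = np.array([2.0, 2.0]), np.zeros(2), np.zeros(2)
for t in range(1, 501):
    grad = A @ w
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)
```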
Now there's a very nice visualization of how these things work on the website listed down there. They have visualizations for Beale's function and several others, and you can see how the various techniques actually work. This function has a minimum out near the dark blue region to the extreme right; the surface is shown to the left, and the figure to the right shows the equal-value contours. You have SGD, momentum, Nesterov's method, Adagrad, Adadelta, and RMSprop; it doesn't show Adam for some reason, but Adagrad is going to behave somewhat like Adam. You can see that the rest of them have gone home while SGD is still wandering; the momentum techniques, observe, tend to swing out, whereas RMSprop gets there fairly cleanly, and if you actually had Adam it would get there even more cleanly in this particular problem. Here is another optimization problem: there's a nice little saddle, and the direction of the optimum is obvious. You can see Adadelta just zooms by, RMSprop is getting there, and poor old SGD is stuck somewhere on top; it figures it's happier exactly where it is. That gives you an idea. Now this is another crazy function; which one won over here? The pink one, Nesterov's method. So again, why this variation in the behavior of all of these? It really depends on the loss function. If the loss function has crazy eccentricities, then you expect something that accounts for both the magnitude of the swinging around and the expected value of the swinging around to do better than something that accounts for only one of the two, and something that doesn't account for either is always going to be really bad, which is why SGD just hangs around and doesn't do anything very useful. You can see RMSprop sort of accounts for the swinging, but in this function in particular, because you don't have crazy eccentricities, you'd expect something like Nesterov's method to work really well, and it actually does. Whereas in the previous example you do have some crazy eccentricities, so you'd expect Nesterov's method to not necessarily be as good as something like RMSprop: at some point it takes off, it finds that furrow, but otherwise it sort of keeps bouncing around. And again, if you had something that accounted for both the expected value and the second moment, it would perform better still.

So here's the story so far: gradient descent can be sped up by incremental updates; convergence is guaranteed under many conditions, but the learning rate must shrink with time; SGD, where you update after each observation, can be faster than batch learning but has high variance; mini-batch updates, where you update after small batches, can be more efficient than SGD and basically give you somewhat of a best of both worlds, provided your learning rates are carefully chosen; and convergence can be improved using smoother updates like RMSprop and the other techniques that look at the distribution of the derivatives.

What I would have done next is... I have five minutes, so let me look at one more topic: the things you have to consider, such as the divergence, and the tricks you can employ. Let's look at the divergence first, and the rest we will offload to the recorded lecture. Here's the total training loss: it is the average divergence over the entire training set. Now the convergence of gradient descent actually depends on the divergence; the better behaved the divergence function, the better gradient descent will converge. Ideally it should have a shape that results in a significant gradient in the appropriate directions, and in the wrong directions it should not have much of a gradient component at all; this is what guides the algorithm to the right solution. So if you have these three loss functions: the one to the left is a really terrible loss function, it's all peaky and has all kinds of local minima; you really don't want something like that. Both loss functions to the right are nice and smooth and have a global minimum, but the first one is not great, because initially it's very shallow, so you're going to take forever to get to the valley where you want to be, and then once you're in the valley it's so steep that you're likely to overshoot and swing around. The best kind of loss function is the one to the right, which is initially steep and then becomes shallow as you approach the optimum, so you can get there in a much more controlled manner. That's basically what this figure is showing: with the surface on the left you're going to get behavior of this kind, even though it has a nice unique global minimum, whereas the surface on the right is going to give you much cleaner convergence.

Now let's look at our two most popular choices of divergence: one is the L2 divergence and the other is the Kullback-Leibler (KL) divergence. We've seen these enough in your homeworks that you know exactly what each of them is. Why choose L2 over KL, or vice versa?
Now, in many, many applications the L2 divergence has long been favored; any time you do any kind of math, people begin by talking about the L2 divergence, unless you're looking at classifiers. It's particularly useful when you're trying to perform regression, where you're trying to predict real-valued outputs. But if the intent is classification, the Kullback-Leibler divergence is actually more appropriate, and why is that? Here's something surprising: I'm going to tell you that the L2 divergence is not convex. How do you define L2 again? It's a quadratic: the L2 divergence is (1/2)·||f(X; W) − d(X)||², where the 1/2 can be ignored as a scaling factor. This is a quadratic, and what do quadratics look like? They have this nice bowl shape; there's nothing more convex than that. But here's the point: if I'm doing something like logistic regression, then I have a sigmoid, or a softmax, which is basically logistic regression in many dimensions, and the divergence becomes ||σ(XW) − d(X)||². This is quadratic and nice and bowl-like with respect to σ, but what does it look like with respect to W? It turns out that when I look at the behavior with respect to W, the Kullback-Leibler divergence and the L2 divergence look very different. The figure to the left is the L2: it looks like a flower; it is literally the bad shape I told you about earlier. The KL divergence, on the other hand, gives you this nice beautiful bowl as a function of W, not just as a function of the output of the softmax. So, in answer to the question I got a long time ago, why KL and not L2 in these cases, here is the answer: if you have a classification problem and you want loss functions that are convex in your parameters, it turns out KL is convex and L2 is not.

And just one final note on the L2 divergence: when you take the derivative of the L2 divergence, which is basically the squared error, the derivative is going to be the error times a Jacobian, so the error term itself ends up appearing in your gradients. This is why we mentioned earlier that when you compute your back-propagation the (Y − d) term remains throughout, and so you end up calling it error back-propagation; that's just a note about nomenclature. So, to close the story so far: the choice of divergence affects both your ability to learn the network and the results you will obtain from it, and in our particular problems, if you're trying to train a classifier, KL is probably going to behave better than L2. I'll stop here, and I don't think we have time for questions; if you have any, post them on Piazza. Thank you.
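As a small footnote to the L2-versus-KL point above, here is a numerical check of the convexity claim (my own sketch, assuming nothing more than a single logistic unit y = sigmoid(w·x) and one training pair, which is already enough to show the effect):

```python
import numpy as np

# One logistic unit y = sigmoid(w * x) with a single training pair (x, d = 1).
x, d = 1.0, 1.0
w = np.linspace(-6.0, 6.0, 601)
y = 1.0 / (1.0 + np.exp(-w * x))

l2 = 0.5 * (y - d) ** 2                          # L2 divergence
kl = -(d * np.log(y) + (1 - d) * np.log(1 - y))  # KL (cross-entropy) divergence

# A convex function of w has non-negative second differences on a uniform grid.
print("min 2nd difference, L2:", np.diff(l2, 2).min())  # negative: not convex in w
print("min 2nd difference, KL:", np.diff(kl, 2).min())  # non-negative: convex in w
```

The L2 curve flattens out in its tails as a function of w, which is where the negative second differences come from; it is the one-dimensional version of the flower-shaped surface mentioned above, while the KL divergence stays convex in the weight.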
Info
Channel: Carnegie Mellon University Deep Learning
Views: 3,472
Rating: 5 out of 5
Id: fChBkJ_UjRw
Length: 79min 6sec (4746 seconds)
Published: Mon Sep 23 2019