Lecture 6 | Convergence, Loss Surfaces, and Optimization

Captions
Why do we have only — five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, including this boy — seventeen people in the class? This is lecture number 6, so it's an exponential fall-off. I can hope that by lecture number 10 there will be one person in class.

Have you folks been following Piazza at all? There are three categories of questions. One is the usual person who's got some kind of minor confusion. The second is a category of really smart questions, good questions that I enjoy seeing, and very often they find mistakes in what I say. I get defensive — I can't help it, I'm human — but the fact is, if people find mistakes in what's on the slides or in what I say, I'd like to be informed of it, so that instead of making a fool of myself infinitely far into the future, it stops in this class and I don't do it the next time. And then there's a third category of questions which basically tells me that this person hasn't watched the video or seen the slides — or if they have, they've done it at 1.5x and jumped over various portions so they can finish a one-and-a-half-hour lecture in 15 minutes, and then you get questions where it's very obvious that they've done that. So what I plan to do henceforth is keep track of questions which are obviously from students who haven't bothered to follow the material. I can't penalize you for attendance, because we don't have mandatory attendance, and I can't penalize you for not watching the lectures, but henceforth — and this is not for you guys — anybody who posts questions whose answers are already on the slides and in the videos might just lose points. I haven't yet decided how many points I'm going to assign to this, but it's got to be enough that it actually worries you, because the point of my coming here and doing these lectures is that you're supposed to be watching this stuff, and if you already know it you have no business being in this class. Thirty people in lecture seven is kind of not okay.

All right. Where we stopped in the last class, before Mike's lecture: we had just seen how neural networks can be trained via gradient descent, that the derivatives required for gradient descent can be computed using backpropagation, and — in the manner in which I presented it — I said that backpropagation is not guaranteed to find the correct solution even if it exists. There's something wrong with that statement. What's wrong with it? Backprop is not solving the problem; it's much more mechanical than that. What does backpropagation actually do? It calculates gradients; it calculates derivatives. What does calculating derivatives have to do with finding the optimal solution? It's what you use the derivatives for that matters. So the statement is fundamentally broken — it just happens to be the title of the paper. What they were really trying to say, and what we were really trying to say, is that gradient-descent-based methods might not find you the optimal solution, might not find a solution at all.
The local minimum that you find in the objective function used to optimize the network might not actually correspond to the answer that you want. Remember, we had built an objective function that was differentiable, because classification error is not differentiable. The reason is simple: when we design these objective functions, we design them to carry information about how far the data instances are from the boundary. For example, suppose I'm using something like a sigmoid, I have a couple of plusses over here and a minus over here, and I compare two different decision boundaries. If there's any limit on how steep the sigmoid can get, then an instance very far from the boundary contributes so much to the objective that it can make up for the misclassification error of a negative point near the boundary — the boundary that misclassifies a point can still score better than one that classifies everything correctly. This happens precisely because you're taking into account how far the data instances are from the boundary, and the reason we need that is that's how you make the whole thing differentiable. So even when you hit the optimum of your objective function, it might not be the perfect answer. But we claimed this was not a bug, it was a feature: it keeps things from swinging around wildly — you're accepting some bias in exchange for lower variance. So we assume the training arrives at a local minimum. (Thank you very much — if I happen to fall asleep while teaching this, wake me up.)

Now, what do I mean by asking "does it always converge"? What does that even mean? Remember, this is an iterative algorithm: it has to walk its way to the minimum, and the question is whether it will actually arrive there. You might be doing all the right things and it still might not reach the minimum — we will see how this can happen. All of this is hard to analyze for an MLP, because as we saw, the loss function of an MLP can be very complex; that was the last thing we finished with. But there is one class of functions that we are very comfortable analyzing, namely quadratic functions.

Now, there was a nice discussion on Piazza in the past couple of days, which I happened to catch on the plane, where people were asking whether the second-derivative condition is really sufficient: if the derivative is zero, does checking the second derivative settle whether the point is a local minimum or a local maximum? The answer is that it doesn't always. So why have we been inflicted with this theory all our lives — check the first derivative, then check the second derivative, and if the second derivative is positive we're fine, it's a local minimum? A positive second derivative is a sufficient condition, but it is not a necessary one.
You can come up with functions where the second derivative is zero, the fourth derivative is zero, the nth derivative is zero — you've got to keep looking further and further out — so the second derivative by itself doesn't settle the question. If I have functions like the ones on the slide, where the first, second, third, fourth, fifth, sixth, and seventh derivatives are all zero at a point, then just from those derivatives you can't tell me whether the point is a minimum, a maximum, or an inflection point.

So why do we keep speaking of second derivatives? Whenever we speak of second derivatives we're doing something very common: we're using the so-called streetlight effect. You might have heard the joke — a guy loses a coin in the bar, but he searches for it under the streetlight, where he can actually see. That's exactly what we do: we have no idea how to analyze functions in general, but we're pretty good at analyzing quadratics, so we come up with second-order approximations to everything, and every statement we make tends to be in terms of those approximations. In fact, you can make a kind of best-case argument for second-order approximations. Compare |x| versus x² versus x⁴, and so on — you can keep raising the order; all of them have a minimum at 0. The problem with |x| is that its derivative doesn't tell you how far away you are — the derivative is constant — and at the optimum it isn't differentiable, so that's an issue. The quadratic is beautiful: its derivative is a linear function of the distance from the minimum, so it carries information about how far away you are, and it remains significantly nonzero all the way down to the optimum. The fourth-order function also has its minimum at 0, but think about gradient descent on it: it takes giant steps on the steep sides, and by the time it gets near the minimum the derivative is so weak that everything flattens out — your algorithm cannot distinguish an entire range of points at the resolution it's using. The steeper the function gets at the sides, the shallower it gets near the minimum. So the quadratic is the beautiful crossover point between something that's analytically difficult to handle and something that becomes too flat for you to get a reasonable handle on its behavior. That's why we phrase pretty much everything we do in terms of quadratics.

Now, the loss functions we encounter are generally not convex. A convex function is any function that keeps curving upwards — all the curves at the top of the slide are convex. What do I mean by a convex function?
If I take any two points on the curve and connect them with a line, every point on that line lies on or above the curve — that's a convex function. It's a very simple definition. So the curves on top are convex; the curve at the bottom is not, although it has a nice global minimum. You can also speak of convex sets — there are many ways of mathematically characterizing convexity, and they all amount to the same thing. A set in some space is convex if, when I take any two points in it and connect them, every point on the connecting line also falls within the set. The set on top is convex; and if you take the contour plot of a convex function, all the equal-value contours bound convex sets as well. The set at the bottom is not convex, because as I travel in a straight line from x to y, at some point I have to leave the set. General analyses assume convex functions over convex supports, because these are convenient to analyze — the streetlight effect again.

Now, what do I mean by convergence of gradient descent? An iterative algorithm is said to converge if the updates eventually arrive at a fixed point — by fixed point I mean a point where the gradient is zero, so further gradient-based updates do not change the estimate. Your gradient descent might not actually converge even though the optimum exists; we'll see why in a bit. What could happen? You can get the behavior at the top: you keep taking steps, each a function of the current gradient; the steps get shorter and shorter as you approach the optimum, and then you stop. That's a nicely converging solution. In the second case, it gets close to the optimum but never actually reaches it — it keeps bouncing around. That's a jittering solution; it's not converging. In the third case, the steps are too large and you swing outward — a divergent solution. The behavior we want is the converging one.

So when will the algorithm actually go to the optimum? You can speak of a convergence rate. There are many ways of defining it — any numeric quantification of the convergence of a series can be used, because a sequence of steps really is a series. One standard mechanism is to look at how far you are from the end point of the series — here, the actual minimum. If you're walking toward the minimum, take the distance to it at the current step and the distance after the next step; their ratio,

|x_{k+1} − x*| / |x_k − x*|,

tells you what fraction of the remaining distance survives each step, and that's one nice way of quantifying how fast you're currently converging.
If this ratio is upper bounded by a constant c < 1, the convergence is called linear. If it's not bounded, the distance at the next step can be greater than the distance at the current step, and the sequence can blow up. You want this ratio to always be less than one — that's when you know you're always getting closer to the solution; and even if it's occasionally greater than one, as long as it's less than one on average, you can expect to keep approaching the solution. We call this linear convergence, but the actual rate of arrival at the optimum is exponentially fast: after k steps you are a factor of cᵏ closer to the optimum than you were at the beginning. In logs that's k·log c, linear in k — which is why it's called linear. So when I speak of linear convergence, the algorithm is converging to the optimum exponentially fast.

Now let's go to our favorite setup: quadratic surfaces. A quadratic surface is a nice bowl. For a quadratic function of one variable we typically write

f(x) = ½ax² + bx + c,

where the half is simply a matter of convenience, so that the 2 cancels when you take the derivative. (To answer the question from the floor: we're just speaking of the rate of convergence here. If I knew x* I wouldn't need to iterate at all; the point is that you can get guarantees without knowing it, and quadratic functions are one case where you can.) So, for a quadratic function I start at some point, and the mechanism is what we saw: at each step you move against the gradient. You're trying to minimize ½aw² + bw + c, and the gradient update rule — the second line on the slide — says the (k+1)-th estimate is the k-th estimate minus some step size times the derivative at the current location:

w_{k+1} = w_k − η f′(w_k).

We're estimating the scalar parameter w. Now, what step size η will get you to the optimum fastest? Can you look at this and tell me? That's the answer to your question right there: if I know the function, I don't need to know where its minimum is — I can tell you the step size that gets there fastest for a quadratic. How? I'll give you guys a couple of seconds to think about it, and then let's work it out. (Something is wrong with this slide, by the way — I keep editing it, and every time I save it the figure changes, because Microsoft messes it up. The figure on the left is supposed to be a bowl.) So: I have f(x) = ½ax² + bx + c, which looks like a parabola.
Now suppose my current estimate is x₀. What is the Taylor series expansion of this function around x₀? (Everybody's familiar with Taylor series? Anybody who isn't?) It's

f(x) = f(x₀) + f′(x₀)(x − x₀) + ½ f″(x₀)(x − x₀)².

You forgot the half — correct, perfect. And do I need to expand it further? No: this is a second-order function, so the second-order Taylor expansion is exact. Now, how do I solve for the optimum? Take the derivative and set it to zero:

f′(x₀) + f″(x₀)(x − x₀) = 0

(the 2 and the ½ cancel), which gives

x = x₀ − f′(x₀)/f″(x₀).

Makes sense — we just took the derivative and equated it to zero. Now look at our gradient update rule:

x₁ = x₀ − η f′(x₀).

Compare the two. What step size gets me to the optimum in a single step?

η_opt = 1/f″(x₀).

If I use that step size, in a single step you get to the minimum — no, I'm not scaring you, sorry — and I can do that without knowing where the optimum is. More generally, it can easily be shown that if the step size is appropriate, convergence for a quadratic function is linear. (The missing figure on the left — thank you, Microsoft — is supposed to be a bowl showing all the relevant parts.) Basically, what we found is that the optimal step size is the inverse of the second derivative of the function. But what is the second derivative of f(x) = ½ax² + bx + c anywhere? The first derivative is ax + b; the second derivative is just a. It doesn't depend on the location — which means you can get to the solution in a single step from any current location. For a quadratic it's trivial; see the sketch below.
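To make the one-step claim concrete, here is a minimal sketch (my own illustration, with made-up coefficients — not from the slides) of gradient descent on a 1-D quadratic using η = 1/f″:

```python
# Gradient descent on f(x) = 0.5*a*x^2 + b*x + c.
# The exact minimum is at x* = -b/a; the optimal step size is 1/f''(x) = 1/a.
a, b, c = 3.0, -6.0, 1.0          # illustrative coefficients (hypothetical)
f_prime = lambda x: a * x + b     # f'(x)
eta_opt = 1.0 / a                 # inverse of the (constant) second derivative

x0 = 10.0                         # arbitrary starting point
x1 = x0 - eta_opt * f_prime(x0)   # one gradient step with the optimal step size
print(x1, -b / a)                 # both print 2.0: one step reaches the minimum
```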
Now keep this bowl in mind. I start off somewhere and use the rule x_{k+1} = x_k − η f′(x_k), with f(x) = ½ax² + bx + c, so η_opt = 1/f″(x_k) = 1/a, a constant. Case 1: η = η_opt — a single step takes me to the minimum. Case 2: η < η_opt — the next estimate lands short of the minimum, on the same side; the one after that gets closer still, and so on. I'm not reaching the solution in one step, but it converges steadily. Case 3: η > η_opt — and there are two regimes here. If η is just slightly greater than the optimal step size, I overshoot the minimum and land on the other side, then overshoot back: it oscillates from side to side, but the oscillations shrink and it still converges. Is there an η where the thing stops converging? Yes: at η = 2/f″(x), one step takes you to exactly the same height on the other side — remember, it's a quadratic, a symmetric bowl, so if some step carries you across the minimum to the same height, the same step carries you right back — and it keeps bouncing back and forth forever. And if η > 2/f″(x), each step lands higher than the one before: it diverges, it blows up. Makes sense to everybody? That's the pictorial characterization on the slide.

Now, this is for quadratic functions. What about a generic differentiable function? Suppose I give you some arbitrary function, telling you only that it has at least two derivatives everywhere. Could we use the same analysis to characterize the convergence of gradient descent? You gave me a partial answer some time ago — anybody want to take a guess? Taylor series, right — a second-order approximation. If I have some crazy function — convex, say, but not quadratic — I can take the current location and form a second-order Taylor series approximation at that point, which is a quadratic, and use that quadratic to get an idea of what a reasonable step size might be. So for an arbitrary twice-differentiable function: truncate the Taylor expansion at the second-order term (three terms, including the zeroth-order one), take the optimal step size to be the inverse of the second derivative at the current point, and expect that if your step size is more than twice that value, you're going to blow up. Everybody with me? The sketch below plays out all four regimes.
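Here is a small sketch (again with invented values) of the four step-size regimes on the same quadratic; the printed trajectories show monotone convergence, damped oscillation, a permanent bounce, and divergence:

```python
# Step-size regimes for gradient descent on f(x) = 0.5*a*x^2 (minimum at 0).
# x_{k+1} = x_k - eta*a*x_k = (1 - eta*a)*x_k, so behavior depends on |1 - eta*a|.
a = 2.0
eta_opt = 1.0 / a   # optimal step size = 1/f''

for label, eta in [("eta < opt       ", 0.5 * eta_opt),
                   ("opt < eta < 2opt", 1.5 * eta_opt),
                   ("eta = 2*opt     ", 2.0 * eta_opt),
                   ("eta > 2*opt     ", 2.5 * eta_opt)]:
    x = 1.0
    traj = []
    for _ in range(6):
        x = x - eta * a * x          # gradient step: f'(x) = a*x
        traj.append(round(x, 3))
    print(label, traj)
# eta < opt:        [0.5, 0.25, ...]        -> converges from one side
# opt < eta < 2opt: [-0.5, 0.25, -0.125...] -> oscillates but converges
# eta = 2*opt:      [-1.0, 1.0, -1.0, ...]  -> bounces forever
# eta > 2*opt:      [-1.5, 2.25, ...]       -> diverges
```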
Okay, now what if you have multivariate inputs? So far this was a scalar function of a scalar variable; now I give you a multivariate function. How would you mathematically characterize a quadratic of, say, two variables? It's a bowl, I know — of course it's a bowl, with a single minimum — but functionally, what's the form? Something like

f(x₁, x₂) = a x₁² + b x₂² + c x₁x₂ + d x₁ + e x₂ + f,

where the total degree of any term is bounded by two — that's what makes it quadratic. I can write this in vector form: put x₁, x₂ in a vector x, and the quadratic becomes

f(x) = ½ xᵀA x + bᵀx + c.

It's the same equation for any number of variables. Now, something interesting: what does this decompose into? Consider a very special case where the function is a sum of two independent quadratics, ½a₁x₁² + b₁x₁ + c₁ in the first variable and ½a₂x₂² + b₂x₂ + c₂ in the second. For any value of x₂, the quadratic in x₁ has the same shape; as x₂ changes, only the height of that bowl changes. Similarly, for any value of x₁ the quadratic in x₂ has the same shape, and only its height changes with x₁. So what does the joint function look like? It's a bowl — but an axis-aligned bowl. That's your standard multivariate quadratic in the special case where the matrix A is diagonal: when A is diagonal, the function is a sum of decoupled one-dimensional quadratics, one per direction — we'll get to the business of Hessians in a bit. If I plot the overall function and look at it from the top, I see equal-value contours; it gets deeper as it gets bluer. I'm assuming it's convex in every direction, so it looks like a bowl whose widths in different directions are different. Any vertical slice has the same shape at different heights, and likewise any horizontal slice.

Now suppose I run gradient descent on this. What is the optimal step size along x₁? It's a₁⁻¹ — the inverse of the second derivative in that direction — and along x₂ it's a₂⁻¹. But what does the gradient descent rule look like in vector form? It's

x_{k+1} = x_k − η ∇ₓf(x_k).

The step size η is the same for every single direction: you take a step against the direction of steepest ascent, scaled by a single scalar η. So what would a good step size be? A quick sketch of this vector form is below.
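As a quick illustration (coefficients invented for the example), the vector form of a decoupled two-variable quadratic and its gradient ∇f(x) = Ax + b:

```python
import numpy as np

# f(x) = 0.5 * x^T A x + b^T x + c with a diagonal A: the directions decouple.
A = np.diag([1.0, 3.0])          # second derivatives along x1 and x2 (hypothetical)
b = np.array([-2.0, 6.0])
c = 0.0

def f(x):
    return 0.5 * x @ A @ x + b @ x + c

def grad(x):
    return A @ x + b             # gradient of a quadratic

x_star = np.linalg.solve(A, -b)  # minimum: A x* + b = 0
print(x_star)                    # [ 2. -2.]
print(grad(x_star))              # [0. 0.] -- gradient vanishes at the minimum
```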
The optimal step size is different for different directions. So what happens if I take a₂⁻¹ as my global step size? I get to the optimum along x₂ in one step — but along x₁, the steeper direction in this example, it diverges; it blows up. What if I take a₁⁻¹? I get to the optimum along x₁ in one step, but along x₂ it takes forever; it becomes very slow. So what is the minimum criterion you can impose on the step size in this situation? It must be smaller than twice the smallest per-direction optimal step size: if it's larger than twice the optimal step size in at least one direction, it blows up in that direction. Note that here the directions are uncoupled: the optimal step sizes differ between the two directions, but the optimum along each coordinate is not affected by the other coordinate. When you use a vector update rule of this kind, you're using the same step size in every direction, and the direction you walk in is at 90 degrees to the equal-value contour — remember, the gradient is always orthogonal to the equal-value contour.

So here's the problem with this rule, with an example. I have a quadratic in two uncoupled variables where the optimal step size for the first coordinate is 1 and for the second coordinate is 0.33. Which of the two directions is the steeper one? You want the steeper bowl to have the smaller optimal step size, so the narrower direction corresponds to η_opt = 0.33 and the wider one to η_opt = 1. In the first case on the slide I used a global step size of 2.1 × 0.33 ≈ 0.7. What happens? 0.7 is less than the optimal step size in the first coordinate, so along the first direction it converges linearly; but it's more than twice the optimal step size in the second coordinate, so along the second direction it blows up. In the second case, the step size is exactly twice the optimal step size for the second coordinate: along the second direction it oscillates — it bounces between the same two points forever — while still converging linearly along the first. Third case: 1.5 times the optimal step size for the second coordinate. Now the second direction converges, but it keeps jumping across the optimum; the first direction still converges linearly. Fourth case: exactly the optimal step size for the second coordinate. In one step it reaches the optimum for x₂, but then x₁ gets there incredibly slowly. And if you keep the step size below the optimal step size in both coordinates, it slowly converges to the solution — it just takes a very long time. So the step size matters, and if your step size falls between the optimal and twice the optimal in some direction, you will see oscillatory convergence behavior of this kind — see the sketch below.
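A sketch of that failure mode (numbers chosen in the spirit of the example, not copied from the slide): one global step size applied to two directions whose optimal step sizes are 1 and 1/3.

```python
import numpy as np

# Decoupled quadratic f(x) = 0.5*(a1*x1^2 + a2*x2^2); per-direction
# optimal step sizes are 1/a1 = 1.0 and 1/a2 = 0.33.
a = np.array([1.0, 3.0])
eta = 2.1 / a[1]                 # 2.1x the smallest optimal step size (~0.7)

x = np.array([1.0, 1.0])
for k in range(8):
    x = x - eta * (a * x)        # gradient of the decoupled quadratic is a*x
    print(k, np.round(x, 3))
# x1 shrinks by a factor |1 - 0.7*1| = 0.3 each step (converges linearly);
# x2 grows by a factor |1 - 0.7*3| = 1.1 each step (slowly blows up).
```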
Now, this is only with two dimensions. If I give you many, many dimensions — 700, 70,000, seven million — it's really hard for you to know what's going on, and the behavior becomes increasingly unpredictable as the dimensionality increases. For the fastest convergence, ideally the learning rate must be close to both the largest and the smallest per-direction optimal step size, because the closer you are to a direction's optimal step size, the faster you converge along it. So if the ratio of the largest to the smallest optimal step size is very large, the function is going to take forever to converge. That ratio is the condition number: when the condition number is large, the function converges very slowly, or not at all.

Now, what's the underlying structure of the problem? I have two figures here — what's the difference between them? If the equation is ½xᵀAx + bᵀx + c, then in the second case A is not diagonal: the space has been rotated. What is the Hessian of this function — the matrix of second derivatives? It's just the matrix A. And here's the point: whatever A you give me, I can rewrite the function with a symmetric A, simply by averaging A with its transpose. So the Hessian of this equation is always symmetric — that's number one. Number two: what are the eigenvectors of this Hessian? Look at the figure. The eigenvectors are the directions of the principal axes of the bowl. And what's special about the eigenvectors? You can restate the entire equation in the new coordinate system given by the eigenvectors of the Hessian, and if you do that, the directions become uncoupled. This is why, when we optimize higher-order functions, we speak of Hessians: the idea is to take the second-order approximation, rotate into the eigenbasis, and then treat the directions independently — in this direction it's convex, in that direction it's concave, and so on. So the eigenvectors of the Hessian are just the principal directions of the bowl.

So far so good. But why were we having problems where the thing could blow up? Because the eccentricities in the different directions are different: the bowl is shallow in one direction and steep in the other. How can you fix this, Alex? Exactly — you can reshape the function. I can stretch the space along one axis so that the bowl has equal eccentricity in every direction, and if I do that, the optimal step size in every single direction will be the same. A sketch of the Hessian's eigenstructure is below.
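A small numpy sketch (matrix invented for illustration) of the claims just made: symmetrize A, read off the principal axes as the eigenvectors, and take the condition number as the ratio of extreme eigenvalues:

```python
import numpy as np

# A rotated (non-diagonal) quadratic f(x) = 0.5*x^T A x + b^T x + c.
A_raw = np.array([[2.0, 1.0],
                  [0.5, 3.0]])
A = 0.5 * (A_raw + A_raw.T)       # the Hessian can always be taken symmetric

evals, evecs = np.linalg.eigh(A)  # eigendecomposition of a symmetric matrix
print(evals)                      # per-direction curvatures (second derivatives)
print(evecs)                      # columns: principal axes of the bowl
print(evals[-1] / evals[0])       # condition number = lambda_max / lambda_min
```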
So what I will do is transform my space. I'm working with w; I'm going to rewrite the equation in terms of a ŵ = Sw for some matrix S, such that when I write the quadratic in terms of ŵ I get this beautiful form whose cross-section is perfectly circular — where the quadratic-term matrix is the identity. When A is the identity matrix, what's the optimal step size? It's 1, in every direction. So just compare the two equations — you don't even need to go through the math:

½ xᵀA x + bᵀx + c   versus   ½ x̂ᵀx̂ + b̂ᵀx̂ + ĉ

(I'm writing x here, I meant w, but you get the idea). If the two are the same function, what is the relationship between x̂ and x? We need xᵀAx = x̂ᵀx̂, which implies x̂ = A^{1/2}x — A is symmetric, so I can just write that down without thinking very hard. So what we're really doing is stretching the space: we transform the space by the square root of the multiplier of the quadratic term — keep that in mind for the diagonal case. And here's the magic: if I do that, it doesn't matter whether my original ellipses were axis-aligned or not. Regardless of the shape of the original function — whether the ellipses are aligned with the axes or at an angle — once I transform things in this manner, the contours become perfectly circular, perfectly spherical. Once that's done, I can write my update rule in terms of the modified variable ŵ, where I know the optimal step size is 1, no issues about it, and then translate back into the original space. I won't derive it — it takes one minute of derivation, but I think I'm being too slow today — it comes back as a simple update equation:

w_{k+1} = w_k − η A⁻¹ ∇_w f(w_k),

and the optimal η is again 1, so for a quadratic you don't even need the η:

w_{k+1} = w_k − A⁻¹ ∇_w f(w_k).

A sketch of this normalized update is below.
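A sketch of the normalized update on a hypothetical non-diagonal quadratic: multiplying the gradient by A⁻¹ equalizes the curvature in every direction, so a single step with η = 1 reaches the minimum from any start.

```python
import numpy as np

# f(w) = 0.5*w^T A w + b^T w; minimum at w* = -A^{-1} b.
A = np.array([[2.0, 0.75],
              [0.75, 3.0]])
b = np.array([1.0, -2.0])

w = np.array([5.0, 5.0])                   # arbitrary start
g = A @ w + b                              # gradient at the current point
w = w - np.linalg.solve(A, g)              # w - A^{-1} grad (optimal eta = 1)
print(np.round(w, 6))                      # equals -A^{-1} b: one step, any start
print(np.round(np.linalg.solve(A, -b), 6))
```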
Now suppose I have a generic differentiable function — twice differentiable everywhere, but not quadratic. What would the corresponding approximation be? What would you do? Exactly what we did in the scalar case: when we had a non-quadratic function, we took the second-order Taylor series approximation and kept its terms. How does the multivariate case differ? It shouldn't — it's exactly the same thing. The second-order approximation now looks like

f(w) ≈ f(w_k) + ∇f(w_k)ᵀ(w − w_k) + ½ (w − w_k)ᵀ H(w_k)(w − w_k),

where H is the Hessian — this is just the Taylor series approximation for multivariate functions. If you kept higher-order terms they would involve tensors, but we stop at second order. Using this, what does the update rule become? I have half an hour; I'm going to wait for you to tell me. Just equate it to the quadratic case: the equivalent of A is the Hessian, so the update is simply

w_{k+1} = w_k − H(w_k)⁻¹ ∇_w f(w_k).

Nothing really changed: at each step it's a different quadratic, with the Hessian in place of A. This is exactly Newton's method — Newton's rule. So for generic twice-differentiable multivariate convex functions, the update rule says: current estimate, minus the inverse of the Hessian times the current gradient. The η itself is ideally 1, though more generically it won't be. On a generic convex function it looks like this: you have some function with funny equal-value contours; at the current location you make a quadratic approximation, and in one step you find the optimum of that approximation; at that point you make a second quadratic approximation, in one step find its optimum, and so on, until you reach the solution.
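A minimal Newton's-method sketch under the assumptions just stated — the test function is my own choice (smooth, convex, non-quadratic, with analytic gradient and Hessian), not one from the lecture:

```python
import numpy as np

# Newton's method on f(w) = exp(w1) + exp(-w1) + w2^2 (convex, minimum at (0,0)).
def grad(w):
    return np.array([np.exp(w[0]) - np.exp(-w[0]), 2.0 * w[1]])

def hessian(w):
    return np.diag([np.exp(w[0]) + np.exp(-w[0]), 2.0])

w = np.array([2.0, 3.0])
for k in range(6):
    w = w - np.linalg.solve(hessian(w), grad(w))   # Newton step: w - H^{-1} grad
    print(k, np.round(w, 6))
# Converges to (0, 0) in a handful of steps -- much faster than plain
# gradient descent, but each step needs the (inverse) Hessian.
```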
The problem with this approach — and it's a beautiful approach, very, very fast — is twofold. First, if you have a complex model like a neural network with five million parameters, how large is the Hessian matrix? n² entries — on the order of 10¹². You are not going to compute the Hessian matrix, let alone invert it. That's not going to happen. Second — here's a one-dimensional equivalent — we found that when a function is convex but not quadratic, I can make a quadratic approximation, jump to its minimum, make another approximation, jump again, and eventually find the minimum. But how about a function that has a nice unique minimum and also a concave region? If I make the quadratic approximation at a point in the concave region, is it convex? No. If you try to find its minimum, there is no minimum — it's off at minus infinity or plus infinity. So when you make quadratic approximations of such functions, from some locations you will end up going in the wrong direction. You can't use this method blindly: even though the function is in some sense well behaved and has a unique minimum, your second-order method can take you to the wrong places and blow up in your face. So what is the characteristic of a Hessian that takes you to a bad place? It's not positive definite.

There are a great many methods which try to deal with this, and they deal with it in two ways: (a) they estimate the Hessian in a computationally tractable way, and (b) they modify it so that it's positive definite, so the algorithm can keep going regardless of the shape of the function. These are things like BFGS — Broyden–Fletcher–Goldfarb–Shanno — or its low-memory version, L-BFGS; you may have encountered these terms, and what they're really doing is approximating the Hessian in various ways. Or Levenberg–Marquardt, where the Hessian estimate comes from Jacobians and you diagonally load it to make sure the eigenvalues are positive. Various things — but for neural networks we don't use them, simply because the whole thing would be tremendously inefficient. A generic sketch of the diagonal-loading idea follows.
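To illustrate the "modify it to be positive definite" idea, here is a generic diagonal-loading sketch in the spirit of Levenberg–Marquardt — a simplification, not the exact algorithm:

```python
import numpy as np

# Diagonal loading: if the Hessian H has non-positive eigenvalues, replace it
# with H + lam*I so that all eigenvalues become positive before inverting.
H = np.array([[ 1.0,  0.0],
              [ 0.0, -2.0]])          # indefinite Hessian (a saddle direction)

min_eig = np.linalg.eigvalsh(H).min()
lam = max(0.0, 1e-3 - min_eig)        # shift so the smallest eigenvalue is >= 1e-3
H_pd = H + lam * np.eye(2)

print(np.linalg.eigvalsh(H_pd))       # all positive: safe to use H_pd^{-1} grad
```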
But then there's a second issue. We were trying to make sure the learning rate actually gets you to the optimum — is that really necessary? Consider a loss surface with many local minima, and say your initial point is where the arrows begin. Do you really want to converge to the nearest local minimum? Not really — initially you want to bounce around and get out of poor local minima. If the surface were a single quadratic, even a too-large step size would keep you inside the bowl; but if it's not a quadratic and you have all these ugly local minima, a large step size will carry you out of any local minimum and let you wander the space at large. So: a step size η greater than 2η_opt will help you escape local minima, but you don't want it to remain greater than 2η_opt all the time, because then you'll never arrive anywhere — it will never converge. The solution: start off with a large η, more than twice η_opt, and then shrink it, smaller and smaller. Eventually, at some point, η becomes small enough to be less than twice the η_opt of the current bowl; you end up trapped in a sufficiently large bowl, and when you get to its bottom, that's almost certainly a local minimum. In fact, if you shrink η slowly enough, there are some guarantees. Compare this with just picking random points: you would never know where you are, and you'd need some test of whether you're in a bowl, which doesn't really exist — shrinking the step size is the more formal version of this idea: as you go, the step size keeps decreasing, so eventually you begin converging. Could you still land badly? It's possible but, in statistical terms, unlikely — you can keep bouncing around and, just by chance, once η is small, end up in some tiny little crater. It will happen, it's not that it won't, but very, very infrequently.

So how exactly do you decay the step size? Various schedules have been proposed in the literature, but here is a standard pair of criteria with provable properties: (a) the sum of your step sizes should be infinite, Σₖ ηₖ = ∞, and (b) the sum of the squares of your step sizes should be finite, Σₖ ηₖ² < ∞. Why these two? If your step sizes don't sum to infinity and your derivatives are bounded, then wherever you start, I can draw a radius around the starting point and guarantee you will never search outside it — so you cannot scan the entire space for the optimum. Think about it: if my derivative is upper bounded by 10 and the sum of the ηs is one million, the farthest I can get from the starting location is ten million. If the optimum is at ten million plus epsilon, I'm never going to find it. So the sum of the ηs must be infinite. On the other hand, you want the ηs to shrink, and one way to say that is that the sum of the squared ηs must be finite. Can you give me a function of k with this property? ηₖ = 1/k: the sum of 1/k diverges, while the sum of 1/k² is finite. Typically, most training algorithms start off with a large η, hold it for some time, then shrink it, and the overall schedule satisfies both conditions — a quick numeric check is below.
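A tiny numeric check (my own illustration) of the two conditions for the schedule ηₖ = 1/k:

```python
# Partial sums for eta_k = 1/k: sum(eta) grows without bound (~ log K),
# while sum(eta^2) stays bounded (-> pi^2/6 ~= 1.645).
for K in [10, 1000, 100000]:
    s1 = sum(1.0 / k for k in range(1, K + 1))
    s2 = sum(1.0 / k**2 for k in range(1, K + 1))
    print(K, round(s1, 3), round(s2, 3))
# s1 keeps growing with K; s2 saturates near 1.645.
```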
So, to recap: gradient descent can miss obvious answers, and we've seen various convergence issues — the loss surface may have saddle points, vanilla gradient descent won't always converge, and so on. The optimal learning rate for one component may be too high or too low for others. Second-order methods normalize the variation across components, mitigating the problem of different optimal learning rates, but they require computing inverses of second-derivative matrices, which is not feasible and may not even be stable. Divergence-causing learning rates may not be a bad thing for ugly loss functions, and decaying learning rates provide a good compromise between escaping poor local minima and converging.

Now, many of the convergence issues we found arise because we force the same learning rate on every parameter of the network. So let's take a step back: instead of one global rate, can I just have different step sizes for different parameters? There are problems with this. (a) You have to keep track of that many separate step sizes. (b) Even for a quadratic — say one at an angle to the axes — the directions are coupled: taking a step along one direction changes the value of the function along the other. So when you begin locally tuning the step sizes for the various directions, they begin conflicting, convergence can't be guaranteed, and bad things can happen: you take a step along x and things look better; then you take a step along y and things look better there, but the step along y destroyed what you did along x. You can't simply ignore the coupling. Nevertheless, various algorithms have been proposed that use the derivative information for trends without using it absolutely: the derivative tells you which direction to move in to reach a better solution, so these algorithms retain that direction information but throw away the value of the derivative itself. The two most popular are Rprop and Quickprop. They're trivial to implement and amazingly effective; they've been around for thirty years, and I'm surprised people don't use them more often. I'm not actually going to cover Rprop — I was originally going to, but we're short on time — so the usual caveat: I suggest you study Rprop and Quickprop, because we will have a nice little problem on the quiz that makes sure you've understood them. A minimal sketch of the Rprop idea follows.
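Since Rprop isn't covered in the lecture itself, the following is only a textbook-style sketch of the core idea — per-parameter steps adapted from the sign of the gradient alone — with constants of my own choosing; see the original Rprop paper for the full algorithm:

```python
import numpy as np

# Simplified Rprop: per-parameter step sizes adapted from the SIGN of the
# gradient only; the magnitude of the derivative is thrown away.
def rprop(grad_fn, w, steps=50, inc=1.2, dec=0.5, d0=0.1,
          d_min=1e-6, d_max=1.0):
    delta = np.full_like(w, d0)        # one step size per parameter
    g_prev = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        same = g * g_prev > 0          # gradient kept its sign: speed up
        flip = g * g_prev < 0          # sign flipped (overshot): slow down
        delta[same] = np.minimum(delta[same] * inc, d_max)
        delta[flip] = np.maximum(delta[flip] * dec, d_min)
        w = w - np.sign(g) * delta     # move by delta in the downhill direction
        g_prev = g
    return w

# Badly conditioned quadratic: curvatures differ by a factor of 100.
grad_fn = lambda w: np.array([1.0, 100.0]) * w
print(np.round(rprop(grad_fn, np.array([5.0, 5.0])), 4))  # -> close to [0, 0]
```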
So Rprop- and Quickprop-style methods, which decouple the dimensions, can improve convergence. Now let's take a closer look at the convergence problem with dimension-independent learning rates: the solution converges in some directions but oscillates in others. Here is a very common fix — something that actually works, and provably so, at least for quadratics. You keep track of the oscillations: emphasize steps in directions where progress is steady, and shrink steps in directions that oscillate. These are the so-called momentum methods, and the most common way to realize them is to maintain a running average of all the past steps. On the slide, the dark red line shows the actual sequence of steps; if you keep a running average of everything you've done so far, you will notice that in the x-direction you are swinging back and forth while in the y-direction you are progressing slowly but consistently. The simple logic, then: reduce the effective step in the x-direction to damp the swinging, and increase it in the smooth y-direction. The running average gives you this for free: if I'm bouncing around in the x-direction, the positive and negative steps cancel in the average; if I'm consistent in the y-direction, the steps add up. So in directions where convergence is smooth, the average is large; in directions where it oscillates, the average cancels out — and you can use this behavior to set the effective step size in each direction.

Here's the basic momentum update rule, which maintains a running average of all the gradients up to the current step. With standard gradient descent you get the behavior in the left figure: the direction you move in is always orthogonal to the equal-value contour. With momentum, on the right, each step is a linear combination of the previous step and the current gradient:

Δw_k = β Δw_{k−1} − η ∇_w f(w_k),    w_{k+1} = w_k + Δw_k.

You're still moving against the current gradient, but you no longer turn based on the gradient alone — you average it with the previous step, and maintaining a running average in this way amounts to computing a (decaying) global average, as we know. As a result you cancel out oscillations in some directions and enhance the step in others, and this averaged step is the correction you apply to the parameter value. It's a trivial thing with some very nice properties — several of them provable under conditions — and it gets you to the solution much faster.

How does it change the overall algorithm? Training with plain gradient descent, you compute the gradients and subtract some step size times the gradient. With momentum, at each step you compute the current derivative, combine it with the previous step to form the running average, and add that to the current estimate to get the next estimate. Pictorially: say you've made a move and arrived at the second dot. There you compute the gradient, but you don't just follow it — you take a scaled version of the gradient plus a scaled version of the previous step, and that gives you your resultant step. Having arrived at the next point, your previous step is now the horizontal line; you compute a gradient roughly orthogonal to it, and the combination points slightly upwards — that's where you move. So the actual step has two components, and the black line breaks into two: first you move along the gradient (blue), then along a scaled version of the previous step (red), and that gets you to the next estimate. A sketch of the update is below.
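A minimal momentum sketch (quadratic and constants invented for illustration):

```python
import numpy as np

# Momentum: the step is a running average of past gradients.
# v_{k+1} = beta*v_k - eta*grad(w_k);  w_{k+1} = w_k + v_{k+1}
A = np.diag([1.0, 25.0])               # elongated quadratic f(w) = 0.5*w^T A w
grad = lambda w: A @ w

w = np.array([10.0, 1.0])
v = np.zeros(2)
beta, eta = 0.9, 0.03
for k in range(200):
    v = beta * v - eta * grad(w)       # fold the current gradient into the average
    w = w + v                          # step along the running average
print(np.round(w, 4))                  # close to the minimum at [0, 0]
# Oscillating components of the gradient cancel out in v; consistent
# components accumulate, so progress along the shallow direction speeds up.
```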
It turns out that if you reverse the order of these two operations, things get much better. This is Nesterov's accelerated gradient. It's a brilliant paper — I don't remember exactly which of Nesterov's papers it was — that starts off with "obviously," makes a statement, and then has a page and a half of math which ends with QED. It took me two days to understand why that obvious thing was obvious, and it's the first line of the paper. But that's an issue with mathematicians, particularly Russian mathematicians: what's obvious to them may not be obvious to you. Anyway, here's how it works. Instead of computing the gradient at the current location, you first extend the previous step — move forward by a scaled version of it — then compute the gradient at the point where you arrive, and that gives you the resultant step, the final update:

Δw_k = β Δw_{k−1} − η ∇_w f(w_k + β Δw_{k−1}),    w_{k+1} = w_k + Δw_k,

and then you make the correction exactly as before.

There's a very nice comparison of the two by Hinton. We already saw that momentum is better than simple gradient descent. The blue lines show the momentum method: you've taken an initial black step; you compute the gradient there (the short blue segment), then add a scaled version of the previous step, and end up on the longer blue line going upward — and extending this over many steps, it takes a while to reach the optimum. In Nesterov's method, you extend the black line first, then compute the gradient, then take the final green step — and you can see it gets to the optimum much faster. Nesterov's proof shows that this is, in a precise sense, the optimal thing to be doing.

Training with Nesterov's method changes the arithmetic a little, because of this peculiar ordering where you first extend the current step and then compute the gradient to obtain the final correction. What you do is: first take the partial step, w ← w_k + β Δw_{k−1}; then compute the gradient at that point and take a step against it; and then record the overall modification Δw_k for the next iteration. A sketch is below.
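And the corresponding Nesterov sketch — identical to the momentum sketch above except that the gradient is evaluated at the lookahead point:

```python
import numpy as np

# Nesterov's accelerated gradient: extend the previous step FIRST, compute the
# gradient at the lookahead point, then correct.
# v_{k+1} = beta*v_k - eta*grad(w_k + beta*v_k);  w_{k+1} = w_k + v_{k+1}
A = np.diag([1.0, 25.0])
grad = lambda w: A @ w

w = np.array([10.0, 1.0])
v = np.zeros(2)
beta, eta = 0.9, 0.03
for k in range(200):
    v = beta * v - eta * grad(w + beta * v)   # gradient at the lookahead point
    w = w + v
print(np.round(w, 4))                         # reaches [0, 0] faster than momentum
```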
So — momentum and trend-based methods; we'll return to these soon. The story so far: we've been talking about convergence. Vanilla gradient descent may be too slow; gradient descent can miss obvious answers, and we saw that this may be a good thing; it may be too slow or unstable due to differences between the dimensions. Second-order methods can normalize the variation across dimensions, but they are complex. Adaptive or decaying learning rates can improve convergence. Methods that decouple the dimensions, like Rprop and Quickprop, can improve convergence. And momentum methods, which emphasize directions of steady improvement, are demonstrably superior to the other methods.

Coming up in the next class — actually, I'm going to be covering some of this on Friday in a recorded lecture, simply because I missed the previous one, and the final lecture on training will be next Monday. By the way, I hope you attended Mike's lecture; if you haven't, please watch his video — Mike is one of the best in the field. He's also very open to doing projects, and if you work with him, he has some very interesting, very fundamental ideas on computational modeling of the workings of the brain. So reach out to him; I'm sure he'd be very happy to work with you, and you'd have high odds of a high-quality publication in a somewhat orthogonal direction. Everybody is publishing at ICLR or ICML, but this would give you exposure to a different way of thinking, which could be very interesting. Questions? No? Thank you — and in the next class I hope to see a few more people. This is really sad.
Info
Channel: Carnegie Mellon University Deep Learning
Views: 4,990
Rating: 4.9436622 out of 5
Id: sd7qhTKIi4Y
Length: 82min 15sec (4935 seconds)
Published: Wed Sep 18 2019