Unconstrained Optimization

Video Statistics and Information

Captions
hello everyone and welcome to another video so today we are continuing our discussion on optimization by focusing on unconstrained optimization so this is the second video in our series focusing on mathematical optimization the first video was where we introduced the concept of optimization and laid some of the groundwork and necessary infrastructure for this discussion so if you haven't had a chance to see our introduction optimization video please take a moment stop watch that video first before continuing on so if you've already watched it i'll assume that you're familiar with some of these concepts and what i want to talk again about today is like we said we're going to be focusing on unconstrained optimization so the outline is i want to look at a very simple one-dimensional example to start off which we will then extend to the concept of finding higher order extrema and aka higher order local mins max global men's maxes things like that and that's going to set the stage for trying to understand what are some of the conditions necessary for optimality once we understand the conditions of how to find optimality in an unconstrained problem that's going to lead to a discussion on how to go about rigorously and analytically attempting to find these minima we're going to see that while this is going to work nice and theory there may be some practical implementation issues that i want to cover at the end of the discussion so that's the plan so if that sounds like more fun than a seasons past at disneyland let's jump over the whiteboard and get started all right so let's do a quick refresher on uh what do we mean when we say unconstrained optimization so if you remember from our previous discussion where we introduced the concept of optimization we said the whole idea with optimization is that you have some cost function f of x right and alternatively we said this is sometimes referred to like we said as your cost function j right and the idea with optimization is i want to try to minimize this function um subject to certain constraints so i want to say i minimize this cost function such that uh let's just say i'm going to call it like constraint one constraint all the way to constraint and are all satisfied right that's the overall idea with what we called earlier constrained optimization and the thing that made it constrained was the fact that you had these constraints being active right so again this was the problem set up you want to try to find the minimum value of this cost function such that these constraints are are satisfied now if we don't have any of these constraints right these are all gone now we have an unconstrained optimization problem right because they are literally no constraints so the problem now becomes sometimes you'll see this thing written as minimize this cost function such that this x value it's it's you're completely free to pick whatever you want right because there's no con the feasible set is the entire rn space so you'll sometimes see this is x needs to be in rn so again this is basically saying you are free to choose x to be any vector decision vector you want so there's literally no constraints therefore it is an unconstrained optimization problem right and again what you'll typically see is i probably should have drawn this a little bit over let me let me scoot this over just to have a little bit more space okay so if you solve this you will find the minimal cost of this cost function right that is j star so in this case j star is the minimum cost function 
value right it's basically the lowest point on the cost function uh in the entire space right so no matter what x value you choose that's the minimum value now what's almost more important in optimization problems is not just finding the minimum cost function value but really i care about finding the minimum the x value that yields this minimum cost function value right so a lot of times what you'll see is uh is well maybe let's write this underneath here the almost more important thing is again i want to still try to minimize this cost function f x such that x is in r n right it can be anything it wants but i don't really care about j star i care about the x star value which corresponds to this j star cost function so sometimes you'll see this written as i really want the argument that minimizes this cost function that is x star right that is the optimal x so here x star is the optimal uh x value that yields j star right okay so the game plan for today is i want to talk about how are we going to find these x stars right how are we going to find these magic points which are minimums of this cost function as we as we move along all right um the other thing that maybe we should mention is again like we talked about in the previous lecture we're only going to be talking about minimization problems not maximization problems because as we saw earlier you can basically easily translate a maximization problem into a minimization problem by just multiplying the objective function by a negative one right so for the for the most of the rest of the discussion today let's just focus on minimization problems so um the idea with these are actually really simple right let's think about a one-dimensional case to start and i think you'll see that this is a you know conceptually unconstrained minimization of a one-dimensional cost function is pretty pretty simple um again at least conceptually right so the picture that goes along with this is again if you have some x and then you've got a cost function value f of x right and if this thing looks something like i don't know i'm just going to make this up [Applause] right something like this right um conceptually this is not hard to think about what we're trying to do uh all we're trying to do is we're trying to find minimums of this cost function so that looks to me like here's one right here's a whole set of them right and then here's another one right again visually it's very easy to pick up where the minimums of these functions are right of this function all right so here is a uh sometimes what's referred to as a strict local min or a minimum right here this entire region of x this here any of these values is a local min right again to contrast that with strict local minimum where there's just basically a single point versus any of these points there's a whole set of these points which are a local min and then over here is again maybe what we might refer to as a global or actually maybe to add even more adjectives how about a strict global minimum right because it appears to be the minimum over the entire domain of x and remember but for our unconstrained problem x is the entire space the one-dimensional number line is we're allowed to pick any value in there right so uh these are all these x-star points that we're talking about right here's x star any of these points could be an x star this one point over here could be an x star right um and conceptually i think everyone will probably remember that from elementary calculus or by just looking at this graph 
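Compactly, the unconstrained problem and the two quantities just described can be written as

\[ J^* = \min_{x \in \mathbb{R}^n} f(x), \qquad x^* = \arg\min_{x \in \mathbb{R}^n} f(x), \qquad J^* = f(x^*). \]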
right we see that the property of all these x stars is that the slope or the first derivative of the cost function at these points x star is equal to zero right so we know all you need to basically do is find points that satisfy df dx right evaluated at this point x star is equal to zero right that's basically all we're doing right is you're finding locations where the slope is equal to zero right now the only slight issue with this right is okay if we go around and find where the slope is equal to zero the places that uh i may be kind of kind of hand waved over are look at this up here right isn't this also something that satisfies one of these x stars right at least it satisfies this first order check right so we see that both minimums and maxima right both have the property that the first derivative of the function at that location is equal to zero so now we have to go one step further and ask ourselves well how do i differentiate between the red x stars and the green x stars right i think again basically from elementary calculus i think everyone will remember okay i can't just look at the first derivative any longer i need to now apply a second derivative test right so let's go ahead and apply the second derivative uh test right where we basically now that we have these x star values we need to ask ourselves okay what does df uh dx squared look like at x star right what's the second derivative of it well if this is less than zero right it basically means what does that mean we have a kind of like this concave down shape basically meaning that what this x star is a local maxima right so if it's less than zero maybe we should say it like this so if this is zero less than zero then f has local max at x star right and maybe we'll uh let's see yes and then similarly if the second partial derivative our second derivative sorry we're not at the partials yet we're getting there right if this is strictly greater than zero then f has a local min at x star right because now in this case you have a concave up sort of picture right and then finally if the derivative the second derivative of this at the point x star is equal to zero then the test is inconclusive and we may have to do additional checking to see what exactly is going on and we're going to look at an example of that in fact why don't we look at an example of all of this right now so let's go ahead and consider let's make up a function so how about some cost function that looks like f of x is uh negative x squared minus x cubed over three plus x to the fourth over four okay this is a the function i want to to take a gander at okay and uh just for giggles maybe let's go ahead and draw it over here just so we can kind of visualize what it looks like so let's look at f of x versus x and i think what this thing looks like i'm going to just try to sketch it oh boy let me see how am i going to do this without completely butchering this i guess later on i'm going to show you a screenshot so you don't have to watch me try to fumble along with hand sketching things but for now just so we can talk about this conceptually this function sort of looks like [Applause] okay so something something like this again i'll please apologize i'm gonna apologize in advance for this butchering of the actual function but it looks like this right so here's what the function looks like okay so the first thing we got to do is we have to find these green let's go let's put them in green here so green point oh jesus was supposed to be lined up with minus one let me 
let me see if i can make this a little bit better [Applause] okay so there's a green point right here there's obviously a green point right here where the slope is zero and then there's another green point down here where the slope is zero okay so we have to go ahead and find these so again to do that all we need to do is take the derivative of this and set it equal to zero right isn't that the first step okay so let's do that so df dx right just take the first derivative of this function uh i think you're going to come up with this is uh minus 2x minus whoops minus x squared plus x cubed okay all right so now solve for the roots of this cubic polynomial and we end up with there are three possibilities for x star so let's call this x star i don't know solution one this is going to be here at negative one then you have x star solution two there's another one here at zero and then finally x star solution three is at positive two right so again that's for these green dots we see that at those places of x is minus 1 0 and positive 2 the function's slope is equal to 0. okay so great now that we have these three uh points we can now go ahead and apply our second derivative test to figure out are these local minimum maxima or something else okay so in order to do that i need to calculate the second derivative of f with respect to x okay so the second derivative again just take the derivative of this thing right so we end up with what negative two minus two x plus three x squared okay so uh let's go ahead and plot those actually gosh sorry i probably should have plotted these two let's plot both of these i should have plotted df dx sorry let's do that right now df dx versus x right if you plot that this thing looks like what it's got a pretty negative slope here and then we know that it's got to hit zero here and then it's got to hit zero here again and it's got to hit zero over here okay so we know the function passes through here it looks basically like this okay so that's the second or the first derivative now let's also plot the second derivative here so this is d squared f dx squared of x versus x okay and if you plot this again it's going to look something like like this okay so again green critical points are here here here and then again here here and here okay so if we go ahead and evaluate this function at these starred locations we can have df dx squared at x star solution 1 right if you just plug in the values of negative 1 into here i think we end up with g wiz what do we end up with i just want to make sure for completeness this comes out to positive 3 and then the df squared dx squared at x star solution uh 2 comes out to negative 2 and then finally df squared dx squared at x star solution three this comes out to positive six okay so we see that according to our second derivative test this first x-star solution is positive so it's here so this has to be a local minima right so this one is a local minima right similarly this one right here its second derivative is strictly negative so we end up in this first case so we see that it's a local maxima right and finally this third solution is the same situation as the first so this is a local minima right so again that makes sense we have a min we have a max here and then we have another minimum right there right and again all of this jives this all seems to make sense and this is somewhat reasonable um now what's interesting though is if we change the function just slightly we can get different results so for example i'll leave this as 
sort of a fun exercise to the reader in fact maybe what we'll do is let me just write down the new function and then we'll just jump over to mathematic and maybe we can look at all these plots at the same time but let's go ahead and change this if you change this function to something else let's say here let me let me just have it up here so this is x minus 1 cubed times the quantity x plus 2. okay so let's change the function go through this entire analysis in mathematic and we'll look at how the plots change and see how this one actually ends up with an inconclusive test okay all right so let's jump over to mathematic and do this quickly all right so here we are in mathematica and uh i've just plotted the three plots that i was trying to sketch on the board hopefully this is a little bit more clear but again here you see the cost function f of x and then we took its derivative and we solve for where it's equal to zero and as we saw that's at negative one zero and two and then by the second derivative test we see that at x star of negative one we have a positive second derivative and then at x star is zero we have a negative and then at x star of positive two we have a positive second derivative so therefore this is like we said uh minima maxima minima right now let me scroll back up and let me redo this analysis script using the updated cost function we were just talking about so let's go ahead and do the analysis and go ahead and see all right here's what the first derivative of the cost function now looks like we then go ahead and solve it for uh the values of x which which yield zero or so these are the x star location so we have negative 1.25 positive one and oh look it's interesting it's actually repeated there are two roots here at positive one so really there's only two of these points we need to look at negative five fourths and positive one okay uh now what we can do is i guess this is just for giggles we can evaluate the cost function at these locations just so we can see where the cost function is sitting it really doesn't matter um because we're not interested in that instead what we are interested in is like we said the second derivative so if i take the second derivative of the cost function with respect to x you get this and now evaluating this second derivative at these x star locations of negative 5 4 and positive 1 we see we end up with a positive number so we know that this location here at negative five fourths is a what is that that is a concave up so this has got to be a local minima here right so this is a minima but now the second derivative at this other location of x star of positive one they're both zero right it's 0 at that location so this is inconclusive at this point and the reason why is if you want to look at the graph let's look at what it looks like now so again here this top graph is the cost function right and here's the first derivative so we clearly see here's the two roots of the first derivative so here's the x star at minus 1.25 we clearly see that is a local minima because the second derivative of it is strictly positive at that location but this critical x star location of positive one we see that the second derivative is it's exactly zero and what that corresponds to as you can kind of see here is this sort of one-dimensional saddle point effectively where if you go in this direction the function appears to be decreasing but if you go in the other direction the uh function appears to be increasing so we get this sort of inflection point 
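A minimal Wolfram Language sketch of the two one-dimensional checks narrated above; the functions come from the narration, and the notebook shown in the video may differ in its details.

f[x_] := -x^2 - x^3/3 + x^4/4;                         (* first example cost function *)
xStar = x /. Solve[f'[x] == 0, x]                      (* stationary points: -1, 0, 2 *)
f'' /@ xStar                                           (* second derivatives 3, -2, 6 -> min, max, min *)

g[x_] := (x - 1)^3 (x + 2);                            (* modified example *)
xStarG = DeleteDuplicates[x /. Solve[g'[x] == 0, x]]   (* stationary points: -5/4 and 1 (a repeated root) *)
g'' /@ xStarG                                          (* 81/4 -> local min; 0 -> test inconclusive (inflection) *)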
notation so we see that it's a little bit unclear what's going on in this case so again just a little bit of a note we have to be a little bit careful in some of these situations where the second derivative is equal to zero okay so that hopefully sets the stage and uh gives us a rough idea of how this works in one dimension what if we want to extend this to a multi-dimensional function all right so i think we understand how the one-dimensional example uh and concepts work let's see if we can extend this now to higher dimensional functions so uh what do what do we mean when we have higher dimensional extrema namely higher dimensional maxima or minima so um let's just start with a general definition of uh what an extrema might be so let's consider some region b where this function f our cost function is defined over so if the cost function is basically increasing or larger than the cost function value at some point x star for every single value of x in this set b then that basically implies right that this point x star is is it's a minimum over this set b right doesn't that make sense because anywhere you move away from x star as long as you're in this set b the cost function value stays the same or gets higher right or increases right so x is basically a minimum uh at that location right and similarly you can you can flip this around if f of x is less than or equal to x star for all x and b then that means x is a maximum right in b right it's a very similar idea it's basically saying that you're no worse than your neighbors in this set b right as long as you're at this point x star right so now let's uh this is sort of a more general definition that might be useful later when we started talking about constrained optimization because this set b might be something odd or uh you know a a constrained set but again we're talking about unconstrained optimization so this region b is really the entire space uh all right so we can start talking about definitions in a little bit more uh simplified term so basically what we can do is we can do something equivalent where we can start asking can we quantify that the function value doesn't get worse in some neighborhood so what we can do is we can basically say that for unconstrained problems situations right let's go ahead and say that as long as the cost function value at that location is less than or equal to f of x right for all x in uh whoops sorry sorry with all x with the norm of x minus x star less than epsilon so uh what we need to look for here is for an unconstrained situations there needs to exist um and some epsilon greater than zero right such that this is true and again this is basically saying this situation up here right that as long as you are within some finite neighborhood of x star right x is going to be within some some ball or some neighborhood of x star for any value that you go in there the cost function value increases right so what this is basically saying is that uh this is basically the concept of an unconstrained local minima right so this is now basically if this is true this implies that right x star is an unconstrained oops local minima right now and we can extend this idea that if this is true if this is this is also true for everything right so if you say this is true for all x in the entire space rn right this now implies that x star is an unconstrained global minima right so again all this is basically saying that is if you're sitting at some location x star right and you have some cost function value of f of x star 
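In symbols, the neighborhood definitions just given are: x* is an unconstrained local minimum if there exists some epsilon > 0 such that

\[ f(x^*) \le f(x) \quad \text{for all } x \text{ with } \|x - x^*\| < \varepsilon, \]

and an unconstrained global minimum if the same inequality holds for every x in R^n.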
right and now you're going to move in any direction away from this location x star right um we want to see that the function value is going to increase either within some neighborhood right if this is your epsilon neighborhood that means if as if no matter where you go in this neighborhood the cost function value increases you're at the basically you're at the bottom of the bowl all right you're at the bottom of cost so this has got to be an unconstrained local minimum and again if your bowl is the entire space that is the definition of a global minima right okay so with this kind of concept in mind we can start trying to formulate what are some of the conditions of optimality for the unconstrained problem all right so now that we've got the concept of just what is a higher order min or max let's talk about how we're going to generate some conditions uh that go along with those uh locations uh so quick side note we are going to be leveraging taylor series uh framework and infrastructure and tooling for the majority of this discussion here on conditions for optimality so if you haven't seen our previous video discussing the taylor series please take a moment pause go back watch that video because like i said we're going to be leveraging a lot of those tools namely what we said is if you have some function f right and we want to go ahead and um approximate this function we can approximate this function by doing a taylor series expansion of this function about some point a and this is given as the function evaluated at the point a plus the uh gradient of the function uh at the location a and again for dimensionality with respect i'm going to go ahead and transpose this and then we say x minus a right plus you can keep going with higher order terms right in fact we saw in the taylor series that this goes on for infinity right this goes on for a lot of terms let's truncate this at a first order approximation which is why i've got the uh approximate symbol right here okay and furthermore this expansion point a it can be anywhere let's go ahead and make this expansion point a to be i'm just going to rename it to x star right it's just some candidate location or candidate place that i want to expand the function about okay all right now again what we're interested in is let's assume what if this x star is a is one of these optimal uh min or max locations right is there something special about the system at this point x star okay so to do that let's go ahead and first consider small variations from this location uh x star so let's go and consider small variations from x star namely i want to move a little bit away from x star so in other words let's go ahead and say i want to look at locations x which are basically x star plus some small delta x right so here this is the small variation or the small delta away from xr so i'm considering locations x which are at the location x star plus a small delta so let's go and substitute this in for x in our previous expression and see what we end up with right so we end up with what this is f of um x star plus delta x right is approximately equal to f of x star right plus the gradient of f of x star transpose times okay i got to substitute the x here because i want to look at the location x star plus delta x minus x star great okay so all right this cancels this cancels also let's go ahead move this to the other side right so we can basically write this thing as f of x star plus delta x minus f of x star is approximately equal to the gradient of f of x star 
transpose times delta x okay now let's take a look at this this left hand side of the equation right this here is the cost function value at a location delta x away from this from from x star and here's the cost function value at x star so this over here is really the entire change right in cost function value right um when you move delta x away from x star right so if you move delta x away from x star and the cost function is going to change this number can either be positive or it can be negative or it can be zero it can stay the same right so the left-hand side is basically measuring the the change in cost function value right now if you think about that long enough basically in combining that with our discussion earlier about this the idea of you want to make sure that the cost function value in order for it to be a minima you got to make sure that the cost function value increases or stays the same no matter which direction you move away in the neighborhood of x star right so long story short we need this left hand side right this has to be uh it's got to be great it can't be negative right so this has to be all this has to be greater than or equal to zero right we need this whole left side to be greater than or equal to zero in order for this x star to be a minimum right now what's interesting is you come over here and you look at this right and we see that okay delta x is arbitrary right this is our b tray right you can move in any direction away from x star so the only way that this whole thing is going to be greater than or equal to zero right is we actually need this gradient to be exactly equal to 0 right because this is a fixed constant right this is this is the constant right it's the gradient at x star right this cannot change but this is completely free to change it can be anything it could be positive negative it could be any direction right so the only way the left hand side is always going to be guaranteed to be greater than or equal to zero is if this is exactly equal to zero right okay so uh that basically leads us to to this condition right we need the gradient of f of x star to be equal to zero right and again this is a zero vector right remember our discussion on gradient right this is a basically a n dimensional vector it's got to be equal to as n dimensional vector of all zeros right so if this is the case all these x stars that satisfy this relationship are referred to as stationary points so we should probably write that down so let's go ahead and um call uh we'll make a definition here of stationary points right these are these very special points these are defined as all of the x's such or i guess we should call x stars maybe that's the use similar notation right such that the gradient of f of x star is equal to zero right um again very important concept uh stationary points are basically any magic values of x star such that the gradient is equal to zero sometimes you may see these also referred to as critical points that's just a different nomenclature i think an optimization stationary points are is usually the more common uh nomenclature in verbiage so you'll probably see that but again the concept is is actually really not that difficult and actually if you think about this long enough compare this to our one dimensional example right we talked about in that 1d case the way we find these minima or maximus are we look for where the first derivative is equal to zero right isn't that exactly what the gradient is the gradient is basically a multi-dimensional 
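Writing out the first-order argument compactly: expanding about a candidate point x* and stepping a small amount delta x away gives

\[ f(x^* + \delta x) - f(x^*) \approx \nabla f(x^*)^{\mathsf{T}} \, \delta x . \]

For x* to be a local minimum this change must be nonnegative for every direction delta x, which forces the gradient to vanish,

\[ \nabla f(x^*) = 0 , \]

and the points satisfying this are the stationary (critical) points.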
first derivative you can kind of think of it that way right so this concept of stationary points of finding where the gradient is equal to zero is basically the higher multi-dimensional equivalent of finding where the derivative of the function equals zero right so again this has nice parallels back to the one-dimensional example we're just extending it for a multi-dimensional function at this point so the 1d test or the first derivative test so to speak is now a gradient equal to 0 test that is going to allow us to identify these critical points right so um what this basically gives us here is an interesting uh corollary that we can basically make the the case here that if you have x star uh is an extremum right basically again extreme this could be a max or a min we don't quite know just like how we didn't know in the 1d case if this was a maximum just looking at the first derivative but what we do know is if this is an extremum this does imply right that the gradient of f of x star is equal to zero right now we got to be a little bit careful that this is a sort of a one-dimensional implication meaning that it's a necessary condition but it's not sufficient what i'm basically saying is that the opposite is not true so if you have gradient of f of x star is equal to zero this does not imply maybe maybe i should let me just write this out this does not i'm going to underline that imply that x star is an extremum right you might be scratching your head but again this is basically what we looked at earlier with that condition of the saddle point i mean a very quick example of this let's go let's do a one dimensional simple case of this the the classic example of this is f of x is just equal to x cubed right here's your cost function value so we can easily find the stationary points of this so again the gradient of f of x in this case it's the same thing as df dx right so that's just what three x squared so it's very easy to find our stationary points right so x star is basically just zero right that's the only stationary point okay so now you look at the gradient here and you clearly see that yeah the gradient of f of x star is basically zero right that's that's exactly by construction what it means so you plot this function right of f of x versus x right and this thing looks like this right so we're looking at this critical point or the stationary point right here and we see that that is not an extremum right it's this weird kind of inflection point or saddle point here right so we see that even though the gradient is zero it doesn't imply that this is an extreme here's your counter example right here right so again this is a one-dimensional case the multi-dimensional case is um that these points where you have x where the gradient of f of x star is equal to zero and you can run into situations where it is not an extreme just like we saw here in in in higher dimensions these are typically referred to as saddle points i'll show you an example of it here i'll just flash up a screenshot of a case where as you can see in one dimension or in certain directions the function increases in other directions the function decreases and a lot of times these these surfaces that they generate look like a horse's saddle right something that you would sit on when you're riding along on a horsey um so this is where it gets its name is this is sometimes referred to as as a saddle point right so as we saw this first derivative test right this is equivalently the multi-dimensional level of the first derivative 
test isn't sufficient to tell us if the extremum is a maxima or a minima or something else entirely we need to go to a second uh basically a multi-dimensional version of the second derivative test that we looked at earlier okay so give me a second to erase the board and uh let's do that next all right so the way we're going to go about doing that is using our good old friend uh the taylor series expansion and look again looking at small variations away from the expansion point which is our candidate critical or stationary point x star now the only difference is again this is the exact same thing we had on the board earlier except now instead of a first order approximation of the cost function at the point x star let's expand this to a second order so let's add on the second order term in our taylor series expansion right so that is um what is it it's 1 over 2 factorial times x minus x star transpose times the hessian now uh of f of x evaluated at x star times x minus x star right okay um plus again higher order terms right which we are going to go ahead and neglect okay because we're just looking at now our second order expansion of our higher order function uh f right expand it again around the point x star our stationary point okay so now let's do the same thing we did last time let's consider small variations away from x star by moving a distance delta x away from it okay so let's go ahead and plug that in so again we end up with a very similar uh expression um well i'll tell you what in the interest of not skipping steps let's just go ahead and do it right oops this is x star plus delta x right so that's the left hand side is approximately equal to f of x star right all this is the same that we saw last time so grad f of x star transpose times uh okay so x now is x star plus delta x right minus x star great plus okay now there's our second order term one over two factorial again x we are now substituting in x star plus delta x right minus x star transpose times now hessian f of x star times again this is now x star plus delta x minus x star great okay so now we again we see a lot of these x stars cancel all over the place okay and again let's move this term to the other side so we're going to move that over here so we end up with again f of x star plus delta x minus f of x star right is approximately equal to gradient of f of x star transpose times delta x plus now here's our hessian term right one over two factorial now this actually comes it comes out quite nicely this is right just delta x transpose times hessian of f of f x star times delta x right that's what we end up with right now again a couple things to consider remember earlier we showed that uh for these stationary points right the whole idea with the stationary points were that gradient of f of x star is equal to zero right so that knocks out this entire term here this is zero right and again same thing like we saw last time the left side this is basically the change in cost function right value as you move away from the point x star as you move away from one of these stationary points right so we end up with um again we end up with the change in the cost function is basically just equal to this term over here so we make the exact same argument that in order for this uh this location x star in order for it to be a local minima right we need the change has to be this whole side has to be greater than or equal to zero right therefore meaning that this entire term here this has to be greater than or equal to zero right 
because again x delta x can be completely arbitrary right so what that means is we need this left side we need the one over two factorial times delta x transpose great uh hessian f of x star times delta whoops oh sorry this is supposed to be a no no i have that right oh yeah yeah yeah times delta x this thing has to be greater than or equal to zero so again we can we can neglect this one over two factorial right i'll just move it off to the other side so really this is the condition we're looking for right so this is what we need in order for uh we need this to be true for all delta x right in order for this point x star to be a uh a local minima right so this is actually very interesting we should ask ourselves what is this thing right if you look at this long enough this is basically the condition on uh this is a a condition that on this hessian matrix so again just to double check uh again everyone should have watched the taylor series lecture already so everyone knows that this this hessian of f of x right this is basically a matrix it's a symmetric matrix of mixed partial derivatives right so the first one is this is the partial of f with respect to x one and then uh you do it again right so this is uh how how do we wanna write this uh so yeah maybe i should do it like this so this is df dx1 and then it's df dx1 again right and then you go df dx um this is the mixed partial so this is now yeah d x so this is now 1 2 right etcetera etcetera all the way down to d f d x n d f d x n right so it's basically it's a symmetric matrix of second derivative second partial derivatives is all the mixed second partial derivatives of it right but long story short it's just a matrix right so this condition where we need this matrix to be multiplied by any arbitrary uh vector delta x in the front and and behind of it right this is basically uh the equivalent condition of saying that this matrix the hessian has to be positive semi-definite okay so what that is it's basically saying that uh we're gonna have some room here yeah the condition of delta oops sorry that should be a delta right delta x transpose times the hessian matrix times delta x right this thing is greater than or equal to zero right for all delta x right this is basically equivalent and happens right if and only if this matrix of the hessian of x star is positive semi-definite right this is basically the definition of a positive semi-definite matrix is that no matter what you what vector if you pre-multiply and post-multiply that matrix by it the quantity you're left with is always going to be positive or equal to zero right that's the definition of a positive semi-definite matrix right so um what we're seeing right here is uh once again we're basically seeing a multi-dimensional higher level uh discussion of what we saw earlier as that second derivative test right so this is basically now uh yeah it's it's the second partial derivative test is now we need to look at this this hessian matrix and try to assess is it positive semi-definite is it negative definite and what's the definiteness of this hessian matrix and that hessian matrix is going to tell us um what is the status of this stationary point x star that we were looking at right okay so let's take a look at that now all right so let's collect all of this into a theorem for uh local optimality conditions for these unconstrained optimization problems so again we're going to consider our cost function which maps from r and r right to be uh twice differentiable right of class c2 
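At a stationary point the gradient term drops out, so the second-order expansion just derived reduces to

\[ f(x^* + \delta x) - f(x^*) \approx \tfrac{1}{2} \, \delta x^{\mathsf{T}} \, \nabla^2 f(x^*) \, \delta x , \]

and requiring this to be nonnegative for every delta x is exactly the statement that the Hessian \nabla^2 f(x^*) is positive semidefinite.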
okay and we again we want to consider the unconstrained optimization problem of just trying to find minima of this function over any particular values of x and rn right so then based on this discussion we just saw earlier is that a necessary condition for a point x star to be locally optimal is that is if it is a locally optimal solution right then we know that it's gradient is zero right we know that it's a stationary point and furthermore we know that the hessian at that point x star is positive semi-definite right the sufficient condition then is also that if x star is a stationary point and if the hessian is positive definite now in this case then we know that x star is basically a strict locally optimal solution right so if you combine all of this this is basically sometimes you'll see this written down as uh it's basically the second partial derivative test which is the higher order version of the second derivative test we saw earlier which is basically uh you can think about writing this down or trying to come up with a very simple way to check for these conditions it's it's pretty simple right so a very simple algorithm is say you know step one is find uh x star such that right gradient of f of x star equals zero right so basically step one is just find all the stationary points that you're interested in and then in step two right we're basically going to go ahead and cycle through all of these stationary points so let's say for all stationary points uh x star right we basically need to just determine the uh definiteness well let me spell this right def night this of the hessian right so we basically see that uh what if the hessian uh of f of x star is a positive semi-definite sometimes you'll see this written as like this kind of greater than or equal to maybe this greater than sometimes it's written in a little scripty fashion right but this is basically saying positive semi definite right that's just shorthand notation right so if this is the case then we know x star is a local min right and then also we know that if the gray the hessian is strictly positive definite right then we know x star is a strict local min right and then otherwise x star maybe we should write that down right otherwise x star may not be a min right okay so um again this is basically long story short this is kind of encapsulating this is sometimes also referred to as the second partial derivative test or sometimes referred to as second order optimality conditions right so um all of this is basically uh you can see pivots on the definiteness of the hessian right so we should maybe ask ourselves an equivalent question of okay uh how do we test the definitions of a given matrix right so um tell you what let's let's erase the boards and uh we'll investigate that now all right so like most things with matrices uh they seem to boil down to the eigenvalues so let's look at maybe theorem number two which actually relates the eigenvalues of a hermitian matrix with its definiteness so again let's go ahead and just say let's consider some matrix m it's just some square n by n hermitian matrix so again remember that this idea of hermitian is basically the complex extension of symmetric right which is definitely the case that we have with our hessian matrix right because our hessian is a uh set of mixed partial derivatives right we know that it's symmetric and actually furthermore for most cases um you know your cost function is going to be real valued so you're really going to be ending up with you don't have to think 
about hermitian right because we know that the uh the hessian matrix is going to be symmetric and real so basically this applies right again hermitian is just the extension of symmetric matrices so all of this is going to hold for our case so what is interesting about this is that this matrix is going to be positive definite if and only if all of its eigenvalues are positive right it's going to be positive semi-definite if all the eigenvalues are non-negative it's going to be negative definite if all the eigenvalues are negative it's going to be negative semi-definite if all the eigenvalues are non-positive and it's going to be indefinite if it has at least one positive eigenvalue and at least one negative eigenvalue so again just to maybe uh illustrate this let's write this maybe in green you know the other way you can interpret this table is again the the other alternative notation right for positive semi-definite we said was m is positive or sorry positive definite m greater than zero right and this is true if what if all the lambda i's are strictly greater than zero for all i right that's the first one okay then again positive semi-definite you may see this written as m greater than or equal to zero right so the matrix is positive semi-definite and again if and only if lambda i is greater than or equal to zero right for all i and again we can just fill out the rest of this table so negative definite is sometimes written as m less than zero right and this is true if lambda i's are all less than zero for all i and again m is less than or equal to zero or negative semi-definite if all of its eigenvalues are uh less than or equal to zero for all i right and again maybe you we should make a note since we're looking at the the sign s i g n of lambdas um you might be wondering well is this possible that uh you have complex eigenvalues and again we don't need to do that right we don't have to consider that because this hermitian matrix or this this symmetric matrix in our case right we know that the eigenvalues of a symmetric matrix are always real so you don't have to worry about complex eigenvalues so this should cover it for our situation right so what's interesting is uh the definiteness of a given symmetric matrix right is governed by the sign of all of the eigenvalues right so uh tell you what let's jump over to mathematic and maybe we'll look at what a couple of these functions look like if you have these different eigenvalue scenarios so we're going to look at a case of a positive definite matrix a positive semi-definite matrix a negative definite matrix a negative semi-definite matrix and an in indefinite matrix so tell you let's jump over to mathematica look at those in relationship to the to the eigenvalues all right so coming over here to mathematica let's just go ahead and define five uh symmetric matrices so again this first one we're gonna see has both it has two positive eigenvalues the second one has a positive eigenvalue and then one that is zero this third one has two negative eigenvalues the fourth one has a single negative and a single eigenvalue of zero and finally this fifth uh matrix has whoops sorry this should be i don't know why this is not sitting nicely but this fifth one should have a single positive and a single negative eigenvalue okay so now let's again look at what does the function x transpose hx look like so we can understand how these eigenvalues affect the shape of the scalar function and its definiteness so again let's go ahead i just made a little bit of 
code here i'm going to scroll past this because it's not terribly important but let's look at this first major symmetric matrix which again we see has two positive eigenvalues strictly positive eigenvalues here's what the overall scalar function of x transpose h x looks like and if we plot this you can see that pretty much for every single value of x that is not equal to zero this thing is greater than or equal to zero and in fact it's it's always increasing as you move away from zero right so again here is clearly a positive definite matrix which correlates with these two positive eigenvalues right okay moving on to the second matrix right we see that it has eigenvalues of positive four and exactly zero so in this case we have a positive semi-definite matrix meaning that if you choose any x uh x1 x2 that is away from the origin this function value is guaranteed to not decrease but it's not guaranteed to increase so again for example you can see that there are some directions for example along this line if we stare at it right down the edge you see that it basically stays flat right so there are some directions where you can move and the cost function value does not increase but it also does not decrease as well so we see there's there's possibilities for getting uh multiple local minima not strict local minima same thing with h3 now if you have two negative eigenvalues right we have a um a negative definite matrix right and then similarly for h4 where you have a negative eigenvalue and a single zero again we have a negative semi-definite matrix where again there are directions where the cost function does not decrease but it also does not increase and finally the situation with the fifth matrix where we have both a positive and a negative eigenvalue we see we end up with this is exactly the saddle point situation we were talking about earlier where if you move away from the origin in some directions the cost function increases in others directions the cost function can decrease so basically we notice that from this example that the behavior of the definiteness of the matrix and therefore its resulting function of x transpose h x changes depending on the eigenvalues and again this is basically what we tabulated on the board that if you have only positive eigenvalues you have a positive definite matrix if you or if you have eigenvalues which are greater than or equal to zero you have a positive semi-definite matrix right and if you have negative eigenvalues you have a negative definite matrix if you have less than eigenvalues are less than or equal to zero you have a negative definite negative semi-definite matrix and if you have a mixed set of at least one positive and at least one negative eigenvalue then this function of x transpose hx it could be positive or it could be negative or increasing or decreasing just depends which direction you're in so you have this kind of indefinite scenario okay so with that let's jump back over the board and uh combine theorems one and two okay so that is exciting because now that we can tie the eigenvalues of that matrix to its definiteness we can basically take these results and now come up with a fairly uh fairly easy to implement scheme for this second order derivative test right so this part here of trying to understand if the hessian is positive semi-definite we now see that the equivalent test is basically if we can write this if the hessian of x star right if all the eigenvalues are positive right or or positive or greater than equal to zero has 
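A small Wolfram Language illustration of the eigenvalue test for definiteness; the 2x2 matrices below are stand-ins chosen to have the eigenvalue patterns being discussed, not necessarily the exact matrices used in the video.

h1 = {{2, 0}, {0, 1}};     (* eigenvalues 2 and 1: all positive  -> positive definite     *)
h2 = {{2, 2}, {2, 2}};     (* eigenvalues 4 and 0: nonnegative   -> positive semidefinite *)
h5 = {{0, 1}, {1, 0}};     (* eigenvalues +1 and -1: mixed signs -> indefinite (saddle)   *)
Eigenvalues /@ {h1, h2, h5}
PositiveDefiniteMatrixQ[h1]                                   (* True *)
PositiveSemidefiniteMatrixQ[h2]                               (* True *)
Plot3D[{x1, x2} . h5 . {x1, x2}, {x1, -1, 1}, {x2, -1, 1}]    (* the surface x^T H x is a saddle *)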
all eigen values greater than or equal to zero right that's the same then x star as a local minimum so great this is a nice concrete way we can test this right you just calculate the hessian take its eigenvalues and check are all of them greater than or equal to zero if so then it's a local minimum and again if it's strict right again the equivalent test right here is if oh gosh i'm running out of space here i guess i'll write it above it sorry to be inconsistent but right i'll do it in green so hopefully this is all uh reasonable right has all eigen values strictly greater than zero right then you have a strict local minima otherwise you could run into these you know local maxima or uh local strict maxima situations so anyway this is a pretty reasonable approach right so all we got to do step one find all of the stationary points of the function and then we can do our our uh second partial derivative test by basically calculating the hessian and looking at its eigenvalues okay so um let's do that let's in fact let's take a look at an example to illustrate how this would work all right so let's take a look at an example so i'd like to consider this cost function here right it's a two-dimensional cost function with an x1 and x2 as the independent parameters right um so it's this and i just wrote it out in an expanded fashion down here it's going to make taking partials a little bit easier right so according to our scheme here uh step one is we just need to find the stationary points of this cost function so i'm going to first go ahead and compute the gradient so i'll take the derivative the partial derivative of this with respect to x1 that's this term and then the partial derivative with respect to x2 that's the second term so we get this system of two equations and two unknowns now we need to go ahead and solve for where uh is this gradient exactly equal to zero right we want to find out where the slopes are zero those are going to give us x star solution so if you solve this you throw it into mathematica and again um i'm going to defer that until a couple minutes we will go to mathematic and i'll show you that i didn't just make all these numbers up but i don't want to derive it here on the board in the interest of time but long story short you ask mathematica to solve this system of equations it'll come up with five solutions in fact the fifth solution is the same as the first solution so really we basically have four stationary points right um there's a stationary point here at the origin which is a stationary point here at zero negative one another one at positive one negative one and then the last one here is that three-eighths and negative three quarters so there are these four stationary points that we need to consider okay so we got our stationary points solved now step two is we need to go ahead and look at the definiteness of the hessian matrix at each one of those stationary points so first let's go ahead and calculate our hessian matrix again it's just our set of mixed partial second derivatives of the cost function so again in the interest of time i don't need to derive it here on the board feel free to go to a tool or calculate it by hand but this is what you end up with right again it's symmetrical so the 1 2 element is the same as a 2 1 element so therefore we know the eigenvalues of this thing are real and now what we need to do is we need to figure out what are the eigenvalues of this hessian matrix at these stationary points so again i'll save you the boredom we'll go 
over to mathematic and do it in a second but what you end up with here is at this first solution oh no my pen is running out of ink um we have gradient uh uh or sorry the hessian of x star and in fact the we have the eigenvalues is actually what we're interested in so let me put the lambda operator on this so we're taking the eigenvalues of this hessian matrix what you end up with here are uh where did i put it uh oh here okay you get this as uh uh yep here so these eigenvalues are actually also zero and zero so there's two eigenvalues so so here's lambda one and lambda two maybe i'll write this in a table format so we're not so it's not as confusing okay so for all of these different cases so now the uh at this stationary point right same thing we take the eigenvalues of the hessian at that new point and this is now going to be ah what do we end up with oh here negative one and one okay and then finally for the third point we end up with negative 2.4 and positive 0.4 ish okay and then finally this fourth one here is uh negative 0.75 and negative 0.28 great okay so now all we need to do is start walking through each one of these and applying our tests so we see in this case we are this is inconclusive right and in fact we're going to see that this is actually going to be a saddle point okay same thing here here we have a positive and a negative so this is also inconclusive and it's actually also a saddle point okay this one it's the same thing we have a negative and a positive so again it's a saddle point and now this one finally we have two negative eigenvalues so we know that the hessian is negative definite so this is a strict local maxima right or maximum i guess right okay so that's what we end up with the analysis tell you what let's jump over to mathematica and let's visualize the scenario let's plot let's plot the cost function and then we'll look at uh each one of the we'll look at the cost function at each one of these stationary points and see how it reflects or how it behaves uh similarly as predicted by these eigenvalues all right so here's what the cost function looks like it's this orange surface and again these yellow uh green dots are the four stationary points that we solved for so uh what we can do is let's zoom in and make the plot much closer at each one of these green dots to examine how does the cost function behave near these stationary points and again we know that it should function and behave very similarly to that second order taylor series expansion of the function around each one of these points um so therefore this eigenvalue analysis should hopefully be valid so again if we zoom in here's the first stationary point remember this was the stationary point at zero zero right and as you can see we said that the eigenvalues for this were zero and zero so the test was inconclusive and actually what we see here by just examining this a little bit more closely is we see that yes this is indeed a saddle point because it appears there are directions that you can travel where the function increases other ones where the function decreases etc that was a similar story for um for uh stationary point number two right this was the second stationary point at uh x1 of 0 and x2 of negative 1. 
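The two-variable cost function itself is not legible in these captions, so here is the same workflow sketched in Wolfram Language on a stand-in function; the numbers differ from the video's example, but the steps (gradient, stationary points, Hessian, eigenvalues) are the ones just described.

f[x1_, x2_] := x1^4 + x2^4 - 4 x1 x2;              (* stand-in two-dimensional cost function *)
grad = Grad[f[x1, x2], {x1, x2}]                   (* {4 x1^3 - 4 x2, 4 x2^3 - 4 x1} *)
pts = Solve[Thread[grad == 0], {x1, x2}, Reals]    (* stationary points (0,0), (-1,-1), (1,1) *)
hess = D[f[x1, x2], {{x1, x2}, 2}];                (* Hessian: matrix of second partial derivatives *)
Eigenvalues[hess /. #] & /@ pts
(* at (0,0) the eigenvalues have mixed signs -> saddle point; at (1,1) and (-1,-1) both are positive -> strict local minima *)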
and again same thing because the eigenvalues were uh what they were split we had an eigenvalue of positive one and negative negative one we see again the function is basically acting like a another saddle point here same story for the third um stationary point again we we zoom in on this and you can see it does definitely look like a saddle point in some directions the function goes uh up and other places it goes down and finally we come to the last stationary point which we had an eigenvalue of negative 0.75 and negative 0.28 and therefore it shows that this is a local maximum and furthermore a strict local maximum and look that is exactly what it looks like right this is the the highest value of the function is here at that green dot so uh as you can see this method seems to work well okay and the reason i said this method seems to work well is because it can be a little bit uh let's call it fragile um meaning for example what if we had a different cost function right still a two dimensional cost function but let's say you're going to change this let's make a new cost function and you know something fairly benign how about something like uh cosine of x1 times x1 x2 all over uh i don't know uh x2 plus 1 squared you know it's not something completely outlandish right and in fact you can plot this so here's a here's a mathematical plot of this surface and we can clearly see that yeah there appeared to be stationary points um on this uh on this function where there's no slope so now all we have to do is go ahead and compute the gradient and go ahead and try to solve um for this so again you can do this you can compute the gradient right and that's actually not a big problem so this now just gets changed if you compute the gradient of this thing it comes out to well you know what i don't think it's worth it to for me to draw it here on the board in fact i'll just show you a screenshot of me trying to do this in mathematica and again you can compute the gradient no problem right here it is here's the gradient now the problem comes in when we try to solve for where this gradient is equal to zero right you try to throw that at mathematica or some other solver or heck you try to do it by hand and you're not going to be able to actually do it right so here's the problem analytically finding the stationary points by trying to solve where the gradient is equal to zero is not always feasible and actually for the most part it's actually usually quite difficult so this method falls flat right here because we can't get the stationary points of this so uh that's i guess not good and bad it's good in the sense well i guess it's bad for us because this it means this method is a little bit dead in the water right now because we can't even find the stationary points but it is good because that motivates our discussion of the next topic of well if we're not able to analytically find these stationary points and therefore find local minima or maxima or any of these extrema is there a numerical routine or a numerical approach or algorithm that we can throw at this problem to try to a find the stationary points and then b assess their optimality um as we go forward so like i said that's a small little teaser for our next discussion where we're gonna try to develop some tools and processes for that so with that being said i think this is probably a good spot to leave it this was a pretty good comprehensive look at at least an introduction to unconstrained optimization and how we can go about developing some of 
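As a hedged sketch of this last point: the function as heard in the narration appears to be f(x1, x2) = x1 x2 cos(x1)/(x2 + 1)^2 (the expression on the board may differ). Its gradient is easy to compute, but the resulting system is transcendental in x1, so no closed-form solution for the stationary points is available; a numerical root-finder, previewing the next video's topic, can still locate a stationary point from an initial guess.

f[x1_, x2_] := x1 x2 Cos[x1]/(x2 + 1)^2;       (* as transcribed from the narration; may differ from the board *)
grad = Simplify[Grad[f[x1, x2], {x1, x2}]]
(* Solve[Thread[grad == 0], {x1, x2}] has no general closed form here (transcendental in x1), *)
(* so a numerical search from a starting guess is one way forward; convergence depends on the guess: *)
FindRoot[Thread[grad == 0], {{x1, 1}, {x2, 2}}]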
these tools and theorems and processes and checks to try to find uh and assess uh local minima for unconstrained problems um so i hope you enjoyed the video and if so i also hope you'll consider subscribing to the channel surprisingly if you scroll a little ways down there and click on that subscribe button it really does help me continue making these videos um alternatively you can also support the channel via patreon i want to take a moment to support to shout out and thank some of our recent patreon supporters the nice thing about the patreon support method is that 100 of the proceeds that the channel receives via patreon will be directed towards um k through 12 science engineering and adventures for kids and young adults so with that being said i hope to catch you at one of our future discussions remember the new videos come out every monday and i hope to see you at one of those and we can all learn something new together so until then i think i'll sign off talk to you later bye
Info
Channel: Christopher Lum
Views: 1,904
Keywords: Hessian, definiteness of matrix, local min, global min, local max, global max, unconstrained optimization, convex optimization
Id: 6NB4QiKId2w
Length: 69min 17sec (4157 seconds)
Published: Mon Sep 13 2021