Lecture 8 | Convex Optimization I (Stanford)

Captions
This presentation is delivered by the Stanford Center for Professional Development.

Okay, so we're going to really start this topic of duality. Last time I think I did nothing but say a few things about it that were kind of incoherent, but maybe we'll make them coherent today. So what duality is about, one way to think about it, is that it's a way to handle the constraints by incorporating them into the objective; we'll see what that means in a minute. That's the basic idea. And it fits in exactly with what I was talking about last time, which had to do with this concept of irritation versus value for a function: the idea of a hard constraint, where you go from completely neutral to infinite irritation, as opposed to something soft like an objective, where you just have a linear irritation function or something like that. That means the larger the function is, the more irritated you get, and the smaller it is, the happier you get. So we'll see how that ties in to the idea of duality.

All right, so first we start with a couple of definitions. We start with a problem like this: minimize an objective subject to inequality constraints and equality constraints. We are not going to assume this is convex, and in fact some of the most interesting applications of the material we're going to talk about now occur when this problem is not only not convex but is provably hard.

We form the Lagrangian. The Lagrangian is simply this, and it's really kind of dumb: you take the objective, and to the objective you add a linear combination of the inequality constraint functions and the equality constraint functions. That's it. Now the Lagrangian is a function of these; it's got some other arguments, of course. It's got x in it, but it's also got λ and ν, and these are called the Lagrange multipliers. They're also called dual variables; there are all sorts of names for what these are. In specific contexts they can be other things: they can literally be prices, or penalty rates, and you'll hear all sorts of other names for them in other fields like economics. So they've got lots of names; we'll see later where they got some of those names.

Okay, and here you'd say, for example, that λ₃ is the Lagrange multiplier, or dual variable, or price (I'll justify that term soon) associated with the third inequality, f₃(x) ≤ 0. That's what λ₃ is. And in fact you can already get a hint at what's going on here. If, for example, λ₃ were 6, that would mean the following. Up here, the constraint says that if f₃ is less than or equal to zero, it's perfectly okay, and if it's bigger than zero, it's infinitely bad. When you replace this constraint by this charge, λ becomes the price, or you can call it a penalty, and it basically says: f₃ can be positive, that's not a problem now, but you're going to pay for it, and for every unit that f₃ is positive you're going to pay six, in whatever those units are.

Now the flip side is actually quite interesting: you actually get a subsidy for coming in under budget. Up here, as we said last time, there's absolutely no difference between f₃(x) being −0.001 and −100. Absolutely no difference: both are feasible, and you have no right to say that you like one better than the other, period, because it's not an objective, it's a constraint. That's the semantics of a constraint. And you can wave your hands and say "oh no, I like −100 better because then there's a margin," but if that's your feeling, then you haven't set up the problem right. Okay, so when you convert that one constraint to this sort of thing, something interesting happens: when f₃ is less than zero you actually get a subsidy, because the subsidy is 6 times whatever margin you have.
So when you convert this constraint to a term in a Lagrangian, you actually get a subsidy for coming in under budget, and you do like f₃ = −100 more than you like f₃ = −0.001, because you get a subsidy. I'm just hinting at the ideas that we're going to see soon.

Okay, well, that's the Lagrangian; nothing has happened so far. Then we look at the so-called Lagrange dual function. It's got other names, but let's look at it. It's actually very, very simple: it just says minimize the Lagrangian over all x. That's it. Now it's actually very interesting, because of what it's really saying, and by the way, it's now a function of these prices, or dual variables. There are all sorts of ways to interpret this. You can interpret it as, I don't know, the free-market approach or something like that. The original problem is the constrained approach, where you simply say by fiat that fᵢ(x) has to be less than or equal to zero and hᵢ(x) has to be zero. Here you'd say: no, actually, we're going to let fᵢ float above zero, no problem, we'll just charge you for it, if the λᵢ's are positive. But if fᵢ goes less than zero, we'll actually pay you for it; you'll be subsidized. And the dual function is sort of the optimal cost under these market prices. All of this will become clearer as we move on.

Now here's an interesting fact. If you look at this function as a function of λ and ν, what kind of function is it? For each x it's affine: it's linear plus a constant. And the infimum of a family of affine functions is of course concave. So this function g, which you can think of as sort of the optimal cost as a function of the prices (we'll get to that later), is concave even if the original problem is not convex. Even if the fᵢ's are not affine, even if nothing here is convex, this dual function is absolutely concave.

Okay, so now we get to something. This is all very simple, but it's one of those things where you get a sequence of like twelve simple things, and the right sequence of twelve simple things will lead you to a very interesting thing, so trust me, that's what we're doing here. The first thing you observe is the following: if these dual variables or prices are nonnegative (these are the ones associated with the inequality constraints), then g(λ, ν) is less than or equal to p★, the optimal value of the original problem. That's actually quite an important statement. It looks very simple, but it's quite profound: it says that if you can evaluate g, this dual function, you get a lower bound on the optimal value of the original problem.

The argument is embarrassingly simple; let's do it this way. Look at the Lagrangian and imagine x is feasible. For any feasible x, fᵢ(x) is less than or equal to zero, and if λᵢ is greater than or equal to zero, then λᵢ fᵢ(x) is less than or equal to zero, and therefore that whole sum is less than or equal to zero. If you're feasible, hᵢ(x) is zero, so it doesn't even matter what the sign of νᵢ is; that term is zero. And therefore the Lagrangian satisfies L(x, λ, ν) ≤ f₀(x) for any feasible x. For infeasible x that's false, but for feasible x it's absolutely true, because the first sum is at most zero and the second is zero. And by the way, note the level of the mathematics being employed here; it's quite deep. It relies on the very deep fact that the product of a non-positive number and a non-negative number is non-positive.
Oh, and also that you can add such numbers up and still have something less than or equal to zero. So I just want to point out that nothing has been done here; it's just embarrassingly simple. Okay, now if you then infimize this over x, then obviously the infimum is less than or equal to f₀(x̃) for any feasible x̃, and there's no conclusion possible other than g(λ, ν) ≤ p★. So if you can evaluate this Lagrange dual function, you get a lower bound, period.

So let's look at some examples. Let's do the least-norm solution of linear equations: minimize xᵀx subject to Ax = b. Now of course this example is a bit silly, in the sense that we know how to solve this problem analytically; we know everything there is to know about it. But it's just an example, to see how this works.

Well, the Lagrangian is this: xᵀx, that's the objective, plus νᵀ(Ax − b). By the way, here you have to decide what h is: h is either Ax − b or sometimes b − Ax, and all that would happen is the sign on ν would flip, so it doesn't matter. I've written it as Ax − b = 0, so you get L(x, ν) = xᵀx + νᵀ(Ax − b). Now we're going to minimize this over x. That's completely trivial, because it's a convex quadratic in x: you just set the gradient to zero, 2x + Aᵀν = 0, and you get x = −(1/2)Aᵀν. That's the x that minimizes the Lagrangian.

All right, now we take that x and we plug it back in to get g. When you plug it in, you get g(ν) = −(1/4)νᵀAAᵀν − bᵀν. Well, first of all, let's take a look at it. Evidently this function is concave, because AAᵀ gives a positive semidefinite quadratic form. Is it positive definite? Actually, it doesn't matter in this case, because I'm making no assumption about A whatsoever; everything here is true no matter what I assume about A. So this whole function is a concave quadratic, which we knew had to happen, because the dual function is always concave.

Now here's what's interesting: we've already learned something. It's not a big deal yet, because I know how to solve this problem, but look at this. What it says is the following: if you come up with any vector ν at all, with the height of b (let's call that m), and you simply evaluate this function, then whatever number you get is a lower bound on the optimal value of this problem. Period.

Now it might look, right off the bat, like that's totally useless. If it's a small problem, say x has a thousand variables or something like that, this is goofy, because we can solve this problem extremely efficiently and get the optimal solution; we don't need a lower bound on it. But this is actually immediately useful in other contexts. In other contexts you might want to solve a least-norm problem like this with, let's say, a million variables or 10 million variables. This comes up; in fact, not quite this one, but ones very close to it. For example, if you solve an elliptic PDE, you're solving something not quite this but very close to this, except it has a million variables or more. In that case you can't use your formula from EE263, the (AAᵀ)⁻¹ with the Aᵀ on one side or the other, which I forget, but it doesn't matter: you can't use that formula when the problem is a million by million.
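Before the large-scale story, here's a tiny numerical check of this bound. This is just a sketch with made-up data (A = [1 1], b = 1, so the minimum-norm solution is x = (1/2, 1/2) and p★ = 1/2), using the dual function g(ν) = −(1/4)νᵀAAᵀν − bᵀν derived above:

```python
# Numerical check of the lower-bound property for the least-norm problem
#   minimize x^T x  subject to  A x = b
# with hypothetical tiny data A = [1 1], b = 1.  Here A A^T = 2, and the
# dual function derived above is g(nu) = -(1/4) * nu^2 * 2 - b * nu.

b = 1.0
p_star = 0.5   # optimal point is x = (1/2, 1/2), so p* = x^T x = 1/2

def g(nu):
    return -0.25 * nu * nu * 2.0 - b * nu

# Any nu at all gives a lower bound on p*:
for nu in [-3.0, -1.5, -1.0, 0.0, 2.0]:
    assert g(nu) <= p_star + 1e-12

# The best choice here is nu = -1, which attains the bound exactly
# (and indeed x = -(1/2) A^T nu = (1/2, 1/2) is feasible):
assert abs(g(-1.0) - p_star) < 1e-12
```

That ν = −1 makes the bound tight is special to this example; in general all we have claimed so far is the inequality.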
Instead of the closed-form formula, you're going to use some iterative procedure to come up with an x that approximately satisfies Ax = b, and you're going to have to ask yourself: when should I stop? Well, here's a very good way to do it. You look at your procedure, and you find out if, during the procedure (and this will generally be the case), some side calculation produces something that's a candidate for this ν. Actually, it's no one's business how you come up with a vector ν; if you evaluate g(ν), that number is a lower bound. So now you know when to stop: you stop when the residual Ax − b is close enough to zero and the objective value is close to the lower bound g(ν). Then you stop, and you don't stop because you're bored; you stop because you can now prove you're within, say, one percent suboptimal, and no more.

By the way, all of that was just an aside, so if you didn't follow it, that's fine. I'm hinting that this idea, the simple fact that you get a lower bound on a problem, actually has fantastic practical use; in fact, many, many practical uses.

Okay, let's look at a standard form LP. I want to minimize cᵀx subject to Ax = b and x ⪰ 0; just a standard LP. The Lagrangian is cᵀx, plus a Lagrange multiplier term for the equality constraint, νᵀ(Ax − b), and then, to put the inequality in standard form, you kind of have to write it as −x ⪯ 0, so the fᵢ's here are −xᵢ. That explains the minus sign on the λ term: L(x, λ, ν) = cᵀx + νᵀ(Ax − b) − λᵀx. So that's your Lagrangian.
Now, what kind of function is the Lagrangian in x? "Linear," you say? Hey, that's false, I believe; "false" is the name for something that's not true. If we say the Lagrangian is linear in x, that's not quite right: it's affine, because there's this constant term −νᵀb in it. L is affine in x.

Okay, now this leads us to something related to the homework from this week, which in effect asked: here's a very, very simple linear program, ready for it? No constraints. How do you minimize an affine function? Actually, more than a small number of people were stumped by this, I think because it's such a simple-looking question that they wanted to make it more complicated. Well, let's start with: how do you minimize a linear function? What's the minimum of a linear function in R², say x₁ − 3x₂? It's of course minus infinity: you have level curves, which are hyperplanes, and you just go as far as you can in the direction −c. So there's no mystery here. And what's the minimum of an affine function? Also minus infinity.

Oh, by the way, notice that that's valid: if g is minus infinity, that's cool, and note that it is indeed a lower bound. However, it's an uninformative lower bound, because if someone comes up to you and says "can you help me get a lower bound on this problem?", you can just automatically, without even listening, say "minus infinity." That works always. So the point is that g = −∞ is indeed a lower bound; it's just completely uninformative.

All right, now there's one case when the minimum of a linear function is not minus infinity. Exactly one. What's that? When the function is zero; that's the only way. So in fact, if you minimize this Lagrangian over x, you will get minus infinity, period, with one exception: if the linear part, c + Aᵀν − λ, is zero, then that whole term goes away and you get g(λ, ν) = −bᵀν.

So the dual function for the standard form LP is a very interesting function, and we're going to look at it carefully. It is a function which is linear on this weird affine set, and it's minus infinity otherwise. By the way, that function is concave. No problem, just visualize it. It's kind of weird: in R² it would be like drawing a line and marking values along it, 0, 1, 2, −1, so if you made a graph it would be a sloped line; and now, everywhere off that line, the function value is minus infinity, so it just falls off a cliff everywhere else. That function is concave. Well, it has to be concave, because we know g is always concave, but if you want to visualize it, that's what it looks like. So it's a weird thing: it's linear, but off this thin affine set it falls to minus infinity.

There was a question about this function, which I write this way; you just have to blur your eyes, because I'm asking what kind of function of x it is. So look: c + Aᵀν − λ is a constant vector, or, if I include the transpose, a constant row vector, so that term is linear in x, and I add a constant, so it's affine. You have to kind of blur out the details when someone asks you how this varies with x; in fact, that's kind of the point.
Okay, now this is really interesting; we can actually say what the lower bound property gives us. This dual function is always a lower bound on that LP. Now of course, if you randomly pick λ and ν, you're going to get minus infinity here, period, which is still a valid lower bound, just a completely uninformative one; it's the lower bound that works for all problems, the universal lower bound. But in fact what we're going to do is this: suppose I pick ν such that Aᵀν + c is nonnegative. Then I simply let λ equal Aᵀν + c, and with that choice the condition c + Aᵀν − λ = 0 holds, and I get the following: p★ ≥ −bᵀν, provided Aᵀν + c ⪰ 0.

So let's see what we've shown. And by the way, I want to point out that absolutely nothing here used anything deep at all; just the most elementary arguments involving products of nonnegative numbers and things like that. It says the following: if you have this linear program and someone asks what its optimal value is, well, it depends on the context. If a person has a specific A, b, and c and just wants their problem solved, you can run some code and solve the problem, if it's not too big, and if that's what they're interested in, fine. However, you can also make a very general statement. You can say the following: if you find any vector ν, by any means at all, and it's no one's business how you found it, such that Aᵀν + c is a nonnegative vector, then −bᵀν is a lower bound on this LP. Period. And that's not obvious; I mean, it strings together three or four totally obvious things, but you come up with something that's not obvious.
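As a sanity check, here's a sketch of that certificate on made-up data: the LP "minimize x₁ + x₂ subject to x₁ + x₂ = 1, x ⪰ 0," whose optimal value is clearly p★ = 1:

```python
# Checking the LP lower bound  p* >= -b^T nu  whenever  A^T nu + c >= 0,
# on a hypothetical tiny standard-form LP:
#   minimize  x1 + x2   subject to  x1 + x2 = 1,  x >= 0,
# whose optimal value is p* = 1.

c = [1.0, 1.0]
A = [[1.0, 1.0]]   # one equality constraint
b = [1.0]
p_star = 1.0

def lower_bound(nu):
    # nu has one entry; require A^T nu + c >= 0 componentwise,
    # i.e. the choice lambda = A^T nu + c is a valid multiplier
    At_nu_plus_c = [A[0][j] * nu[0] + c[j] for j in range(2)]
    if all(v >= 0 for v in At_nu_plus_c):
        return -b[0] * nu[0]
    return float("-inf")   # dual function is -infinity off the affine set

for nu0 in [-1.0, -0.5, 0.0, 3.0]:
    assert lower_bound([nu0]) <= p_star

assert lower_bound([-1.0]) == p_star         # the bound is tight at nu = -1
assert lower_bound([-2.0]) == float("-inf")  # A^T nu + c has a negative entry
```

Again, the tightness at ν = −1 is a feature of this particular example, not something we have proved in general.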
Okay, let's do equality-constrained norm minimization. Here you minimize ‖x‖ subject to Ax = b. By the way, we've seen that problem before; in fact we just did it two examples ago. Well, not quite: we did the case where the objective was the square of the two-norm. Now it's a completely general norm.

Well, the dual function is the infimum over x of ‖x‖ + νᵀ(b − Ax); the bᵀν part is a constant, so it's totally irrelevant, and what you have to be able to do is minimize something of the form ‖x‖ − yᵀx over x. Now this goes back to the idea of a dual norm, so let's look at that: what is inf over x of ‖x‖ − yᵀx? The answer is actually pretty simple; it comes straight from dual norms. We can do it for the two-norm first, just as a warm-up. For the two-norm you'd say: well, look, if ‖y‖₂ is bigger than 1, then I align x with y in that direction, and the linear term has enough gain to overpower the norm term, and I can make the whole thing go to minus infinity. On the other hand, if ‖y‖₂ is less than or equal to one, Cauchy-Schwarz tells me that yᵀx ≤ ‖x‖, and therefore the whole thing is greater than or equal to zero; I could never make it negative. And by choosing x = 0 I can make it zero, so that's clearly the optimum. So this infimum is equal to 0 if ‖y‖₂ ≤ 1, and −∞ otherwise.

And this generalizes to a general norm: the infimum is minus infinity if the dual norm ‖y‖∗ is bigger than one, and zero otherwise. In fact that's essentially the definition of the dual norm. So this is what you get.
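The two cases of that warm-up argument can be checked numerically. A small sketch with hypothetical sample points in R²:

```python
# A quick numerical look at  inf_x ( ||x||_2 - y^T x )  in R^2,
# checking the two cases argued above (hypothetical sample points).

import math

def val(x, y):
    return math.hypot(x[0], x[1]) - (y[0] * x[0] + y[1] * x[1])

# Case ||y||_2 <= 1: the expression is never negative, and x = 0 gives 0.
y = (0.6, 0.8)                     # ||y||_2 = 1.0
for x in [(1.0, 0.0), (-2.0, 3.0), (0.6, 0.8), (5.0, 5.0)]:
    assert val(x, y) >= -1e-9
assert val((0.0, 0.0), y) == 0.0

# Case ||y||_2 > 1: scaling x = t*y drives the expression to -infinity.
y = (3.0, 4.0)                     # ||y||_2 = 5.0
vals = [val((t * y[0], t * y[1]), y) for t in (1.0, 10.0, 100.0)]
assert vals[0] > vals[1] > vals[2]   # decreasing along the ray
assert vals[2] < -1e3                # and heading to -infinity
```

Plugging y = Aᵀν into this fact is exactly what gives the dual function on the next step.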
Applying this up here gives us exactly the dual function: with y = Aᵀν, you get g(ν) = bᵀν if ‖Aᵀν‖∗ ≤ 1, and −∞ otherwise. So once again this dual function is not an obvious thing: it's a linear function, but it's got a weird domain; in this case the domain is the set of points ν where the dual norm of Aᵀν is less than or equal to 1. You can go through the lower bound argument here, and I won't, but you get something totally non-obvious: it says that if you can come up with a vector ν for which Aᵀν is less than or equal to 1 in dual norm, then bᵀν is a lower bound on the optimal value of this problem. Here's an example of a dual feasible point: ν = 0. Let's check: the dual norm of zero is definitely less than one. And now, with a drumroll, we find out what lower bound we've come up with: the lower bound is zero. That's not particularly interesting here, because the objective is a norm, which is obviously nonnegative anyway. So anyway, this gives you a parameterized lower bound; it's parameterized by ν.

Okay, now we're going to look at a problem we'll see a whole bunch of times. It's a simple but perfectly good example of a hard combinatorial problem: two-way partitioning. It's embarrassingly simple; it goes like this. I want to minimize a quadratic form xᵀWx, where W is symmetric (obviously I can assume that) but not necessarily positive semidefinite; it wouldn't matter if it were. I minimize this quadratic form subject to xᵢ² = 1, which means each xᵢ is plus or minus 1. Let me first say a little bit about this problem, since we're going to see it a lot, just so you get a rough idea of what it means.
So x is a ±1 vector, and we can really think of it as the following: you have a set of n points, and you want to partition them into two groups. We encode that by saying xᵢ = +1 for one group and xᵢ = −1 for the other. So we're going to use the variable x to encode a partition; it's just a numeric data structure for a partition.

All right, then let's look at what the objective is. The objective is the sum over all pairs of Wᵢⱼ xᵢ xⱼ. If xᵢ and xⱼ are in the same partition, xᵢxⱼ is 1, and Wᵢⱼ gets added to the cost, which we want to minimize. If xᵢ and xⱼ are in opposite partitions, xᵢxⱼ is −1 and Wᵢⱼ is subtracted. So, hopefully I'll get this right: Wᵢⱼ is a measure of how much i hates j. I believe that's right, because if Wᵢⱼ is very high and xᵢ and xⱼ have the same sign, you're assessed a big charge in the cost; on the other hand, if they're in opposite groups, the cost is decremented a lot and happiness goes up. So Wᵢⱼ is basically how much i annoys j; but it's symmetric, so really it's the average of how much i annoys j and how much j annoys i. And if Wᵢⱼ is small, it means they don't care much.

So in fact this makes perfect sense: you have a group of people, a social network or something like that, and you want to partition it. If everybody likes everybody, it's trivial, and there are other obvious easy instances: if the sign pattern in W were such that everybody liked everybody except one, it would be very simple, you'd just isolate that one node. But in general, actually finding the solution of this problem is extremely hard. You can't do it: if there are a couple hundred variables, it just cannot be done. So for us this is going to be a canonical example of a hard problem; there are instances of it which are easy, as I just mentioned, but I'm talking about the general case. And by the way, my interpretation was sort of a joke, but it comes up in tons and tons of real applications: it comes up in circuit partitioning, it comes up in statistics, just everywhere. So it's a very real problem with real applications.

Okay, now the dual function. We simply take xᵀWx and, as the Lagrangian tells us to, add a linear combination of the constraint functions, which I write as xᵢ² − 1. So I get L(x, ν) = xᵀWx + Σᵢ νᵢ(xᵢ² − 1) = xᵀ(W + diag(ν))x − 1ᵀν, and I have to minimize this over x. The Lagrangian is quadratic: the quadratic part is W + diag(ν), and by the way I make no assumption whatsoever on whether W + diag(ν) is positive semidefinite.

Now, a few minutes ago we had a discussion about how to minimize a linear function; now we can have a very short discussion about how to minimize a quadratic function. The first thing to ask is: how do you minimize a quadratic form? What is the minimum of a quadratic form? It can be negative infinity. When would it not be? If the quadratic form is positive semidefinite, then the only values it takes on are nonnegative, so it couldn't be minus infinity. And that's exactly the condition: the minimum of a quadratic form is minus infinity if the matrix is not positive semidefinite; if it has even one negative eigenvalue, the minimum is minus infinity. Otherwise, if it's positive semidefinite, the minimum is zero, period, because it can't be any lower than zero, and zero is attained by plugging in x = 0. Now, if you ask what the minimum of a general quadratic function is, a quadratic form plus a linear term, it's a little more subtle, and there are some horrible corner cases at the boundary; but there's no affine term in x here, so the simple case works, and it's really simple: if the matrix W + diag(ν) is positive semidefinite, then the optimal x is zero and you just get g(ν) = −1ᵀν.

So it turns out that the dual feasible set, meaning the set where g is bigger than minus infinity, is given by W + diag(ν) ⪰ 0. That's a linear matrix inequality in ν, a very famous one, and it describes a convex set. And let's review what this says. It says the following: suppose you want to partition a group of, say, three hundred nodes optimally. There's no way to do this globally; I mean, there are ways to do it globally, but they'll just run forever and not get very far.
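Here's a tiny numerical sketch of the certificate just derived, with made-up 3 × 3 data. One hedge: rather than computing eigenvalues, the sketch picks ν so that W + diag(ν) is diagonally dominant with nonnegative diagonal, which is a sufficient condition for positive semidefiniteness (by Gershgorin's circle theorem):

```python
# Two-way partitioning, n = 3: a lower bound from dual feasibility.
# If W + diag(nu) is PSD, then  -sum(nu) <= p* = min over x in {-1,1}^n of x^T W x.
# Hypothetical data; we pick nu_i = sum_{j != i} |W_ij| so that W + diag(nu)
# is diagonally dominant with nonnegative diagonal, hence PSD (Gershgorin).

from itertools import product

W = [[0.0, 1.0, 2.0],
     [1.0, 0.0, 3.0],
     [2.0, 3.0, 0.0]]
n = 3

nu = [sum(abs(W[i][j]) for j in range(n) if j != i) for i in range(n)]

# Diagonal dominance check (sufficient for W + diag(nu) to be PSD):
for i in range(n):
    off = sum(abs(W[i][j]) for j in range(n) if j != i)
    assert W[i][i] + nu[i] >= off

lower = -sum(nu)   # g(nu) = -1^T nu; here -12

# Brute-force the true optimum over all 2^n sign vectors:
def quad(x):
    return sum(x[i] * W[i][j] * x[j] for i in range(n) for j in range(n))

p_star = min(quad(x) for x in product((-1, 1), repeat=n))

assert lower <= p_star
assert p_star == -8.0    # attained e.g. at x = (1, 1, -1)
assert lower == -12.0    # the bound is valid but not tight here
```

Note the gap: the certificate proves p★ ≥ −12 while the true optimum is −8, which is exactly the "valid but maybe not sharp" situation described next.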
or whatever and somebody partitions it and then says here's my partition and someone else has another partition and so on and so you get the best of all the partitions have been proposed and then you want to know how far are you from the best that's the question and at least one way to at least partially answer that question is as follows if you can come up with some numbers new and it's absolutely no one's business how you came up with them no one's business if you can calculate some numbers new verify that W plus diag of new is bigger than or equal to zero period just that's all you have to verify then negative some of the news is a lower bound on this thing okay so that's that's that's all we know so far and this is not obvious this is not in any way obvious although I repeat it uses the most elementary mathematics to get there I mean here we use the fact that minimum quadratic form is minus infinity unless it's positive semi definite which case is zero I'll skip this for now anyway you'll read about this so this is already quite interesting that you can get a lower bound fact it kind of makes you want to at do something like Maxim get the best lower bound in terms we'll get to that but it's a semi definite program and then you can already see how semi do this once you see that you realize that's why there's a strong connection between semi definite programming for example in combinatorial optimization because this is a toriel optimization problem that's what this is yep tell us about actual optimal exits no absolutely not it's a lower bound now what I haven't said is whether the lower bound is sharp or not yet in this case it's not it's just a lower bound that's all so so far that's all I've said it's just a lower bound in fact then what let me tell you what the Lagrange dual function is for you right now it is a lower bound on an optimization problem gives you lower bound it of course can give you its parameterised by lambda and nu by these dual so-called dual 
Now, in some cases, if you plug in some lambda and nu, you're going to get the following lower bound: minus infinity. It's a valid lower bound — totally uninformative — but it's always a lower bound. In other cases you plug in lambda and nu and you actually get an interesting lower bound, and that's not obvious. So right now you should just think of it as a lower bound: lower bounds can be good, they can be tight, they can be crappy; we're going to get to all that later. Okay. I just want to tie together the idea of the conjugate function and the Lagrangian. So if you have a problem with an objective f0 and just linear inequality and equality constraints, and you work out what the dual function is, it's the minimum of f0 plus — I collect the terms together and multiply by x — and then there's a constant. And what this means is the following: if I focus on this, and then go look up what the conjugate function is — which was the sup over x of y transpose x minus f of x — and plug in all the right minuses, you get this; it's equal to that. So the Lagrange dual function of this thing is exactly given by the conjugate, period. Now recall — actually, when we calculated a couple of conjugate functions — recall that conjugate functions often have domains that are not everything. The conjugate of the max function, for example, turned out to have a very small domain: it was the probability simplex. So when that happens, that will automatically impose some equality constraints in here. But here's an example: the maximum entropy problem is to maximize minus the sum of x_i log x_i subject to some inequalities and equalities. And by the way, that's already a really interesting problem, because it says lots of things: it says, basically, find me the maximum entropy distribution that has these expected values.
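The conjugate identity just described can be spot-checked. For a problem "minimize f0(x) subject to Ax ≤ b, Cx = d," the dual function is g(lambda, nu) = -b'lambda - d'nu - f0*(-A'lambda - C'nu). A minimal sketch with made-up data, using the easy case f0(x) = (1/2)||x||², whose conjugate is f0*(y) = (1/2)||y||²:

```python
import numpy as np

# Check g(lam, nu) = -b^T lam - d^T nu - f0conj(-A^T lam - C^T nu)
# for f0(x) = (1/2)||x||^2, whose conjugate is f0conj(y) = (1/2)||y||^2.
# All data below is made up.
rng = np.random.default_rng(1)
A, b = rng.standard_normal((3, 4)), rng.standard_normal(3)
C, d = rng.standard_normal((2, 4)), rng.standard_normal(2)
lam, nu = rng.random(3), rng.standard_normal(2)  # lam >= 0

z = A.T @ lam + C.T @ nu
# Direct computation: the Lagrangian is minimized at x = -z.
x_min = -z
g_direct = 0.5 * x_min @ x_min + lam @ (A @ x_min - b) + nu @ (C @ x_min - d)
# Via the conjugate formula.
g_conj = -b @ lam - d @ nu - 0.5 * z @ z

print(np.isclose(g_direct, g_conj))  # True
```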
These are just known expected values — they could be moments, they could be probabilities, they could be anything — and these are inequalities on expected values. It's really quite a sophisticated problem to ask: what's the maximum entropy distribution on these points that has, for example, the following mean, the following variance, the following kurtosis, or has probability in the left tail less than such-and-such — you can go on and on and make a very sophisticated thing. That's the maximum entropy problem; that's this thing. And if you work out what the dual is, when f0 of x is the negative entropy here — so we're minimizing negative entropy — you will actually get a sum of exponentials. So the dual function for a maximum entropy problem is going to involve a sum of exponentials. Now, if you're in statistics — and I said statistics, not probability — this will be very familiar to you, because it has to do with the connection between exponential families and maximum entropy. We'll see more of this later, but that's just a hint. Okay, now we get to the dual problem, and writing down the dual problem is the dumbest thing ever. If someone walks up to you and says "I have a lower bound on my problem, but it's parameterized by this vector lambda and this vector nu," well, the only interesting thing about a lower bound is that it's a lower bound, and if someone has multiple lower bounds, obviously the higher the lower bound, the more interesting it is. So you could just ask: what is the best lower bound on the original problem you can establish via the Lagrangian? That's it. We don't know if it's sharp; we're just asking for the best one — it kind of wouldn't make any sense to examine any other. All right, that leads you to just this problem right here.
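The "sum of exponentials" remark comes from the conjugate of the negative entropy: for f(x) = Σ x_i log x_i one gets f*(y) = Σ exp(y_i − 1). A minimal numerical sketch of the scalar case, brute-forcing the supremum over a grid:

```python
import numpy as np

# The conjugate of t*log(t) is exp(y - 1): check numerically that
#   sup_t { y*t - t*log(t) } = exp(y - 1)
# by maximizing over a fine grid (y = 0.7 is an arbitrary test value).
y = 0.7
t = np.linspace(1e-6, 10.0, 200_001)
sup_numeric = np.max(y * t - t * np.log(t))
closed_form = np.exp(y - 1)
print(abs(sup_numeric - closed_form) < 1e-4)  # True, up to grid error
```

Applied coordinate-wise, this is exactly why the maxent dual involves Σ exp(·), and hence why exponential families show up.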
Now I want to point something out: this is always a convex optimization problem, no matter what the primal problem was. By the way, this is called the Lagrange dual problem, usually just shortened to the dual problem. In fact, people say "the dual problem" the same way we say "the optimal point," even in situations where we don't know there's a unique one — and we're actually going to see multiple duals; the way the word is used on the street, there are lots of duals, but we'll get to that. For now it really is the Lagrange dual, and it simply says: maximize the dual function subject to the lambdas being nonnegative; that's all. Now, often what happens is that g is minus infinity for some values of lambda and nu — we've already seen that a couple of times. That is not an interesting lower bound, and it's sure not going to help you maximize something to find a point where this is minus infinity. Well, the whole thing could be minus infinity everywhere — that can happen — the point is it's not an interesting value. So what often happens is you pull the implicit constraints in g out and make them explicit. For example, let's go back and look at this: the dual function for this LP is this weird thing — I drew it somewhere; there it is — this sick thing here, where along a certain line it's going up, but off that line it falls off to minus infinity, and we're simply to maximize that subject to lambda nonnegative. However, it's easier to take the implicit constraint out, and you end up with something that looks like this. So here's the so-called standard form LP, and then this is what it looks like when you have actually pulled out that implicit constraint. Technically speaking, this is not the Lagrange dual — you're given a little bit of license to form the Lagrange dual, do a little bit of trivial rearrangement, and people would still
call it the dual or something like that. Technically speaking, if you were deposed — under oath or something like that — you would say: nope, that is not the Lagrange dual; it is equivalent, under a very, very simple transformation, to the Lagrange dual of this thing, which is the problem where g is that sick function. I just want to point that out. And by the way, let's see what happens here: that is also an LP, and let's see what it says. I can say something about this problem: if you have a feasible nu here, then minus b transpose nu is a lower bound on the optimal value of this problem, period. This thing says: okay, you have a family of lower bounds; please get me the best lower bound. That's the meaning of this problem. By the way, we'll also see this has beautiful interpretations in a lot of cases. For example, in engineering design it will make a lot of sense: any x here, if it's feasible, satisfies the constraints — something that's feasible here would be a suboptimal design. And the nu will have a beautiful interpretation: nu in that case is a certificate on a limit of performance. That's what it is; that's the meaning of nu here. Actually, when you look at a real problem it will have physical significance — we'll get to lots of examples — but abstractly it's a limit of performance. Now, if you're doing engineering you may not care about this: if you've been working all day and somebody comes along and says "how are you doing on designing that thing?" and you go "I have got an excellent lower bound" — that is not generally what you do engineering for. Oh, but there is some engineering where that's really what you want to do.
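The "feasible nu certifies a lower bound" claim is elementary to verify. This is a sketch on a random made-up instance of the standard form LP "minimize c'x s.t. Ax = b, x ≥ 0," whose rearranged dual says any nu with A'nu + c ≥ 0 certifies the bound -b'nu:

```python
import numpy as np

# Weak duality for the standard form LP: any nu with A^T nu + c >= 0
# gives the lower bound -b^T nu on c^T x over the feasible set.
# We construct an instance where both a feasible x and such a nu exist.
rng = np.random.default_rng(2)
A = rng.standard_normal((3, 5))
x_feas = rng.random(5)            # x_feas >= 0
b = A @ x_feas                    # makes x_feas primal feasible
nu = rng.standard_normal(3)
c = -A.T @ nu + rng.random(5)     # forces A^T nu + c >= 0

lower_bound = -b @ nu
print(lower_bound <= c @ x_feas + 1e-9)  # True: the certificate works
```

The one-line proof is hiding in the construction: c'x + nu'b = (c + A'nu)'x ≥ 0 whenever x ≥ 0 and A'nu + c ≥ 0.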
If it's a question of how bad something could be — let's say you're at a bank and they want to know, okay, how much trouble are we in, what's the worst thing that could possibly happen — then a lower bound actually gets interesting; of course the signs get flipped, but that's the idea. Question: was that not technically the Lagrange dual, the one on the right? That's right — they're not the same thing. This one is maximizing a weird function: it's equal to minus b transpose nu provided that A transpose nu plus c minus lambda equals zero, something like that, and it's minus infinity otherwise. So that's what this is. And by the way, if you were being very careful, it would make a difference; let me explain. Operationally it makes a slight difference — it's kind of silly, but here it is. Suppose I throw in a nu for which A transpose nu plus c is not a nonnegative vector. Then in this problem, when I say "how about this nu, how do you like that?", what is sent back to me is the infeasible token: "your nu is infeasible." Over here it's more interesting: if I throw such a nu in, what comes back is an out-of-domain token from the objective function. Now, it's a concave function, so that means it's minus infinity. Either way, you basically get two slightly different exceptions thrown. You can call this silly semantics if you like, but these are not the same problem. By the way, don't focus on these minor things; that's something you can read and think about on your own, and
don't let silly little technical things get in the way of the big picture. The big picture is: you have an optimization problem, and you form another one called the Lagrange dual problem, and that dual problem essentially asks: what is the best lower bound on the optimal value of the first one that I can get using the Lagrangian function? That is what's important. Okay, so now we get to the idea of weak and strong duality. Weak duality says that d star is less than or equal to p star — let me say how this works. In this context this is called the primal problem — the original problem is called the primal — and this, the Lagrange dual, is then called the dual problem. So that's the primal and the dual. We've already assigned the symbol p star to the optimal value here, and we'll let d star be the optimal value of the dual problem. You always have d star less than or equal to p star. Why? Any dual feasible point gives a lower bound on p star, so the best one is also a lower bound. This is called weak duality. It's called weak because — let me review the deep mathematics required to establish it: it hinged on properties such as "the product of two positive numbers is positive" and "the sum of positive numbers is positive." So it's weak because you can explain it to somebody in junior high school. They might not have taken all fourteen steps, but there's nothing in them that's hard. So it's called weak. All right, weak duality always holds — convex, nonconvex, it's absolutely universal. It could be stupid: you can indeed have d star equal to minus infinity, in which case your best lower bound is of no interest whatsoever. That can happen, but the inequality is always true. Now, if we go to the partitioning problem and ask what is the best lower bound on
the two-way partitioning problem you can get from the Lagrangian — that is a semidefinite program, and now things get interesting, because — and this is something that was not known 15 years ago and absolutely inconceivable 20 years ago — I can tell you this: this SDP, you can solve. People can solve it; you can solve it like that for a thousand variables, just no problem, and if you knew what you were doing you could easily go to problems with 10,000 or 100,000 variables — of course W has to be sparse there. That means, by the way, that in the group of people you're partitioning, a lot of them don't have opinions about the others; that's what it would mean for W to be sparse. So it's a big matrix with a lot of zeros, some large plus ones — that's people who hate each other — and some minus ones — that's people who like each other — and then you have to partition. But I'm getting off topic. The point is: you can solve this SDP and you will get a lower bound on the two-way partitioning problem, and that is fantastically useful if you couple it with a heuristic for partitioning. So you run some heuristic — there are lots of heuristics, and some of them work really well. Now, you don't expect it to work all the time, because you are solving, after all, an NP-hard problem in general. But what happens is: you do a partition, and you say, here's my partition, and here's the number I got — whatever it is, it's x transpose W x, that's it — and you want to know, could there be a better one? You solve this SDP, and in fact you'll see that a lot of the time the numbers are pretty close. At least it's a good thing to know: you'd know, "I have a partition, and I'm at most such-and-such suboptimal, period," and you might just say, okay, that's good enough.
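The heuristic-plus-bound workflow just described can be sketched on a toy instance. The heuristic below (take signs of the eigenvector for the smallest eigenvalue of W) is one common choice, not something specified in the lecture; the lower bound is the dual bound with nu_i = -lambda_min(W), which works out to n·lambda_min(W):

```python
import itertools
import numpy as np

# Sandwich p* between a dual lower bound and a heuristic partition.
rng = np.random.default_rng(3)
n = 8
W = rng.standard_normal((n, n)); W = (W + W.T) / 2

evals, evecs = np.linalg.eigh(W)       # eigenvalues in ascending order
x_heur = np.sign(evecs[:, 0]); x_heur[x_heur == 0] = 1.0
upper = x_heur @ W @ x_heur            # value of a feasible partition
lower = n * evals[0]                   # dual bound: n * lambda_min(W)

# For n this small we can also brute-force the true optimum p*.
p_star = min(x @ W @ x for x in itertools.product([-1.0, 1.0], repeat=n))
print(lower <= p_star <= upper + 1e-9)  # True
```

The quantity `upper - lower` is exactly the "I'm at most such-and-such suboptimal" statement from the lecture, computed without knowing p*.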
All right, strong duality — this one will not rely on junior high math. Strong duality says the lower bound is tight: there's a lower bound that goes all the way up to the optimal value. That's strong duality; we'll see what it's equivalent to, but it is not trivial, and by the way, it often doesn't happen. In the two-way partitioning problem, if it held, you'd have P equals NP, because this dual we can solve in polynomial time. In fact, if you know about complexity and you have a problem that's not even approximable, then that tells you that you can't even bound the gap — but I won't go into that. Now here's the interesting part: when a problem is convex, you usually have strong duality. That's actually amazing, and it's going to have a lot of implications. It's going to be equivalent, by the way, to something involving the separating hyperplane — we'll see what it connects to. Now, there are entire books — multiple books, multiple courses (not here, but at some other schools you could take entire courses), thousands of papers — that elaborate on this one word: "usually." These are called constraint qualifications. A constraint qualification theorem goes like this: if the primal problem is convex, and — insert your constraint qualification here — then p star equals d star. That's a constraint qualification. You could devote your life to this. On occasion these issues actually do come up, but maybe
less frequently in applications than the people who devote their lives to it would like to think. I'm saying that, of course, because their grad students will watch this and alert them to it — just making trouble. Okay, so-called constraint qualifications. Now, if you're in this sub-sub-industry of constraint qualifications, then this one is like the sledgehammer, the most unsophisticated one there could possibly be — the basic one that everybody knows, the least-squares of constraint qualification rules: it's Slater's constraint qualification. Although, actually, I should look into it — I think the correct name here would probably be Russian, but we won't get into that now; I even know who to ask. I should switch it; that would cause even more trouble; I should do that, actually. Okay, so let's call it Slater's constraint qualification, and it says this — I'll state the simple case: if you have a convex problem like this, and there exists a strictly feasible point, then p star equals d star. And strictly feasible means not just that you meet the inequality constraints, but that you do so with positive margin for each one. That's the condition. Now, I should add that, basically, for most problems this covers pretty much everything in engineering. As much as people would make fun of Slater's constraint qualification and make up examples of why it's not sophisticated enough — and sure enough, there are problems where you don't have a strictly feasible point — for most problems that come up in engineering, and certainly anything in, say, machine learning, you're going to have this. It makes perfect sense, right? For example, suppose the third inequality was a limit on power,
and it says that the power of some circuit has to be less than a hundred milliwatts. Just think about it: if Slater's condition failed to hold, it would mean there exists a circuit that dissipates exactly 100 milliwatts, but no circuit that dissipates 99.9999 — because if there were, Slater's condition would hold. Everybody see what I'm saying? So in the end it's like, who cares: fine, the spec is now officially 100.001 — it doesn't make any difference, that's totally obvious — and now Slater's condition holds. So this is my general argument: it really doesn't make a whole lot of sense to worry too much about this in machine learning. Sometimes someone says, "What about Slater's? I took a whole quarter on constraint qualifications. What if Slater's constraint qualification fails when you're doing something like a support vector machine?" Then you look at the person and go: excuse me, where did the data in the problem come from? The answer: it came from a bunch of, like, spam filters and features and things like that. Who knows where the data came from, right? And certainly all the data could be wiggled a little bit and the stupid thing should still work. If solving that problem relied on fantastically subtle facts as to whether strict inequalities held, or weak inequalities, and one but not the other, then I've got news for you: you're not doing engineering, you're not doing statistics, you're not doing economics — you're doing something like pure analysis. So that's my little story on it. Again, there are actually cases where these come up in practice, but they're pretty rare, and mostly I'm saying this to irritate people at other universities — my colleagues, who will be alerted to this, watch this tape, and be very angry — so,
okay, but I thought I'd mention it. All right, let's go to the inequality form linear program. Here you want to minimize c transpose x subject to A x less than or equal to b. We can work out the dual — we didn't do this one, but it's easy enough. G of lambda is the infimum over x of c transpose x plus lambda transpose times A x minus b — I put the b on the left-hand side to make this "f less than or equal to zero." I infimize this, and we know how to infimize an affine function: you get minus infinity unless the linear part vanishes. So I get this, and so this is the dual problem. Notice this is actually not the dual problem — if there are lawyers present, you would say this is a problem that is trivially equivalent to the dual problem; if there are no lawyers present, you just say that's the dual problem. So that's it. Now, Slater's condition: the feasible set here is a polyhedron — and by the way, one possibility is that the feasible set is empty, which is in fact a polyhedron. What Slater's condition says here is geometrically very simple: if that polyhedron has nonempty interior — basically, there's an interior point — then you have strong duality, p star equals d star. So that's the picture. In fact, for an LP you can say much more: it turns out for an LP you always have strong duality except for one sick case, and that's when both the primal and the dual are infeasible — and by the way, you get quite a gap between p star and d star then, because p star is plus infinity and d star is minus infinity. Anyway, let's look at a quadratic program: minimize x transpose P x subject to A x less than or equal to b — that's minimizing a quadratic form over a polyhedron.
We're going to assume P is positive definite — that's so I can avoid the horrible way of writing down the infimum of a general quadratic function with a linear term; it's not that big a deal, I just don't feel like doing it, and this will work. So here the dual function is the infimum over x of x transpose P x plus lambda transpose times A x minus b, like that, and now I minimize over x. The nice part is that P is positive definite, so I know how to minimize this — the minimizer involves P inverse times something; I'm not even going to do it, because it's easy to minimize a strictly convex quadratic function. So I minimize it, plug that x back in, and I get this thing, which is minus one-quarter lambda transpose A P inverse A transpose lambda, minus b transpose lambda, and my dual problem looks like this. Oh, by the way, this really is the dual problem, because in this problem up here, notice the dual function's domain is all of R^m. So in this case the dual function has domain everything, which is to say you get a lower bound for any lambda you plug in — take random nonnegative numbers lambda and you're not going to get a trivial lower bound. You might get a rather stupid one: for example, you might get the lower bound minus seven. Let's talk about the lower bound minus seven. Why is the lower bound minus seven valid for this problem? Because the objective is always nonnegative. But the point is you get a lower bound. So that's the dual problem. And by the way, what we're saying here is not obvious at all.
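The closed form just derived can be verified directly. This is a sketch with made-up random data: minimize the Lagrangian over x (the minimizer is x = -(1/2) P⁻¹Aᵀlambda) and check that the value matches -(1/4) lambdaᵀA P⁻¹Aᵀlambda - bᵀlambda:

```python
import numpy as np

# QP dual function check: minimize x^T P x s.t. Ax <= b, P pos. def.
rng = np.random.default_rng(4)
M = rng.standard_normal((4, 4))
P = M @ M.T + np.eye(4)            # positive definite by construction
A, b = rng.standard_normal((3, 4)), rng.standard_normal(3)
lam = rng.random(3)                 # lam >= 0

# Minimizer of the Lagrangian x^T P x + lam^T (A x - b) over x.
x_min = -0.5 * np.linalg.solve(P, A.T @ lam)
g_direct = x_min @ P @ x_min + lam @ (A @ x_min - b)
# The closed form from the lecture.
g_formula = -0.25 * lam @ A @ np.linalg.solve(P, A.T @ lam) - b @ lam
print(np.isclose(g_direct, g_formula))  # True
```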
What we're saying is this: you want to solve this quadratic program — we haven't yet told you how to do it, or how it's done, or anything like that — but we'll tell you this: if you come up with any vector lambda that's nonnegative and you evaluate this concave quadratic function, you get a lower bound on the optimal value of this thing. This has lots of uses. For example, suppose someone says "I know how to solve this problem," and you say "how did you do it?" and they go, "are you joking? That's like IP — if I told you, I'd have to kill you. My new method — I'm patenting it right now in Kazakhstan; I can't tell you how I did it." You say, "what? Why should I believe that's the optimal x? How do you prove it?" They say, "watch this: check out this lambda — notice it's greater than or equal to zero." And you go, "yeah." Then you evaluate that number, and that number is equal to the value up here at the point. That, by the way, ends the discussion: that x is feasible, and you would call that lambda a certificate proving optimality. Everybody got this? And notice, they didn't have to say how they did it. Then you'd say, "hey, how'd you get the lambda?" and they go, "like I'm going to tell you that either — that's my patent application number two." But the point is: the lower bound has a value independent of where it came from, and you have this idea that you can certify the solution without revealing how you solved the problem originally — or, for that matter, how you came up with the certificate. We'll get to these ideas later. Now, Slater's condition says the following: if this polyhedron has nonempty interior, then these are absolutely equal — there always exists a certificate proving optimality of the optimal x, period. Always. Okay, now, by the way, a very small number of nonconvex problems have strong duality. I'm not going to go into it because it's complicated; it's covered in an appendix of the book, and I would encourage you to read it. It's not obvious, and actually there's a whole string of these — there are like 15 of them or something.
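The whole certificate exchange can be played out on a one-variable toy problem (made up here, not from the slides): minimize x² subject to 1 - x ≤ 0. The claimed solution is x = 1 and the handed-over certificate is lambda = 2; the dual function, from minimizing x² + lambda(1 - x) at x = lambda/2, is g(lambda) = lambda - lambda²/4:

```python
# Tiny worked instance of the certificate idea.
def f0(x):
    return x ** 2

def f1(x):
    return 1 - x          # constraint: f1(x) <= 0

def g(lam):
    return lam - lam ** 2 / 4   # dual function of this toy problem

x_claim, lam_cert = 1.0, 2.0
assert f1(x_claim) <= 0              # claimed point is feasible
assert lam_cert >= 0                 # certificate is dual feasible
assert g(lam_cert) == f0(x_claim)    # lower bound meets the value
print("certified: p* =", f0(x_claim))
```

Nothing about *how* x = 1 or lambda = 2 were found is needed: feasibility plus the matching bound ends the discussion, exactly as in the lecture's story.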
They're just weird things that have to do with specific problems that are nonconvex and just happen, for deeper reasons, to have zero duality gap. The quadratic ones are the ones we collect at the end of the book, in one of the appendices; there are others, and you'll see them there. They're kind of weird, and some of them are quite strange — one I've seen recently involves complex polynomials of degree four, where something that should be highly nonconvex has a zero duality gap, and it comes down to something in algebraic geometry. That's always the way; these are not simple. This is just to say there are nonconvex problems with zero duality gap. Okay, let's look at the geometric interpretation. So we're going to do a problem with just one constraint, and what we're going to do is draw, sort of, the graph of the problem: for each x in the domain — not each feasible x, each x in the domain — I'll evaluate this pair. So although the problem may be happening in 100 dimensions, for every x I'm going to plot a point in this plane, where one coordinate tells you the constraint function value and the other tells you the objective value. So everything over here corresponds to feasible, and the height corresponds to the objective value. So quite obviously, that's the optimal value of the problem, and any point that ends up colored there is optimal. That's the optimal value p star — everybody see that? So that's the idea. These points — I mean, that point has got a really very nice objective value, but it's infeasible, because its constraint function is positive. Okay, so that's p star. Now let's see what the dual is — how do you get the Lagrangian in this picture? Well, it works like this:
you minimize f0 plus lambda f1. On this plane, that corresponds to taking a line like this — is the slope one over lambda or something? Let's see — it's slope minus lambda. So, for example, if you fix lambda and then ask me to evaluate the dual, what you do is this: you fix a slope, and you march down this way until you just barely leave this set, and that would be right there. And when you work out what g of lambda is, it's this intercept here. So this is g of lambda, and now the dual problem says: optimize over all lambda. So if lambda is zero, you go down there, and g of zero is this number right here, which is indeed a lower bound on p star — it has to be. Now I crank up the slope, and as I crank up the slope, g rises, and it keeps rising until you just hit here — this point, right there. Now, as I keep increasing lambda, the contact point is here and this line is rotating around it — it's not a fixed point; it's rolling along the contact, and because the set has sharp curvature it's rolling just slightly along here — and as I increase lambda, g gets worse, and in fact if lambda is huge it looks like this and g is very negative. It's still a lower bound, just a crappy one. Everybody see this? So d star is that point. By the way, the difference between p star and d star is called the duality gap, or the optimal duality gap, for that problem. Question: what's the relation between the complexity of the dual and the primal? We're going to talk about that, but it depends very much on how they come. For example, with a nonconvex primal like the general two-way partitioning problem, which is NP-hard, the dual is an SDP, which is easy — so in that case they can be infinitely far apart. Now, in the case of a convex problem, it gets interesting.
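The "march down until you leave the set" picture has a direct computational reading: g(lambda) is the infimum, over the cloud of points (u, t) = (f1(x), f0(x)), of t + lambda·u — the intercept of the supporting line with slope -lambda. A plot-free sketch on the same made-up toy problem (minimize x² subject to 1 - x ≤ 0):

```python
import numpy as np

# Sample the "graph of the problem": (u, t) = (f1(x), f0(x)).
x = np.linspace(-3.0, 3.0, 60_001)
u, t = 1 - x, x ** 2

p_star = t[u <= 0].min()           # best objective among feasible points
for lam in [0.0, 1.0, 2.0, 5.0]:
    g_lam = (t + lam * u).min()    # intercept of the supporting line
    assert g_lam <= p_star + 1e-9  # every lambda >= 0 gives a lower bound
print("p* =", p_star)              # the bound is tightest near lambda = 2
```

Cranking lambda from 0 upward, g rises to p* and then falls off again, which is exactly the rolling-line behavior described above.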
In a convex problem — you will see later — they both solve the problem, and a lot of people get all excited: "oh, how cool, I can solve my problem via the dual." It turns out, if you really know what you're doing, the complexity of the primal and the dual are equal — and you will know what you're doing in about three or four weeks. Question: how did I rule out the bottom-right point for p star? Well, the first thing to remember is that this shows you the objective and the constraint function for every possible point in the domain. Now, that point is not good: for one thing it's got a high objective, but it's also infeasible. Anybody who landed on the right here is infeasible, so these points are all very interesting, but they're not relevant as far as the optimization problem is concerned. So we simply look at these: every point that got shaded in here is feasible, the height tells you the objective value, and so you want the lowest point among these — that's clearly right there, and you go across here, that's p star — because the first coordinate is your constraint function, and f1 has to be less than or equal to zero; that's what it means to be feasible. So that's the picture, and here you have a gap. By the way, this strongly suggests something very interesting, and you can see why convexity of the problem is going to come in: when f1 and f0 are convex, this weird set G — now, what I'm about to say is actually not true, but it's close to true — that weird set G is convex. When the set is convex, you can't have a gap; you have a gap here because this blob is nonconvex, so if this thing had to be convex you couldn't have a gap. Everybody see this? That's what's going to happen. Now, I'll tell you the truth: G is actually not convex, but its lower-left corner — which is what we care about — is. So now I've corrected it and told the truth. Okay. Now you can also, by the way, see
how Slater's condition works. So take not G but A — that's the set of points you can meet or beat in this bicriterion picture. Basically, you take A and color in all these points; you only care about the ones down here — this Pareto boundary — and the only optimal point is here. And now you can see A will actually be convex: if that's convex and that's convex, then A has to look like this. Slater's condition says that A extends a positive amount into the left side of the axis, and that rules out the pathologies. I can show you what nonconvex problems with positive duality gaps look like: they look like things where it goes along like this, and then right as you get to the axis it jumps up some amount — you can easily construct a pathology. But these are the kinds of things you would study in those whole courses on this topic. So that's the idea, and you can even sort of see how Slater's condition connects to all of this. Okay, I'm going to mention one more topic: complementary slackness. So let's assume that strong duality holds — and actually, I don't care if the problem is primal or feasible — okay, what I just said made no sense whatsoever; what I meant to say was, I don't care if the primal problem is convex. It just came out as a weird permutation. So: I don't care if the primal problem is convex. Of course the dual problem always is convex, period, so I can care about that all I like, but it's not relevant — it's always convex. So let's suppose strong duality holds, and let's suppose x star is primal optimal and lambda star, nu star are dual optimal. This is basically what it comes down to: it says that x star is an optimal point, and lambda star
The pair lambda star, nu star you can think of, then, as a certificate establishing the optimality of x star. By the way, we're going to use these ideas from now on; they're going to come up computationally. All algorithms are going to work this way, all modern methods. You haven't done it yet, but when you solve a problem, the solver doesn't just say "here's x" and make you trust the software; it doesn't work that way. All modern solvers, well, the ones we're going to study in this class, produce not just the solution but also return, no exceptions, a certificate proving that it's the solution, so you don't have to trust the implementation. This is going to be completely standard; these ideas are going to diffuse through everything we do. Okay, so basically you think of that as an optimal point, an optimal design, whatever you want to call it, together with a certificate proving it's optimal: because the dual value is a lower bound on p star, and the point is feasible and has objective value equal to this lower bound on p star, therefore it is p star, period. Okay, now this thing, by definition, is the infimum over all x of the Lagrangian with these optimal Lagrange multipliers. But if it's the infimum over all x, it's certainly less than or equal to its value when I plug in a particular x, and I'm going to choose to plug in x star. So I plug in x star and I have the following very interesting thing: it says f0 of x star is less than or equal to f0 of x star plus something, where every term in the inequality part is less than or equal to zero, and every term in the equality part is zero, so that part is not relevant. (We'll get to that.) Okay, so everything here is in place.
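The certificate idea can be checked by hand on a tiny problem. The following sketch is my own illustration, not from the lecture: minimize x squared subject to x at least 1, for which the dual function can be worked out in closed form.

```python
# Toy problem (an illustration, not from the lecture):
#   minimize    f0(x) = x^2
#   subject to  f1(x) = 1 - x <= 0      (i.e. x >= 1)
# Primal optimum, by inspection: x* = 1, p* = f0(x*) = 1.

def f0(x):
    return x ** 2

def f1(x):
    return 1 - x

def g(lam):
    # Dual function: g(lam) = inf_x [ x^2 + lam * (1 - x) ].
    # The unconstrained minimizer is x = lam / 2, which gives
    # g(lam) = lam - lam^2 / 4.
    return lam - lam ** 2 / 4

x_star, lam_star = 1.0, 2.0   # primal and dual optimal, worked out by hand

# Weak duality: every lam >= 0 gives a lower bound on p*.
for lam in [0.0, 0.5, 1.0, 2.0, 3.0]:
    assert g(lam) <= f0(x_star)

# The certificate: g(lam*) equals f0(x*). A feasible point whose
# objective value meets a valid lower bound must be optimal.
print(g(lam_star), f0(x_star))   # prints: 1.0 1.0
```

This is exactly the logic in the lecture: you don't have to trust whoever handed you x star, because lambda star lets you verify the lower bound yourself.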
Now wait a minute: if this thing is less than or equal to that, and that is less than or equal to this, and the two ends are the same, then they're all three equal, and we have no choice but to conclude that the sum of lambda i star times fi of x star is zero. But more than that: wait a minute, this is a sum of numbers, all of which are less than or equal to zero. If you have a sum of numbers which are each less than or equal to zero, and the sum is equal to zero, there's only one conclusion: every single one of those numbers has to be zero. And that says the following: lambda i star times fi of x star is actually equal to zero, for all i. That's known as complementary slackness, and what it means is the following. If you have any primal optimal point and any dual optimal point, the following must hold. I'll turn it around this way: if the optimal Lagrange multiplier is positive (it can only be positive or zero), then that constraint has to be tight, period. And if a constraint is loose at the optimal point, the corresponding Lagrange multiplier has to be zero. Okay, so this is going to have lots of implications, and it's going to tie in when we give other interpretations of what all this means, for example with these multipliers being prices. But we'll quit here for today.
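Complementary slackness can also be verified on a small worked example. This sketch is my own illustration, not from the lecture: a one-variable LP with one tight and one loose constraint, with the dual optimum worked out by hand.

```python
# Toy LP (an illustration, not from the lecture):
#   minimize    f0(x) = x
#   subject to  f1(x) = 1 - x <= 0    (x >= 1, tight at the optimum)
#               f2(x) = -x    <= 0    (x >= 0, loose at the optimum)
# By hand: x* = 1, and maximizing the dual (g(lam) = lam1 when
# lam1 + lam2 = 1, else -infinity) gives lam* = (1, 0), d* = 1 = p*.

x_star = 1.0
lam_star = [1.0, 0.0]
f = [lambda x: 1 - x, lambda x: -x]

for lam_i, f_i in zip(lam_star, f):
    # Complementary slackness: each product lam_i* * f_i(x*) is zero,
    # so a positive multiplier forces its constraint to be tight,
    # and a loose constraint forces its multiplier to be zero.
    assert lam_i * f_i(x_star) == 0.0

# The tight constraint (f1(x*) = 0) carries the positive multiplier;
# the loose one (f2(x*) = -1 < 0) has multiplier zero.
print([f_i(x_star) for f_i in f], lam_star)   # prints: [0.0, -1.0] [1.0, 0.0]
```

Note that complementary slackness only says the product is zero; it does not forbid a constraint being tight while its multiplier is also zero.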
Info
Channel: Stanford
Views: 105,570
Keywords: science, electrical, engineering, technology, convex, optimization, least-norm, affine, minimization, two-way, partitioning, Slater's, constraint, qualification, complementary, slackness, epigraph
Id: FJVmflArCXc
Length: 76min 30sec (4590 seconds)
Published: Tue Jul 08 2008