Jonathan Eckstein - The ADMM, Progressive Hedging, and Operator Splitting (and workshop welcome)

Video Statistics and Information

Captions
I'm Jonathan; by now you probably know that. Anyway, this is part of the DIMACS special focus on Bridging Continuous and Discrete Optimization. As you'll see on a later slide, it's actually a whole sequence of workshops, and this is the first one. It was organized by DIMACS, of course, and the Simons Institute, and the funding comes from NSF. The organizing committee is me; Farid Alizadeh, who is also from Rutgers and is taking his son to the LSAT right now, but he'll be here later today; Jean-Paul Watson from Sandia, who is here; and Dave Woodruff from UC Davis.

I'd like to thank the DIMACS folks who put in a lot of work to set this up: Tami Carpenter as the lead person; Linda Casals, who does all the scheduling; the website staff; Nicole Clark; Walter Morris, their systems person; and all the speakers who agreed to attend, including one who came on very short notice because we had a slot open up due to cancellations. A few people who I thought were going to be here are not going to be here: Damek Davis and another scheduled speaker both had serious personal issues, and we just heard this morning that the very last speaker on the schedule, Gabe Hackebeil, canceled. So we will make a little adjustment; we'll probably be able to get out of here a little earlier on Wednesday, and there will probably be a slight additional change to the schedule for Wednesday. Also, you don't have to do it before you give your talk or anything, but they want you to fill out the video release form, or you can formally refuse to release; you just need to tell them what you want to do.

This conference basically started with an email in December 2016 from Tami to Farid, and then from Farid to me. We put together a one-page proposal that went into a bigger proposal to NSF in January 2017; we got funded in September, and we're the first workshop. The idea of the workshop is to bring together some of you who are theoreticians of operator splitting and the ADMM and wonderful things like that, and also some practitioners who are using this type of methodology to solve real-world problems, particularly problems those algorithms weren't intended for, such as ones with discrete variables or nonconvex functions. A community has arisen of people using the progressive hedging algorithm to solve stochastic programs, which is a special case of operator splitting and the ADMM, but I think many of the people using it maybe didn't know that in the beginning. So I was trying to bring together people from those two communities. Please stick around for the whole workshop if possible: in the last talk (it was going to be at a different time, but it ended up being the last talk) you will see a guy from the astronomy department at Princeton picking apart overlapping images of distant galaxies using an operator splitting method, so that should be fun.

OK, that's basically the welcome. I'm actually about two minutes early to start the talk, but I can definitely use the time; my normal pace is one minute per slide. As for this talk: Patrick Johnstone, a postdoc who has been working with me for the last year and will work with me for another year, is going to talk about some of the recent things
I've done. What I thought I would do is give a talk that tries to get people on the same page: there may be operator splitting people here who don't know progressive hedging, and progressive hedging people who don't know operator splitting, so I thought I'd give everyone some context, which will hopefully make more of what comes later understandable to more people. I'll also throw in a few historical nuggets, at least from my perspective.

The first part: I got involved in this stuff in the late 1980s, and it was not very well known then. My advisor was Dimitri Bertsekas, and when I started working with him he had me work on preflow-push min-cost network flow algorithms, which he was very into at the time. It got a little uncomfortable, because there was a big priority fight going on about these algorithms (which, it later turned out, basically no one ever used, but that's a different topic); it was getting kind of nasty and I was sort of in the middle of it. I had also done an implementation and discovered that the algorithms as originally described were really slow. You could speed them up, but when you sped them up you were basically turning them into the Hungarian algorithm, which was already well known. Anyway, I was getting kind of disgusted with this line of research, so in early 1988 I went to my advisor and said, "I don't really want to work on this; here's an idea for something else I want to work on," some idea for doing the same kind of network flow computation but with quadratic perturbations and so forth. I went away for a week, came back, and he said, "John, I read your write-up. Your ideas are not very good." (I told this story in a talk in his honor at MIT, and he said, "Did I really say that?" He did say that.) "But they remind me of a topic that I'm wondering about." The topic was: what is the relationship between the standard augmented Lagrangian method, the proximal point algorithm, and the alternating direction method of multipliers (ADMM)? I knew about the augmented Lagrangian method; at that time I had no idea what the proximal point algorithm or the ADMM were. But he gave me some papers to read.

Here is the situation that existed at the time. The augmented Lagrangian method had been developed in the 1970s, and in 1976 Rockafellar showed that in the convex case it is an application of Martinet's proximal minimization algorithm to the dual. Then there was a paper in French in 1975 by Glowinski and Marrocco that actually had the ADMM in it, but only as a heuristic. They were using an augmented Lagrangian with an alternating minimization method to solve the subproblem, and they discovered by experimentation that the whole thing worked fastest if they just did one block minimization, then the other block minimization, and moved on without checking at all whether they had done anything to really minimize the subproblem. They found empirically that this worked, but they had no theory about it. So in 1988, when Dimitri handed me the first paper on this stuff that I saw, it was a very obscure method in the optimization community, and it continued to be so for another fifteen to twenty years before things started to change. Now, in 1983, an edited volume by Fortin and Glowinski came out, and it had two very
important papers in it. The first, by Fortin and Glowinski, was a convergence analysis of the ADMM using variational inequalities; it showed that what Glowinski and his collaborators had been doing earlier does actually converge. Then there was an alternative analysis by Gabay, the one that really grabbed me, showing that the ADMM is actually a dual application of a method called Douglas-Rachford splitting, which had been proposed only a few years earlier by Lions and Mercier. That is a splitting method for monotone operators; most of you here know what that is, but for the progressive hedging people I will explain it in a second. The Douglas-Rachford method itself was proposed way back, in the 1950s, and when it was originally proposed it was just for solving the heat equation on a two-dimensional grid; Douglas and Rachford never proposed it as anything more than a way of solving linear equations. Lions and Mercier generalized it massively, to nonlinear operators that could be set-valued and all sorts of things. We always call it Douglas-Rachford because that's what they called it, but we should probably call it Lions-Mercier. These two analyses in the 1983 volume are basically the foundation of every analysis of the ADMM that has come afterwards. They are very similar, and it took me a while to realize that they are not identical.

All right, this is technical segment one, and if you're an operator splitting person this is a good time to check email or whatever else you want to do, because you're going to know it very well; this is mostly for the progressive hedging people. Take a convex function, which may be infinity in places, so it can have constraints embedded in it. There is this notion of a subgradient: the subgradient set at a particular point is the set of all slopes of affine underestimates of the function that are tight at that point. The subdifferential is then the map that takes a point and returns the set of all subgradients at that point; it is a point-to-set map. It has the standard property of monotonicity: if y is a subgradient at x and y' is a subgradient at x', then the inner product of the differences is nonnegative. That is extremely easy to prove in this case: take an underestimate of the function at one point and go to the other point; then take an underestimate there, and you basically have to end up underneath where you were, and that's the proof. Any point-to-set map with this property is called a monotone operator, and not all monotone operators are subgradients of functions. It's called monotonicity because it generalizes the notion of a function on the real line being nondecreasing. If you take the graph of the point-to-set map (with vertical segments where it is multivalued), then in one dimension monotonicity just says that if you connect any two points on the graph, you can't get a negative slope.
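For reference, here are those two definitions in compact form; the notation (f for the convex function, T for a general point-to-set operator) is mine, chosen to match the narration:

```latex
% y is a subgradient of the convex function f at x if the affine function
% z -> f(x) + <y, z - x> underestimates f everywhere:
\partial f(x) \;=\; \bigl\{\, y \;:\; f(z) \,\ge\, f(x) + \langle y,\; z - x \rangle \;\;\forall z \,\bigr\}

% Monotonicity of a point-to-set map T (for example, T = \partial f):
y \in T(x),\;\; y' \in T(x') \;\;\Longrightarrow\;\; \langle\, y - y',\; x - x' \,\rangle \;\ge\; 0
```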
That leads immediately to a fact I call the representation lemma. Pick any positive constant c (it can be 1); then every point in the space can be expressed in at most one way as x + cy, where y is in T(x), the image of the operator at x. There is a picture proof of this, which is very simple: if I graph the line (or plane) of points with x + cy equal to a constant, I get something with negative slope, and it cannot intersect the graph of the operator in more than one place, because if it did I would be connecting two points on that graph by a segment of negative slope. In higher dimension, or infinite dimension, it's the same idea, and the proof is about two lines long: if two points both give decompositions with the same sum, then the inner product between the two differences has to be nonpositive, while monotonicity says it is nonnegative, so the only possibility is zero.

One step beyond that is Minty's theorem, which says: if the operator is maximal monotone, meaning that you cannot add any points to its graph without destroying the monotonicity property, then "at most one way" becomes "exactly one way"; every point in the space can be decomposed as x + cy with y in T(x). I think of this as a generalization of the fact that, given a linear subspace, every point decomposes into a point in the subspace plus a point in the orthogonal complement; any maximal monotone operator gives you a decomposition of every point in the space in this way. For example, if the convex function is lower semicontinuous, which is a very weak condition (basically any non-disgusting function is lower semicontinuous), then the graph of its subdifferential is maximal and you're fine.

Now, for the progressive hedging people, here is a simplified way to think about this that leads into the whole operator splitting idea. Take a constant c and an operator T, and define a reflection map as follows: given a point t, I know that if T is maximal monotone I can express t uniquely as x + cy with y in T(x); the map returns x - cy. If T were the decomposition into a subspace and its orthogonal complement, this would be reflection through the subspace. It turns out this map is nonexpansive, that is, Lipschitz continuous with modulus one, and it is defined everywhere if T is maximal. Its fixed points are what are called the zeros, or roots, of the operator: the places where zero is in the image, which, if T is a subdifferential, are the minimizers of the function. The proof of nonexpansiveness is really simple: take the difference of two reflections and expand it, take the difference of the two arguments and expand it, and the two expansions differ by four c times an inner product that monotonicity says cannot be negative; so the difference of reflections is less than or equal to the difference of arguments. Nonexpansive, done.
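In symbols, with the same c and T (my condensed restatement of the lemma and the map just described):

```latex
% Representation lemma + Minty's theorem: for c > 0 and T maximal monotone,
% every point t decomposes uniquely as
t \;=\; x + c\,y, \qquad y \in T(x)

% The reflection map returns the mirror image of that decomposition:
N_{cT}(t) \;=\; x - c\,y

% N_{cT} is nonexpansive, and its fixed points are exactly the points t = x
% with 0 \in T(x), i.e. the zeros of T (minimizers, when T = \partial f).
```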
The fixed point property is basically this: at a fixed point, x + cy must equal x - cy, which means y = 0, and then x = t, and you're done. Very simple.

This suggests an algorithm. Say I want to find a root, a zero, of this operator; I have a nonexpansive map whose fixed points are exactly the points I want, so I could just try iterating it. But the Lipschitz constant is one. If it were strictly less than one, the iterates would converge linearly; at exactly one, they could just orbit, never getting farther away but never getting closer either. The amazing thing, first discovered in 1955, is that if you're trying to find a fixed point of a map with Lipschitz constant one, you can still essentially iterate it: you blend in a bit of the identity, and it works. This is called a Krasnosel'skii-Mann iteration; I think Krasnosel'skii probably published it first. For historical reasons, I write the blending factor as a number between zero and two and then divide by two. If I set the blending factor to one, I am iterating one half the identity plus one half the reflection map, and that converges (in infinite dimension, weakly). That half-and-half map is called the prox map, and you can think about it this way: give it t; it decomposes t into x + cy with y in T(x), but instead of x - cy it just returns x. It is also called the resolvent. Rockafellar's proximal point algorithm is basically just iterating this map, except that you're allowed to vary c as you go; you just can't let it go to zero. And it can be generalized in many ways: you can allow inexact computation as long as you're not too far off, you can vary the mixing factors as you go, and so on.

The question is: fine, but is this anything you could actually do? The problem is that the prox map is not always easy to compute. For a general function, the way you compute the prox map is to minimize the function plus a quadratic perturbation, which is probably just as hard as minimizing the original function. (This calculation goes back to Martinet, or even earlier, I believe.) But Rockafellar in 1976 had this insight: when the function comes from the dual of some optimization problem with constraints, computing the prox map is not really any harder than computing the original dual function; and he also found that the augmented Lagrangian method is basically just this iteration applied to the dual.
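To make this concrete, here is a tiny runnable sketch, my own toy example rather than anything from the slides: a Krasnosel'skii-Mann iteration on the reflection map of T = the subdifferential of f(x) = |x|, whose prox map has the familiar soft-threshold closed form.

```python
import numpy as np

# Toy example: find a zero of T = subdifferential of f(x) = |x| (the zero is
# x = 0, the minimizer of f). The prox map has the "soft threshold" form.
def prox(t, c):
    return np.sign(t) * np.maximum(np.abs(t) - c, 0.0)

def reflection(t, c):
    # N_{cT} = 2*prox - I: decompose t = x + c*y with y in T(x), return x - c*y.
    return 2.0 * prox(t, c) - t

c, rho = 1.0, 1.5      # prox constant, and a blending factor in (0, 2)
z = 10.0               # arbitrary starting point
for _ in range(60):
    # Krasnosel'skii-Mann step: blend the identity with the nonexpansive map.
    # With rho = 1 this is exactly iterating the prox map (proximal point).
    z = (1.0 - rho / 2.0) * z + (rho / 2.0) * reflection(z, c)
print(z)               # -> essentially 0.0
```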
Now, this is getting kind of long-winded, and I know a lot of you know this already, but I want to talk about operator splitting. The idea of operator splitting is this: suppose we want to find a zero of the sum of two operators. That basically means finding some x where a y is in the image of one operator at x and -y is in the image of the other, so that they cancel out; it is equivalent, for example, to minimizing the sum of two nonsmooth convex functions. Suppose I can do these prox or reflection operations on the individual operators, but it is too hard to do them on the sum; there is some special structure there. Can you converge to a solution only through evaluations of the individual maps? The answer is yes.

The canonical way to do it: I have one nonexpansive map that goes with the first operator and another that goes with the second. A simple observation is that composing two nonexpansive maps gives another nonexpansive map, and in this case the composition is convenient, because its fixed points turn out to have exactly the right form: they are points x + cy where y is in T2(x) and -y is in T1(x), so if you have a point like that, you have a solution. The proof is pretty straightforward; it's like two equations in two unknowns. So we have a nonexpansive map whose fixed points basically give us a solution to our problem; let's iterate it. Its Lipschitz constant might be one, so we iterate it blended with the identity, a little bit or a lot. The case where you blend half-and-half with the identity is Douglas-Rachford splitting. If you don't blend in any of the identity, it is called Peaceman-Rachford splitting, which in general doesn't converge; it converges under some additional assumptions. There are other schemes like this: forward-backward splitting, where you just evaluate one operator directly rather than doing a prox step, and then do a prox step on the other operator; Tseng's forward-backward-forward method; projective splitting; and some other things along these lines. But this is the basic idea.

So now we get to this thing called the ADMM. I have two functions f and g on different spaces and a linear map M between them, and I want to minimize f(x) + g(Mx); I write it with a Lagrangian. The idea is that if I take the dual of this problem, the dual function splits nicely into a sum of two different functions of the Lagrange multiplier (they're concave; flip signs and they're convex). What Gabay figured out in 1983 is that if you take the problem of minimizing the sum of those two pieces of the dual function and apply the Douglas-Rachford splitting idea to it, out comes the algorithm called the alternating direction method of multipliers: you minimize the augmented Lagrangian with respect to one block of variables, then with respect to the other block, then update the multiplier. This is exactly the procedure Glowinski and Marrocco had found heuristically in 1975, but now we know it converges.

There is an interesting subtlety here. There is also the Fortin-Glowinski proof of the ADMM, which I originally thought was just a more complicated version of the Gabay proof, but it turns out it isn't. It is really based on writing the Lagrangian and splitting it into two pieces; Lagrangians also have subgradients in a certain sense, and those subgradients are also monotone, and you can use that property to get what is called Fejer monotonicity: you're always getting closer to the set of solutions, or at least not getting farther from it. The fixed points of the algorithm look like an optimal Lagrange multiplier, optimal z variables, and the scaling factor c, and the iterates are Fejer-monotonically drawn toward points of that form.
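Here is the same kind of toy sketch for Douglas-Rachford splitting; again this is my own minimal example, not from the talk, with f(x) = |x| and g(x) = (1/2)(x - 3)^2, so the true minimizer of f + g is x = 2:

```python
import numpy as np

# Toy Douglas-Rachford run: minimize f(x) + g(x) using only the individual
# prox maps. Here f(x) = |x| and g(x) = 0.5*(x - 3)**2; setting
# x - 3 + sign(x) = 0 for x > 0 gives the minimizer x = 2.
c = 1.0

def prox_f(t):                       # prox of c*|x|: soft threshold
    return np.sign(t) * np.maximum(np.abs(t) - c, 0.0)

def prox_g(t):                       # prox of c*0.5*(x-3)^2: closed form
    return (t + c * 3.0) / (1.0 + c)

z = 0.0
for _ in range(100):
    x = prox_f(z)                    # prox step on the first operator
    y = prox_g(2.0 * x - z)          # prox step on the second, at the reflection
    z = z + y - x                    # equals the half-and-half identity blend
print(prox_f(z))                     # x-iterate converges to 2.0
```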
The weird thing is that the Gabay (Lions-Mercier) analysis looks almost the same, except that instead of an ordered pair you have a sum; it lives in a lower-dimensional space, and it turns out not to be exactly the same. For example, if you include over-relaxation, meaning that in the Gabay form the mixing factor is not always one half, you actually get slightly different algorithms from the two analyses. So it's like two families of very similar algorithms: they intersect, and are identical, when rho equals 1, that is, when the mixing factor is 1/2, but otherwise they are slightly different. They resemble one another, but they are not exactly the same.

OK, now I have about nine minutes to go, and that is the end of the background for those who might not have had it. Now I want to talk about progressive hedging. This is a big-time application of operator splitting, but a lot of the math people in the operator splitting world don't necessarily know this algorithm, so I'm going to introduce it. Here is what I'm going to do: I take my set of variables and chop it into P blocks. On each block i I have some convex function f_i, and also a linear map M_i. I look at the following problem: minimize the sum of all these functions (a separable sum; the x_i are different subvectors), subject to the constraint that when I apply the linear maps blockwise, the resulting vector in the product space has to lie in some arbitrary subspace V. This V is the only thing that links everything together.

If you apply the ADMM to this setup, f is the separable sum, g is the function where you feed it an argument and it returns 0 if the argument is in the subspace V and infinity if it is not, and M is the blockwise linear operator. With this setup, the problem on the last slide is exactly in ADMM format: minimize f(x) + g(Mx). If you put that into the ADMM and turn the crank, you get the following algorithm: minimize each of the functions f_i with a certain quadratic perturbation; apply the linear map to what you get and project it onto the subspace (this is actually minimizing the function g, and it gives you z); then do a multiplier update. As an aside, any time you're doing the ADMM with an indicator function of a subspace like this, you are applying Spingarn's method of partial inverses, which you can look at as a special case of Douglas-Rachford splitting.
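Written out, in my notation and with one possible scaling of the updates (the slides may scale the multiplier differently), the setup and the crank-turning look roughly like this:

```latex
% Separable objective, linked only through a subspace constraint:
\min_{x_1,\dots,x_P} \;\; \sum_{i=1}^{P} f_i(x_i)
\quad\text{subject to}\quad (M_1 x_1, \dots, M_P x_P) \in V

% This is  min f(x) + g(Mx)  with  g = \iota_V , the indicator of V.
% ADMM, with the multiplier w initialized in V^{\perp}:
x_i^{k+1} \;=\; \operatorname*{arg\,min}_{x_i}\;
   f_i(x_i) + \langle w_i^k,\, M_i x_i \rangle
   + \tfrac{c}{2}\bigl\| M_i x_i - z_i^k \bigr\|^2

z^{k+1} \;=\; \operatorname{proj}_V\!\bigl( M x^{k+1} \bigr), \qquad
w^{k+1} \;=\; w^k + c\,\bigl( M x^{k+1} - z^{k+1} \bigr)
```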
So now, what does all that have to do with stochastic programming? Here is a scenario tree, say with S stages and P last-stage scenarios. Here is what I'm going to do: I take the decision variables at each stage and replicate them, so at the root I have P copies; in fact I have P copies at every stage. Those are the x variables; the z variables are the same, except that the last stage is not included. Now I formulate a linear subspace that says all the first-stage variables have to be the same, all second-stage variables that correspond to the same node of the tree have to be the same, and so on. This is called, with one of the worst words in the English language, nonanticipativity. Then I define f_i as follows: h_i is the cost function for scenario i, supposing I could see the entire future, and infinity if I do something infeasible; I weight that by the probability of the scenario to get f_i. M is just the matrix that drops the last-stage variables from a scenario. The stochastic program is then exactly in the form we just discussed. If you apply the ADMM to it, you get the following algorithm: take every scenario individually, as if you could see the entire future, and minimize its cost with a quadratic perturbation; then average all the variables that are supposed to be the same in order to be nonanticipative, that is, not to see the future; then do a multiplier update on the constraint that everything has to be the same if you can't see the future. Now, this isn't exactly what Rockafellar and Wets proposed, because they used a probability-weighted inner product, which gives you the same algorithm except that the averages are probability-weighted.
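As an illustration, here is a minimal runnable sketch of that loop, in the probability-weighted form just mentioned; the two-scenario quadratic model (data d, probabilities p) is a made-up toy, not anything from the talk:

```python
import numpy as np

# Toy two-scenario progressive hedging sketch. Scenario i has clairvoyant
# cost h_i(x) = 0.5*(x - d[i])**2 with probability p[i]; nonanticipativity
# forces one common first-stage decision, so the optimum is the
# probability-weighted mean of d.
d = np.array([1.0, 5.0])     # per-scenario data ("the future")
p = np.array([0.3, 0.7])     # scenario probabilities
c = 1.0                      # penalty / prox constant
w = np.zeros(2)              # multipliers on nonanticipativity (p @ w stays 0)
z = 0.0                      # current consensus (nonanticipative) value

for _ in range(50):
    # 1) Solve each scenario as if clairvoyant, with quadratic perturbation:
    #    minimize h_i(x) + w_i*x + (c/2)*(x - z)^2  (closed form here).
    x = (d - w + c * z) / (1.0 + c)
    # 2) Project onto the nonanticipativity subspace: weighted average.
    z = float(p @ x)
    # 3) Multiplier update on the "everyone must agree" constraint.
    w = w + c * (x - z)

print(z)                     # -> 3.8 = 0.3*1 + 0.7*5
```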
So, closing up: almost everything I talked about was known by 1990. Not necessarily by the same people, but everything was known by 1990. I did this theory in my dissertation and thought it was cool, and I tried some applications, but they were too standard, classic OR-type optimization problems like min-cost flow, and the methods didn't work very well on them, so I went and did other things, and progressive hedging didn't really catch on for a while. But go forward twenty years: people found out the ADMM works really nicely for things like l1-regularized quadratic problems, and there is this nice PySP system for modeling stochastic programs, with a nice Python modeling language and a good implementation of progressive hedging. So now these algorithms are popular again.

But there are some issues. One thing the progressive hedging people really like to do is work on real problems that have discrete variables, and it is very easy to just heuristically apply progressive hedging to problems like that; it is much more tractable than, say, forming a giant LP or a giant MIP, feeding it to Gurobi, and crossing your fingers. The theory doesn't really tell you much about what happens there, but some work has been done recently on making something like that more than a heuristic, and you'll see in Jim Luedtke's talk that it can give you some nice Lagrangian bounds; we can apply these ideas in other contexts as well. Nonconvexity is another important area: what happens when the problem is nonconvex; can the ADMM and things like it be more than a heuristic there? Here is a standard example of a problem people have been asking about, dictionary learning: I have some matrix D and I want to approximate it as the product of two lower-dimensional matrices, maybe with some regularizers. It is very attractive to apply the ADMM, because each subproblem is convex, but the overall problem is not convex. Engineers and machine learners really want to solve problems like this, so that's another interesting direction. And that's basically it: we're going to hear about things like that, and about many other things that have been done on the convex theory side as well. Let's get started.

[Applause]
Info
Channel: DIMACS CCICADA
Views: 1,213
Rating: 5 out of 5
Id: T-JUf23khZU
Length: 34min 1sec (2041 seconds)
Published: Tue Aug 14 2018