Algebraic Multigrid I Ulrike Meier Yang, LLNL

Captions
So let me first give you a quick outline of what I want to talk about. First of all, why do we even look at multigrid methods? You have heard people mention them several times today, so it is obviously an important algorithm. I will give a brief overview of the algebraic multigrid software that is available within the packages mentioned today, then I will talk about how multigrid actually works, and then about the hypre software library, because that is what I work on, and in particular about its interfaces and solvers. To give you a little preview, the main things I want you to take away about hypre are: (a) we really focus on multigrid solvers (there are a few other solvers in there as well, but multigrid solvers are our focus), and (b) we provide different interfaces, and I will explain why. I will also say a bit about the solvers and data structures, and then about the complexities that occur particularly in unstructured algebraic multigrid solvers, why it is important to know about them if you use algebraic multigrid, and what to do about them. Finally I will get to the hands-on exercises, where I will try to demonstrate a few of the things mentioned here.

OK, so first: why multigrid methods? The reason multigrid methods are used is that they are optimal methods, meaning that in the ideal case you can solve a system of size n in O(n) operations. Because of this they have very good scaling potential. The idealized picture is a weak scaling study like the one in this sketch: you increase the problem size along with the number of processors, and what you would like to see is that the time stays roughly constant. For Krylov solvers, which we heard about this morning, what generally happens when you scale up is that the number of iterations goes up; even though each individual iteration might still be scalable (it does not always have to be, but usually it is), you get this kind of growing behavior. With multigrid, however, the number of iterations can be controlled; in the ideal case it does not increase at all, and in practice it increases only a little. That is the reason we look at multigrid methods.

OK, so there are various algebraic multigrid software packages available. In Trilinos, for example, there is ML, their older package, and the newer MueLu; both are still available, and in some cases one works better than the other, depending on your problem. PETSc has GAMG, which Barry already demonstrated this morning. Then of course there is hypre, and I should say that the unstructured hypre AMG you saw, BoomerAMG, can be used through PETSc and through Trilinos as well. There are also other multigrid solvers for special problems, like Maxwell equations, H(div) problems, and so on, but I will focus on the more basic multigrid solvers in this short presentation. I should say that all of these packages provide somewhat different flavors of multigrid; for certain problems some of them work better than others, and given the right problem each of them provides very good performance, so there is a reason for all of them to exist. And of course hypre is my thing, so I will focus on hypre here.
OK. hypre provides both structured and unstructured multigrid solvers, and our solvers are used in many, many applications; here are just a few examples, which I will not go into for lack of time. Since it is open source, it is used all over the world, like a lot of the other software packages. And here is the example I was asked about earlier on the panel. It shows that we really can get excellent weak scaling and parallelization properties. This was run on Sequoia, scaling up to 1.25 million cores and using up to 4.5 million-way parallelism, because we used OpenMP. The curves are different combinations of MPI and OpenMP; the black line, which gave us the best performance, used 16 cores per node with 4 MPI tasks and 4 OpenMP threads underneath. You generally see some increase at first, because when you add more communication things do get more expensive, but the key is that the time stays pretty flat once you get past a certain point, and we do see that here.

OK, so how does multigrid work? Multigrid started, of course, with geometric multigrid, and the idea was the following. If you have a problem, in this case a 2D problem, and you look at the error on the fine grid, you see something rough like this. When you apply a few iterations of a simple smoothing method, for example the Jacobi or Gauss-Seidel method, you very quickly smooth the error out. But if you try to solve the whole problem that way, it takes a very long time; you need many iterations to damp the remaining smooth error. The brilliant idea behind multigrid was essentially to say: the smooth error can be approximated on a coarser grid, so let's go to a coarser grid and smooth it down there. Again we do just one or two smoothing iterations, and we keep going down, potentially until we are at just one point or a few points, until the system is small enough that we can just solve it. Then we go back up using an interpolation method. That is the typical multigrid V-cycle; there are different cycle types, but here we just focus on the V-cycle.

In the case of algebraic multigrid we do not actually have grids; we only have a linear system, a matrix, so we can only use the matrix coefficients. But we still follow essentially the same idea; we just have to work with the algebraic error. Algebraic multigrid has two building blocks: it consists of a setup phase and a solve phase. In the setup phase we first select the coarse grids. Obviously we do not have any grids, just matrix coefficients, so how do we do this? We generate a graph: we look at the graph of the matrix, which is determined by the connections between points, and we may eliminate connections we consider not so important based on the size of the coefficients. There are different coarsening algorithms; I do not have time to get into them, but the main idea is that we are trying to select the variables for the next coarser level.
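To make "based on the size of the coefficients" a bit more concrete: one widely used criterion, the classical Ruge-Stueben strength measure (the talk does not commit to a specific choice, so take this as a representative example), says that point i depends strongly on point j if

```latex
-a_{ij} \;\ge\; \theta \, \max_{k \neq i} \left( -a_{ik} \right),
\qquad 0 < \theta \le 1 ,
```

where the signs assume an M-matrix-like problem with negative off-diagonals, and theta is a strength threshold, often around 0.25. Coarse variables are then chosen from the graph of these strong connections only, which is how the small coefficients get ignored during coarsening.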
Then we define an interpolation; again, we have to make sure that the interpolation is good enough, so that it interpolates the error sufficiently well when we come back up, and again there are different versions; we have various interpolations available. Then we define the restriction. Since very often we solve symmetric positive definite systems, it makes sense to define the restriction as just the transpose of the interpolation operator. That works out nicely, because when you define your coarse-grid operator as the triple matrix product RAP, your new matrix is symmetric positive definite again. I should say that even when you have a slightly nonsymmetric problem, this approach still works very well; you could of course also define a different restriction. Finally there is the solve phase, which looks very much like the geometric multigrid case: we do a relaxation, compute a residual, and restrict it down, and then we get a new linear system, which is essentially the error equation; we solve for an estimate of the error in terms of the residual. Of course, as stated this is just a two-grid approach; we can apply the same approach recursively on the coarse level. Then we interpolate back up, and so on. If you look at this, you see that in the solve phase we mainly have matrix-vector multiplications (it also depends on what kind of smoother we choose), so it is really important to have a good implementation and a good data structure for the matrix in order to get good performance.
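To make the cycle structure concrete, here is a minimal, self-contained toy: a geometric V-cycle for the 1D Poisson problem with weighted Jacobi smoothing, full-weighting restriction, and linear interpolation. These are standard textbook choices chosen for illustration; this is not hypre code and not AMG (on a regular grid no coarse-point selection is needed), and all names here are mine.

```c
/* vcycle_demo.c: toy geometric V-cycle for -u'' = f on (0,1) with
 * zero Dirichlet boundary conditions; n interior points, n = 2^k - 1. */
#include <stdio.h>
#include <stdlib.h>

/* A few sweeps of weighted Jacobi: u <- u + (2/3) D^{-1} (f - A u). */
static void smooth(int n, double h, double *u, const double *f, int sweeps)
{
    double *unew = malloc(n * sizeof(double));
    for (int s = 0; s < sweeps; s++) {
        for (int i = 0; i < n; i++) {
            double ul = (i > 0)     ? u[i - 1] : 0.0;
            double ur = (i < n - 1) ? u[i + 1] : 0.0;
            unew[i] = 0.5 * (ul + ur + h * h * f[i]);   /* Jacobi update */
        }
        for (int i = 0; i < n; i++)
            u[i] += (2.0 / 3.0) * (unew[i] - u[i]);     /* weight 2/3 */
    }
    free(unew);
}

/* r = f - A u for the standard 3-point Laplacian. */
static void residual(int n, double h, const double *u, const double *f,
                     double *r)
{
    for (int i = 0; i < n; i++) {
        double ul = (i > 0)     ? u[i - 1] : 0.0;
        double ur = (i < n - 1) ? u[i + 1] : 0.0;
        r[i] = f[i] - (2.0 * u[i] - ul - ur) / (h * h);
    }
}

/* One V-cycle: pre-smooth, restrict the residual, recurse on the
 * error equation A_c e_c = r_c, interpolate and add the error, post-smooth. */
static void vcycle(int n, double h, double *u, const double *f)
{
    if (n <= 3) { smooth(n, h, u, f, 50); return; }  /* coarsest "solve" */
    smooth(n, h, u, f, 2);                           /* pre-smoothing */

    double *r = malloc(n * sizeof(double));
    residual(n, h, u, f, r);

    int nc = (n - 1) / 2;                            /* coarse interior points */
    double *rc = calloc(nc, sizeof(double));
    double *ec = calloc(nc, sizeof(double));
    for (int j = 0; j < nc; j++)                     /* full weighting */
        rc[j] = 0.25 * r[2*j] + 0.5 * r[2*j + 1] + 0.25 * r[2*j + 2];

    vcycle(nc, 2.0 * h, ec, rc);                     /* error equation */

    for (int j = 0; j < nc; j++) {                   /* linear interpolation */
        u[2*j]     += 0.5 * ec[j];
        u[2*j + 1] += ec[j];
        u[2*j + 2] += 0.5 * ec[j];
    }
    smooth(n, h, u, f, 2);                           /* post-smoothing */
    free(r); free(rc); free(ec);
}

int main(void)
{
    int n = 127;                                     /* 2^7 - 1 points */
    double h = 1.0 / (n + 1);
    double *u = calloc(n, sizeof(double));
    double *f = malloc(n * sizeof(double));
    double *r = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) f[i] = 1.0;
    for (int cyc = 0; cyc < 10; cyc++) {
        vcycle(n, h, u, f);
        residual(n, h, u, f, r);
        double rmax = 0.0;
        for (int i = 0; i < n; i++)
            if ((r[i] > 0 ? r[i] : -r[i]) > rmax) rmax = r[i] > 0 ? r[i] : -r[i];
        printf("cycle %d  max residual %e\n", cyc + 1, rmax);
    }
    free(u); free(f); free(r);
    return 0;
}
```

Each V-cycle reduces the residual by a roughly constant factor independent of n, which is exactly the O(n) optimality argument from the beginning of the talk; in AMG, the smoothers, transfer operators, and coarse levels are all built from the matrix as described above, while this toy only illustrates the cycle and the error-equation recursion.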
OK, let's go back to hypre now. As I said before, hypre has several different interfaces. We call them conceptual interfaces because we wanted them to match the way a user looks at a linear system, and you can look at a linear system in many different ways; it does not just have to be a matrix in the row-and-column sense. You could have a grid with stencils associated with it, you could come from a finite element approach, and so on. Beneath the interfaces we have the different solvers, as you can see here, and they can be reached through different interfaces. For example PFMG, our fastest multigrid solver, can be used through the structured interface, and that is pretty much the only place from which you can call it. But our unstructured multigrid solver, BoomerAMG, which comes more out of the linear algebra view, can be called through all the interfaces, because it has the most general data structure underneath. Based on this we have different data layers, depending on where the problem comes from. So why would we want these different interfaces? First, as I said, we want to provide the natural view, so that users can describe the system the way they are used to. We also want to make sure that if you start from a grid with a stencil, you do not have to figure out how to map it to rows and columns; of course, if you start with the linear algebra interface, you have to do that yourself. And, a really important point, it also helps us provide more efficient solvers: if we have a structured interface, a structured solver can take advantage of the structure, and we can provide better data storage schemes and better computational kernels.

So hypre supports the following interfaces. First there is a structured (Struct) interface, which essentially is good for logically rectangular grids. Then there is a semi-structured (SStruct) interface, probably our most complicated one, which is for grids that are mostly structured: for example a block-structured grid, where several structured grids are put together, adaptive mesh refinement grids, or overset grids; and we now also have a way to generate finite element grids through the semi-structured interface. Finally, we have a linear-algebraic (IJ) interface, which I think is the one used by most people.

OK, so let's go to the solvers you can use from the structured interface, our structured solvers. We have SMG and PFMG; they are semicoarsening multigrid methods, so they are algebraic multigrid methods in a way, but they are structured, and they actually existed before our unstructured one. SMG is a very robust solver; it uses a plane smoother, which is why it is expensive, but that is also why it is so robust. We actually have users with structured problems for whom SMG is the only multigrid solver that works; even our unstructured multigrid solver, which generally works for a broader class of problems, does not work for them, so SMG is very useful to them. I will not look at it today; however, if you want to play more, you will be able to put SMG into the example as well. PFMG is our most efficient solver; it uses simple pointwise smoothing, and because of that it is less robust, but if you have the right problem it can solve things very fast. There are also constant-coefficient versions of PFMG, so if you really have a constant-coefficient stencil, that saves you a lot of memory and is more efficient. The example problem we look at in the hands-on is a variable-coefficient problem, so we will not try that either.

OK, so what about the data structure underneath? As I said before, the structured grid is appropriate for scalable applications with structured grids and a fixed stencil pattern. You have a global n-dimensional index space, which you can see in the figure; in this case it is 2D, so the indices are tuples. You define a box, which is just a collection of cell-centered indices described by its lower and upper corners, and the struct grid is a collection of such boxes. You also need a stencil in order to define your matrix coefficients; this looks like a constant-coefficient stencil here, but you can of course define a variable stencil with actual coefficients. So for the actual data structure, the user defines the grid boxes and the stencil; in this case we have two boxes, laid out as in the figure.
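For a two-box grid like the one in the figure, the Struct interface calls look roughly like this. The sketch follows the pattern of hypre's documented example codes; the box extents and coefficient values here are illustrative, not the hands-on problem.

```c
/* struct_sketch.c: a two-box 2D grid with a 5-point stencil through
 * hypre's Struct interface; compile against hypre, run on 1 MPI rank. */
#include <mpi.h>
#include "HYPRE_struct_ls.h"

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    /* The grid: a collection of boxes, each given by its lower and
     * upper corner in the global 2D index space. */
    HYPRE_StructGrid grid;
    HYPRE_StructGridCreate(MPI_COMM_WORLD, 2, &grid);
    int ilo0[2] = {-3, 1}, iup0[2] = {-1, 2};   /* box 0: 3 x 2 cells */
    int ilo1[2] = { 0, 1}, iup1[2] = { 2, 4};   /* box 1: 3 x 4 cells */
    HYPRE_StructGridSetExtents(grid, ilo0, iup0);
    HYPRE_StructGridSetExtents(grid, ilo1, iup1);
    HYPRE_StructGridAssemble(grid);

    /* The 5-point stencil, given as offsets from the center cell. */
    HYPRE_StructStencil stencil;
    HYPRE_StructStencilCreate(2, 5, &stencil);
    int offsets[5][2] = {{0,0}, {-1,0}, {1,0}, {0,-1}, {0,1}};
    for (int e = 0; e < 5; e++)
        HYPRE_StructStencilSetElement(stencil, e, offsets[e]);

    /* The matrix: one value per (cell, stencil entry) pair, set box by
     * box, so coefficients can vary from cell to cell. */
    HYPRE_StructMatrix A;
    HYPRE_StructMatrixCreate(MPI_COMM_WORLD, grid, stencil, &A);
    HYPRE_StructMatrixInitialize(A);
    int stencil_indices[5] = {0, 1, 2, 3, 4};
    double values[5 * 12];          /* sized for the larger (12-cell) box;
                                       the smaller box reads its first 30 */
    for (int i = 0; i < 5 * 12; i += 5) {
        values[i] = 4.0;                          /* diagonal entry */
        for (int j = 1; j < 5; j++) values[i + j] = -1.0;
    }
    HYPRE_StructMatrixSetBoxValues(A, ilo0, iup0, 5, stencil_indices, values);
    HYPRE_StructMatrixSetBoxValues(A, ilo1, iup1, 5, stencil_indices, values);
    HYPRE_StructMatrixAssemble(A);

    /* Vectors are set up the same way with the HYPRE_StructVector calls,
     * and a solver such as PFMG (Create/Setup/Solve) is then applied;
     * see hypre's example codes for the complete sequence. */

    HYPRE_StructMatrixDestroy(A);
    HYPRE_StructStencilDestroy(stencil);
    HYPRE_StructGridDestroy(grid);
    MPI_Finalize();
    return 0;
}
```

Because values are passed per (cell, stencil entry) pair, variable coefficients are handled by this same call; the constant-coefficient support mentioned above has its own entry points in the Struct interface.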
Underneath, a data space is generated for the data structure, which is the grid boxes plus ghost layers around them. The ghost layers are important because they transfer information across the boxes; they are necessary to communicate between boxes, even within one process, but also across processes. The actual data is then stored as follows: for each box, all values belonging to one stencil entry are stored contiguously. That is how the data for this five-point stencil would be laid out. One consequence is that, because the data space includes the ghost layers, when you loop over a box you do not need any special treatment at the boundaries, which makes the loops more efficient; there are no ifs and whatnot, you just apply the operation to the whole box. The other nice thing is that if you think about the stencil entries in terms of matrix rows and columns, it turns out they correspond to the diagonals of the matrix, and for those of you who have looked into GPUs, diagonal storage is considered one of the best matrix formats on a GPU, because it gives you the best vectorization. So this is a very efficient data storage scheme.

OK, now let's turn to our unstructured solver, BoomerAMG, which is probably our most used multigrid solver. It works on general matrices, and it has a lot of different parameters that you can set; I know some people are not very happy about this. We have now tried to change the defaults to ones that work pretty well in general, although you might still have to adjust something; as a matter of fact, the current default for the problem I am going to show you is not the best one, so I will show you how to improve it for this case. Still, we try to make it so that users do not have to play too much with the many options that are in there, and I do think the current default coarsening and interpolation scheme is a very good one. It has some other options which we are not going to talk about either; and yes, it can also solve systems of PDEs, but you do need to provide additional information, as I said before.

OK, so let's look at the data structure underneath. I should thank Sherry here, because she already introduced the CSR data structure (or CRS, as she called it), so you already know what compressed sparse row storage is. Our matrix type is called ParCSR because it is based on the CSR data structure. Essentially we distribute the matrix across processes in row chunks, so process 0 holds the first block of complete rows, process 1 the next, and so on. On each process the local block then consists of two CSR matrices. The first contains all the local coefficients, that is, the nonzero coefficients (it is not a dense matrix, only the nonzeros are stored) whose columns refer to the locally owned rows. The second contains all the remaining coefficients, the ones that connect to rows on other processes. And because we want to compress that second part as well, we also need a mapping between its local column indices and the global column numbers.
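As a rough picture, here is a simplified sketch of the ParCSR idea and its matrix-vector product. This is my illustration of the layout just described, not hypre's actual internal types or field names.

```c
/* Each process owns a contiguous block of rows, split into a "diag"
 * CSR part (columns owned locally) and an "offd" CSR part (columns
 * owned by other processes), plus a map from offd's compressed local
 * column indices back to global column numbers. */
typedef struct {
    int     nrows;           /* rows owned by this process            */
    int    *diag_i, *diag_j; /* CSR row pointers / local col indices  */
    double *diag_a;          /* nonzeros with locally owned columns   */
    int    *offd_i, *offd_j; /* CSR part for columns owned elsewhere  */
    double *offd_a;
    int     num_cols_offd;   /* distinct external columns referenced  */
    int    *col_map_offd;    /* offd local col -> global col number   */
    int     first_row;       /* global index of this process's row 0  */
} ParCSRSketch;

/* y = A x: the diag part uses the local piece of x; the offd part
 * multiplies values received from neighboring processes (x_ext must be
 * gathered by communication before this call). */
void parcsr_matvec(const ParCSRSketch *A, const double *x_local,
                   const double *x_ext, double *y)
{
    for (int i = 0; i < A->nrows; i++) {
        double s = 0.0;
        for (int k = A->diag_i[i]; k < A->diag_i[i + 1]; k++)
            s += A->diag_a[k] * x_local[A->diag_j[k]];
        for (int k = A->offd_i[i]; k < A->offd_i[i + 1]; k++)
            s += A->offd_a[k] * x_ext[A->offd_j[k]];
        y[i] = s;
    }
}
```

The offd part is also where the extra cost shows up: the x_ext values have to be gathered from neighboring processes before each matvec, and every access goes through an index array; compare that with the direct strided accesses of the Struct storage above.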
One problem, of course, compared with the structured data structure we saw before, is that this requires a lot of indirect addressing; there is a lot of integer computation going on, and the communication relationships between processes have to be computed as well, which makes the parallel implementation complicated. Certain things are just implicitly known when you have a structured data structure; in the unstructured case you do not know them, anything could be anywhere, so this is necessarily less efficient than the structured data structure. But that is what you have to do if you have an unstructured problem.

OK, so let's talk about complexity issues. One issue comes from the coarse-grid operator: we use the Galerkin product RAP, and when you multiply three sparse matrices with each other, the original matrix might be very sparse, but the stencil size starts to grow. It turns out this is not really an issue for our structured solvers, because there the stencil growth is limited: in the 3D case, if you start with a 7-point stencil, in the worst case you get 27 points on the coarser grids, and that is where it ends. As a matter of fact, for PFMG we even have a non-Galerkin type version where you can keep the same number of stencil points, 7 points all the way down, so you do not have that issue at all. In the unstructured case, however, you cannot rein this in. Here is an example: it shows the communication pattern, but it is kind of nice because it also shows how the matrix develops. In this case it is just 128 cores; imagine what this looks like on a million. And this is not even the finest grid; the finest grid would have only a 7-point stencil, so essentially 128 processes sending messages to a few neighbors each. The figure shows the communication pattern on the second level, then the third, fourth, fifth, and so on, and what you see is that the communication grows: the stencil sizes grow, meaning the number of nonzeros per row increases as you go further down. Of course the matrix also gets smaller in size, which is not clearly visible here, and eventually processors drop out, which is why at the very bottom the problem disappears again. But this growth can cause a lot of performance degradation. So that is one issue.

The other thing is that we really need to look at complexities; this is something that has been watched for algebraic multigrid all along. The operator complexity is generally defined as the number of nonzeros summed over all the operators A_i generated on each level, divided by the number of nonzeros of the original matrix. Generally we would like this to be less than 2 and close to 1, and the reason we want it to be small is that it directly affects the flops as well as the memory. We also look at another complexity, which I call memory complexity; that is not really a common one, but when you look at the BoomerAMG output it will also report a memory complexity. What is different there is that it additionally adds the number of nonzeros of the interpolation operators, so you get a better idea of the actual memory use.
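Written out, with A_0 the original (finest) matrix, A_l the operator generated on level l, and P_l the interpolation operators (the exact indexing here is my notation, not the slide's), the two quantities are:

```latex
C_{\mathrm{op}} \;=\; \frac{\sum_{l=0}^{L} \mathrm{nnz}(A_l)}{\mathrm{nnz}(A_0)} ,
\qquad
C_{\mathrm{mem}} \;=\; \frac{\sum_{l=0}^{L} \mathrm{nnz}(A_l) \;+\; \sum_{l=0}^{L-1} \mathrm{nnz}(P_l)}{\mathrm{nnz}(A_0)} .
```

The target mentioned above, C_op less than 2 and ideally close to 1, thus bounds the extra memory and the extra matvec work of the whole hierarchy relative to the fine-grid matrix alone.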
One way to deal with these complexities, and something we have been doing a lot that also helps improve performance, is to use more aggressive coarsening. One of the things we do is aggressive coarsening, which essentially means we coarsen twice. Obviously that means we also need to use a different, more long-range type of interpolation, to make up for the fact that our points are now further apart, but it really helps a lot to reduce the complexities and improve performance. We also do some other things to reduce communication and so on, which I cannot all talk about here. The message for you is: if you want to use AMG, make sure you look at the complexities (you can print them out), and make sure to deal with it if they turn out to be really bad.

OK, then I just have to say one more thing, and I know you have heard this already from several people; Barry mentioned it first. Algebraic multigrid is often used as a preconditioner for Krylov methods, and this helps to additionally improve performance and get your solution faster. Especially when you use aggressive coarsening, you should really combine it with a Krylov method; the two help each other to give you better performance, faster times, and fewer iterations.
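In hypre's C API, that combination looks roughly like the following. The sketch follows the pattern of hypre's documented example codes (ex5-style usage); parcsr_A, par_b, and par_x stand for an already-assembled ParCSR matrix and vectors, whose construction is omitted here.

```c
#include <mpi.h>
#include "HYPRE_krylov.h"
#include "HYPRE_parcsr_ls.h"

void solve_amg_pcg(MPI_Comm comm, HYPRE_ParCSRMatrix parcsr_A,
                   HYPRE_ParVector par_b, HYPRE_ParVector par_x)
{
    HYPRE_Solver solver, precond;

    /* Outer Krylov solver: preconditioned conjugate gradient. */
    HYPRE_ParCSRPCGCreate(comm, &solver);
    HYPRE_PCGSetMaxIter(solver, 1000);
    HYPRE_PCGSetTol(solver, 1e-6);
    HYPRE_PCGSetTwoNorm(solver, 1);
    HYPRE_PCGSetPrintLevel(solver, 2);        /* print residual norms   */

    /* Inner preconditioner: one BoomerAMG V-cycle per CG iteration. */
    HYPRE_BoomerAMGCreate(&precond);
    HYPRE_BoomerAMGSetMaxIter(precond, 1);    /* exactly one cycle      */
    HYPRE_BoomerAMGSetTol(precond, 0.0);      /* no convergence test    */
    HYPRE_BoomerAMGSetPrintLevel(precond, 1); /* report setup statistics */
    HYPRE_PCGSetPrecond(solver,
                        (HYPRE_PtrToSolverFcn) HYPRE_BoomerAMGSolve,
                        (HYPRE_PtrToSolverFcn) HYPRE_BoomerAMGSetup,
                        precond);

    HYPRE_ParCSRPCGSetup(solver, parcsr_A, par_b, par_x);
    HYPRE_ParCSRPCGSolve(solver, parcsr_A, par_b, par_x);

    HYPRE_BoomerAMGDestroy(precond);
    HYPRE_ParCSRPCGDestroy(solver);
}
```

With the preconditioner's print level turned on, the setup statistics printed here are also where the operator and memory complexities discussed above show up.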
OK, so now it is time to start looking at our hands-on exercises. We will look at a linear system generated by AMReX from the equation shown here, with the boundary conditions given on the slide. Our grid is a 128 cubed grid, which is actually block-structured and consists of at least 8 subgrids; as was explained this morning, if you used more than 8 processes you would get more subgrids, but we will use 8 processes for our hands-on. The right-hand side is plotted here, and with this right-hand side we get the solution shown; it is a variable-coefficient problem. And this is just our usual disclaimer slide. All right, let's go to the hands-on now.

I am already in the right directory; this is the directory you have to go to, under the hands-on AMReX AMG lesson. As I said, this is the problem we are going to look at, and this is an example input file which combines a lot of options. You do not have this exact file, but it is something you could look at if you want to play around later. This is where you set the number of cells, in our case 128 cubed; we will also look at a larger 256 cubed problem, which also has a different number of grids. The tolerance we use is 10^-6 for the iterative solvers.

OK, so let's first solve this problem using PCG and see what happens. What we see is that it took 213 iterations and, looking only at the "real" number for the time (the other timings are ones we are not interested in), 3.2 seconds.

Now let's see what happens if we try to solve the whole thing with PFMG. I should say I have put several input files in for you, already named the right way; you can look into them later, this just keeps things shorter. So let's run PFMG. What do you see? It now took 22 iterations, so it converges about ten times as fast, and it took about half the time. What does that mean for an individual iteration? A PFMG iteration is actually quite a bit more expensive than a CG iteration: there are ten times fewer of them, but the time only halved. So now we should also combine the two and use PFMG as a preconditioner for PCG, as I said before, and see what happens. What do you see now? We get an additional improvement in time as well as in iterations: it converges about twice as fast again, and we get a little additional improvement in time. One iteration is now even more expensive, because we combine a CG step with a multigrid cycle, but overall this gives us our best time so far, 1.24 seconds compared with the original 3.2, so almost three times as fast as just using CG.

OK, so now let's try a bigger problem size, because I want to see what happens when we scale this up. Let's run CG first; that takes a little while, because the problem is about eight times as big, so it will be slower (that is why I already adjusted the tolerance), and it should take about 40 seconds. OK, about 42 seconds, and it took 440 iterations, so it converges about half as fast as before. Now let's run PFMG-PCG and see how fast that is. OK, that took only 11 iterations. What did we see before? For the smaller problem it took 10 iterations, and now it is 11. So if you compare the two: in one case the number of iterations doubled when we increased the problem size, and here we only added one more iteration, so it is almost constant. If you compare the timings, in one case it took 42 seconds and in the other 7, a factor of 6, where before we had about a factor of 3. You can see how CG is starting to edge up, whereas PFMG-PCG does not go up as fast; you see the different types of scalability. (This was not true weak scaling; for that I would have needed 64 processes to run a problem 8 times as big, which I do not have.)

OK, so now let's look at what happens if we run AMG, the unstructured one, as AMG-PCG; we go back to the smaller problem size for this. First of all, this prints out a lot more information. What do you see? For one, we have the problem size, about 2 million unknowns; you see we actually have a 7-point stencil on the finest grid; and then you see the stencil sizes, where the maximal stencil size increases. It definitely goes beyond 27; you get up to 185 in the maximal case further down. We have a total of 8 levels, and down here is the interpolation information. And if you look at the complexities, we have an operator complexity of 3.2. Now remember what I said before about what we like to see for complexities: something close to 1. So this is bad. The number of iterations was 10, similar to the PFMG-PCG case, where we had the same number of iterations; but that took 1.24 seconds, and this takes 4.72 seconds. That is even slower than the plain PCG we ran for this particular problem. Not good.
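In hypre's own C API, aggressive coarsening corresponds to a BoomerAMG parameter (how the AMReX driver's input file exposes it I will not guess); precond here is the BoomerAMG object from the earlier sketch.

```c
/* Aggressive coarsening on the first (finest) level(s): roughly,
 * coarsen twice before building interpolation, which BoomerAMG pairs
 * with a longer-range interpolation to compensate. The finest level is
 * the typical choice, since it dominates memory and work. */
HYPRE_BoomerAMGSetAggNumLevels(precond, 1);  /* levels coarsened aggressively */
```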
So, as I said, in this case we need aggressive coarsening. Remember, we had 8 levels here and an operator complexity of about 3.2. Let's run it with aggressive coarsening and see what happens. OK, we see it now only took 7 levels, so we essentially skipped almost a whole level. We still have some big stencils, but overall it is not nearly as bad: if you look at the complexities, the operator complexity is now 1.36 instead of the roughly 3.2 we had before, so it is much better, and we are actually saving a lot of memory. Our time is now 2.2 seconds. It took more iterations; that is what usually happens when you use aggressive coarsening, you get more iterations, but overall it is much better. It is still slower than PFMG-PCG, which took about 1.2 seconds for this problem, but it is a much more reasonable performance, and it is definitely better than the original plain CG. If you scaled this up further and ran a big problem, you would end up seeing that PFMG still scales better than AMG. So the point here, essentially, is: if you have a structured problem and you are able to use a structured solver that works well, you should use the structured solver, not the unstructured one. But if you have an unstructured problem, AMG works quite well too; you just need to make sure you look at the complexities and set the right parameters, in this case adding aggressive coarsening.

So that is pretty much all I have to say, and we have time for a quick question.

[Audience] As you go down to coarser and coarser grids, do the processors just go idle as their degrees of freedom leave, or do you do some sort of consolidation?

No, right now we do not, so they just go idle, like you saw in that communication pattern: suddenly there was no more communication on the low levels, because a lot of processors no longer participate. At the very bottom, when the system is really small, like a system of size eight or so, we actually do consolidate, but there is not much communication left at that point. We do have an option where you can put everything together on one process at a certain point; sometimes that helps, but it is not real load balancing.
Info
Channel: Argonne National Laboratory Training
Views: 1,257
Rating: 4.8095236 out of 5
Id: QIT85rck79o
Length: 35min 17sec (2117 seconds)
Published: Mon Sep 25 2017