PETSc Tutorial | Lois Curfman McInnes, Satish Balay; Argonne National Laboratory

Captions
I want to acknowledge Jed Brown, Matt Knepley, Karl Rupp, and Barry Smith for their contributions in providing slides for this tutorial material that we're presenting today. This is really just a glimpse of some of the functionality within PETSc, presented at a rather high level given our time constraints. But I encourage you, if you identify a particular aspect that's of interest to you and you want more information, please go to the website; you'll see there's a variety of other, more in-depth tutorial material and also examples, and Satish and I can help guide you to where to look for further information.

Today we're going to follow this outline. We'll first talk a bit about the philosophy of the software package overall, our design, and our approach for achieving good performance on extreme-scale machines. We'll then talk about the fundamental data objects used in PETSc, vectors and matrices, and upon those data objects we'll build to discuss linear solvers, in particular Krylov subspace methods and preconditioners. We'll talk about nonlinear solvers, and then, building on that, time-stepping methods. We'll also discuss optimization solvers within the TAO component of PETSc. Then, if time permits, I'll talk a bit about the support we have for handling topological abstractions, in particular through the distributed-array component; the concepts handled there are important, but it takes a lot of time to go into them in depth, so that might be an area where we discuss things offline or refer you elsewhere for more information. Finally, Satish will talk about using the debugging and profiling capabilities in the library to help understand performance.

First of all, it's important to note that the PETSc software package is really intended as a platform for experimentation. We want to make it very straightforward for not only ourselves but the community to use our software to experiment with models, with different discretizations, different solvers, and different algorithms, and of course the boundaries between those are very blurry. The book shown here, Domain Decomposition, was written by Barry Smith, Petter Bjørstad, and William Gropp. The reason the PETSc library started is that Bill and Barry were trying to collaborate and share information about domain decomposition algorithms, but at the time they started, in the early 90s, there really was no software available that made it easy for them to do that. That is what spearheaded the development of the PETSc software package from the start. I joined to work with them in 1993; they started developing PETSc capabilities in 1991, and we've been continuing thereafter.

PETSc stands for the Portable, Extensible Toolkit for Scientific Computation. By portable I mean that we aim to run on a variety of different kinds of architectures, including the extreme-scale machines that are the emphasis of this program, but we also are very committed to supporting the whole range of architectures and machines, from individual laptops through small clusters of workstations. In fact, many people do most of their development on those kinds of machines and only at the later phases of their work run on the extreme-scale machines, so that whole spectrum is very important to us. We support a range of different operating systems; the software is usable from C, C++, Fortran, and Python; and we support a variety of different precisions and both real and complex arithmetic.
We support the solution of very large systems, with billions of unknowns, with very good scalability. The software is free for everyone to use, with a BSD-style license, so we encourage people to use it. We also follow an open development model: many people in the community contribute to the software, and if you're interested in doing that, we invite you to do so as well; our website provides information on how to proceed.

The toolkit really emphasizes extensibility. As I'll explain, with the design of these various objects the intention is that everything has a so-called plug-in architecture, so that in addition to the functionality that's already in the library, someone else can provide additional implementations, and then those are available for use through the same abstract interface. An example where that might be interesting is a vendor wanting to supply a particular implementation; it would then be available to users. Our approach, being a toolkit, is that we provide not only algorithms and implementations of those, but also some capabilities to help with debugging and low-overhead profiling; Satish will explain some of those near the end of the presentation.

We really do emphasize composability. As Lori mentioned, there are lots of different aspects of solving problems, and it's important to be able to combine different algorithms and data structures, to compose things together, to handle especially the large multiphysics problems that are of strong interest in extreme-scale computing. We know that it's impossible to select the best algorithms and data structures for your problem a priori, so we emphasize exposing what we call an algebra of composition, so that you as a user can pick and choose from among various capabilities and then experiment and refine as you move along. The solvers are designed to be, in some sense, decoupled from the physics and discretization, so you can use whatever mesh management and discretization infrastructure you like (you'll hear about some of those from our collaborators in FASTMath this afternoon) and then interface to PETSc for the solver aspects of what you're doing.

The toolkit is used by a wide range of computational scientists; this slide provides information on a few of the projects built upon PETSc. And I want to emphasize that there are many tutorial-style examples provided with the library for the various aspects of functionality. Through the website it's very easy to explore and to find working, running code for many kinds of problems, and it's then possible to start by using that code in your own context, replacing some parts of the existing working code with parts that are unique to your application. That often gives a very effective starting point for working on new applications.

This structure is intended to make it easier to experiment and develop new capabilities, but we still need to acknowledge that working on large PDE problems is a very difficult endeavor, and so the PETSc software is not a silver bullet, as stated by this quote from Barry Smith. Many of you may know that Barry is the lead developer of PETSc, and he's really the spirit that has continued to provide guidance for the project. Overall we have contributions from many, many people in our development team and through the community, but Barry really has provided a lot of sustained guidance and design.
It's possible to obtain PETSc by using our git repository, and you can also download a tarball if you like. Once you download the package, you can configure it with various options, including installing some of the external software that we talk about today. It's possible, as shown here, to configure the software so that it downloads and installs ML, hypre, SuperLU, and some of the other complementary capabilities provided in FASTMath software developed by our friends at other institutions, so that you can experiment with different algorithms at runtime. Most of these external packages can be automatically downloaded and installed by configure; the list here is partial, certainly not complete, but it's intended to make it a bit easier to use tools developed by different groups.

As I mentioned, we use a git repository to manage our software. This graph shows the contributions to the repository from 1994 through 2016, basically just showing we're a very active project with many contributions, and we welcome yours.

This diagram shows the structure of the PETSc library overall. It's organized in a way that is consistent with how many algebraic solvers are organized: we build on top of computation and communication kernels at the lowest level. I know you've already heard lots of great information about MPI and other communication facilities; those are our foundation, and we use them. Built on top of that, we have a very lightweight profiling interface that understands the various objects and the design of the library, and that can help application programmers understand how their programs are organized and where the time is being spent. Layered on top of that, we have the data objects used for distributed computing: matrices, vectors, and also index sets. Indices indicate particular items within a global problem that are local to certain processes; as Lori indicated, when we're dividing problems across multiple processes we need to determine how to communicate from one process to another, and index sets are a way that we help to handle that in PETSc. Built on top of that, we have linear solvers, which consist of Krylov subspace methods and preconditioners; on top of that, nonlinear solvers, which typically require the solution of linear systems underneath (not always, but often; there are some that work directly just with vectors and matrices). Alongside those we have the optimization software in the TAO package, which likewise builds on the linear solver capabilities. And on top of all of that we have ODE integrators, which can interface to the nonlinear solvers and linear solvers as needed. During our presentation we'll talk about how these various aspects of functionality work together.

As I mentioned, our libraries are written in such a way that users don't generally have to work directly with MPI. Through these data objects and algorithmic objects, you as a user can rely on the library to handle much of the interprocess communication. The algorithms as we write them are what we call data-structure neutral: they access the communication through the objects for vectors and matrices.

This diagram shows how one might think about the interface between code provided by the application or user, shown in blue, and code provided by the PETSc library, shown in the intermediate green box and the sub-boxes within it.
Typically we would see a user's application interfacing with time-stepping code, which internally uses whatever capabilities are needed for nonlinear solvers, linear solvers, and so on. From within the nonlinear solvers, for example, the package then passes control back to the user for things like evaluating a nonlinear function, where you're doing the real meat of your work to describe the physics of your problem, and likewise evaluating a Jacobian if you're choosing an approach that needs derivative information. We'll describe in more detail how we handle going back and forth between those layers.

Each PETSc program begins in your application space by initializing PETSc; this gets our data structures set up, and it can set up MPI if that hasn't been done already. You can use whatever subset of processes you find appropriate for that aspect of the work, so you can either run across all processes or across a subset; you might be using another subset of processes for something else. At the very end of your program you call PetscFinalize, which calculates a logging summary, shuts down, and releases resources, and, depending on what options you activate, it can help to check for unused runtime options, memory leaks, and so on.

Each object, or each layer of functionality, has a basic interface as described by this slide. Conceptually, whether you're working with a nonlinear solver or a preconditioner or whatever, you create the object; you can name the object; you set the particular type of the object; you can set its options prefix, which enables you to tag that particular object in order to set runtime options unique to it; you can provide customization from the command line, setting runtime choices, for example for convergence parameters; you initialize the object with the setup call; you can view the object to understand the details of what's happening inside it; and finally you destroy it when you're done. All of the objects support this basic set of functionality, and they all support the -help option.

Let's start now with vectors and matrices. PETSc vectors are fundamental objects that represent field solutions, right-hand sides, and other kinds of data that make sense to store in a vector. Each process in a parallel setting locally owns a subvector of contiguous global data. The commands for creating vectors are shown here: you generally do a global creation command, you can set the sizes owned by each particular process, you can choose the particular type of vector, and you can set various runtime options. From within that vector, you as a user can directly access the memory, so you have direct access to the values; you can do various vector-space operations such as dot products and norms; and there's functionality to help communicate automatically when doing parallel assembly, which I'll talk more about in the next slides. There's also custom communication available for scatters between different vectors.

This is just emphasizing the concept of collectivity in parallel: MPI communicators specify the processes involved in any particular computation, and that is indicated at the time when you create a vector or any other object. Some operations are collective, such as computing a norm, and some are not collective, such as getting the local size of a vector. If you're doing sequences of collective calls, they must be done on each process in the same order.
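The slides show the actual calls; as a minimal sketch of that create/set-options/view/destroy lifecycle in C, here is a complete PETSc program for a vector, where the global size n = 100 is an arbitrary choice for the example:

```c
#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec            x;
  PetscInt       n = 100;  /* arbitrary global size for this sketch */
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  ierr = VecCreate(PETSC_COMM_WORLD, &x);CHKERRQ(ierr); /* create on a communicator */
  ierr = VecSetSizes(x, PETSC_DECIDE, n);CHKERRQ(ierr); /* let PETSc choose local sizes */
  ierr = VecSetFromOptions(x);CHKERRQ(ierr);            /* honor runtime options, e.g. -vec_type */
  ierr = VecSet(x, 1.0);CHKERRQ(ierr);                  /* set all entries to 1 */
  ierr = VecView(x, PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr); /* "view" the object */
  ierr = VecDestroy(&x);CHKERRQ(ierr);                  /* destroy when done */

  ierr = PetscFinalize();                               /* logging summary, cleanup */
  return ierr;
}
```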
The way we assemble parallel matrices and vectors is quite flexible. Processes in general can set whatever entries they want; they don't need to set only those owned by their local part, and PETSc will automatically communicate data as needed in an assembly phase. In more specific terms, for the case of vectors there's a three-step process: on each process the code either sets or adds values to a vector, then initiates communication to begin sending values to other processes if need be, and then completes that communication. That's done by these sequences of calls. VecSetValues indicates whether you're inserting values into a vector or adding values: typically, if you're doing a finite-difference discretization you'd be inserting values, while with finite elements it's often more natural to use ADD_VALUES, where you're building up contributions from particular elements. After that phase we call VecAssemblyBegin and VecAssemblyEnd in order to do any communication that may be necessary for vector entries that were not set locally.

Here's an example of one way to set the elements of a vector: we determine the local size of the vector and the rank of our process, and we say that if our rank is 0 then we set values for the whole vector, and then we call the assembly routines. This is a simple way to get started with a program, for example on the first day when you're writing some code, but as we can all tell it's not a good choice over the long term, because if you have a thousand processes, 999 of them are idle while process zero is setting values into the vector. In general, a better approach is to have each process set the values for the segment of the vector that it owns locally, and that's what's shown here: we determine the ownership range of the vector for each process, have each process do its own local part, and then finalize the assembly.

Now, this approach looks very straightforward and simplistic, but it doesn't acknowledge what happens in most codes, where typically some values are not locally owned, as shown in this diagram. When we're discretizing, we need to decompose, for example, a mesh across multiple processes, as represented simplistically here for both a structured case on the left and an unstructured case on the right. Typically, when assembling values, for example to compute the nonlinear function for a physics problem, you would be looping in your user-provided code across the segment of the problem that your process owns. In a finite-element sense, each process would own a segment of elements, such as those shown by the red dots, but as you're computing you would need information for neighboring nodes that are actually stored and owned by other processes, shown by the blue dots; those are often called ghost values or ghost nodes. So during the assembly process, you can assemble based on what is natural for you with your partitioning, say, of finite elements, and any entries that are not owned locally by you are then communicated during the VecAssemblyBegin and VecAssemblyEnd stage, and they end up where they need to be, so you don't need to worry about the explicit message passing.
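As a sketch of that three-step assembly (assuming the Vec x and the error-checking conventions from the previous example), each process inserts only the entries in its own ownership range; entries set for other processes would be routed automatically during assembly:

```c
/* Each process sets only the entries it owns; any off-process
   entries set here would be communicated during assembly. */
PetscInt    lo, hi, i;
PetscScalar v;

ierr = VecGetOwnershipRange(x, &lo, &hi);CHKERRQ(ierr);
for (i = lo; i < hi; i++) {
  v    = (PetscScalar)i;                             /* arbitrary value for the sketch */
  ierr = VecSetValues(x, 1, &i, &v, INSERT_VALUES);CHKERRQ(ierr);
}
ierr = VecAssemblyBegin(x);CHKERRQ(ierr);            /* start any needed communication */
ierr = VecAssemblyEnd(x);CHKERRQ(ierr);              /* complete it */
```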
The part of the PETSc software package that helps deal with this transitioning between local and global views of meshes is the data management part of things, which I'll talk about, if we have time, towards the end of the presentation. Basically, we have capabilities that help automate the transitions needed to go between these local and global representations.

Sometimes it's more natural for you as an application to work directly with the locally owned arrays of your vectors, and that's possible by directly accessing the arrays. This doesn't do a copy; VecGetArray just gives you access to the data in your vector's array. You can then add or modify values, and when you're done you indicate so by calling VecRestoreArray. That makes it safe for you as an individual process to modify elements and then notify the rest of the computation when you're finished.

Matrices work in a way that's similar to vectors. In general, a matrix is a linear transformation between finite-dimensional vector spaces, and by forming or assembling a matrix we mean that we define its action in terms of the entries of the matrix, which for PDE-based problems are typically stored in a sparse format. In PETSc you create matrices and use them through this sequence of commands: you first call MatCreate to indicate the processes involved in the computation; you set the various sizes, including the locally owned parts; you determine the type and set a variety of options; and optionally you can preallocate matrix memory, which in practice is very important for good performance. You can also set block sizes if you have multiple degrees of freedom per node, which is another very important aspect for performance. And then you can call MatSetValues, which is analogous to VecSetValues. So there's a single user interface for matrix functionality but multiple underlying implementations. A common one is what we call AIJ, which stores the nonzeros by row; there are block variants of that and symmetric variants; there are also dense storage formats and so-called matrix-free matrices, where the action of the matrix on a vector is defined but we don't actually store data for the matrix entries themselves. An important concept is that, in our view, a matrix is defined by its interface, not by the specifics of the data structure used for storage.

This diagram shows the way we split parallel matrices across processes. Each process owns a submatrix of rows, as indicated here where we have six processes, and in practice the storage is broken down into the diagonal blocks, shown in blue, and the off-diagonal blocks. When we're assembling matrices, this is the layout we mostly use. We're not going to go in detail through matrix assembly; I'll just say that it's a similar approach to what we described already for vectors, that is, loop over parts of your rows and insert values. This particular example shows how you might do this by inserting all entries from the rank-0 process, which might be a good way to get started when you're first writing an application, but of course it's not what you want over the longer term. The next slide shows what you really want to do: generally, loop over the rows of the global problem that your local process owns and set those values. An advantage of this approach is that it's much more scalable, and the code is still quite straightforward and simple. Similarly to what we described for vectors, the MatAssemblyBegin and MatAssemblyEnd phase then handles any communication that is needed, so you're not required to set every single entry locally on the process that owns that sub-block. In practice most application codes do need to transfer some data during the assembly-begin and -end phase, but you as an application programmer aren't dealing with the details of those MPI calls; that's all handled within MatAssemblyBegin and MatAssemblyEnd.
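Here is a small illustrative sketch, not taken from the slides, of assembling a tridiagonal (1D Laplacian-style) matrix where each process inserts only its locally owned rows; MatSetUp stands in for explicit preallocation to keep the example short, though real codes should preallocate for performance (n is the global size from the earlier sketches):

```c
/* Assemble a 1D Laplacian: each process inserts only its locally owned rows. */
Mat         A;
PetscInt    lo, hi, row, cols[3];
PetscScalar vals[3] = {-1.0, 2.0, -1.0};

ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
ierr = MatSetFromOptions(A);CHKERRQ(ierr);   /* e.g. -mat_type aij */
ierr = MatSetUp(A);CHKERRQ(ierr);            /* real codes: preallocate instead, for performance */

ierr = MatGetOwnershipRange(A, &lo, &hi);CHKERRQ(ierr);
for (row = lo; row < hi; row++) {
  if (row == 0 || row == n - 1) {            /* boundary rows: diagonal entry only */
    PetscScalar one = 1.0;
    ierr = MatSetValues(A, 1, &row, 1, &row, &one, INSERT_VALUES);CHKERRQ(ierr);
  } else {
    cols[0] = row - 1; cols[1] = row; cols[2] = row + 1;
    ierr = MatSetValues(A, 1, &row, 3, cols, vals, INSERT_VALUES);CHKERRQ(ierr);
  }
}
ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
```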
Next we'll talk about iterative solvers, which build on top of the vector and matrix data structures. We talked about some matrices having explicit storage and some not; even if you don't explicitly store a matrix, you can still use Krylov iterative solution methods. Jim talked about this earlier today; the key point is that we build up Krylov subspaces through matrix-vector products, and because of that we only need the matrix to provide that one piece of functionality, the matrix-vector product, in order to be able to write a whole variety of Krylov-method algorithms. There's no way to know a priori which particular Krylov method will be most effective for a given problem; our approach in PETSc is, through this simple interface, to provide access to many different algorithms. To use the Krylov methods, if you're working directly with a linear solver, you simply set the operators by calling KSPSetOperators: this indicates the matrix that defines the linear problem you want to solve and, optionally, a different matrix from which to build the preconditioner (we'll talk more about preconditioning in a moment). Then we call KSPSolve to solve the system, where you indicate the right-hand-side vector and also the solution vector, as sketched below. From that you can access the preconditioner, and you can activate a variety of different algorithms from the command line.

This slide shows a range of some of the solvers. We don't have time in this tutorial to explain the theory behind all of these, but some of the slides online explain the algorithms in more depth for those of you who are interested. I'll just mention that the GMRES family of algorithms is often used in practice; for many problems it's the default solver, but there are many others available. The green link there, if you click on it in your PDF, will take you to a table of all the solvers accessible and available through the PETSc libraries.

Preconditioners can be incredibly important for the practical use of Krylov methods. The idea, which Lori introduced, is to improve the convergence of the Krylov method by applying a preconditioner to the problem. That is, we use another matrix, represented here by P, where the product AP is more amenable to convergence than the original matrix alone. In practice we don't typically form these matrix products explicitly; rather, we apply algorithms that use the action of an approximate inverse of the matrix through the preconditioner. Oftentimes applications use a preconditioner that is slightly different from, and somehow cheaper than, the full linear operator that defines the problem, and figuring out the exact way to do that most effectively for a given problem is really an art.
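A minimal sketch of the linear-solve sequence just described, assuming a matrix A and vectors b and x assembled as in the earlier sketches:

```c
KSP ksp;

ierr = KSPCreate(PETSC_COMM_WORLD, &ksp);CHKERRQ(ierr);
ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr); /* same matrix as operator and preconditioner basis */
ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);     /* e.g. -ksp_type gmres -pc_type ilu -ksp_monitor */
ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);        /* b: right-hand side, x: solution */
ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
```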
There's lots of problem-specific information that can help you make good choices, and some of our other speakers today will talk about that. This is a list of various preconditioners available in PETSc. Jacobi is the simplest; I'm sure many of you are already familiar with many preconditioners. Jacobi is just diagonal scaling; block Jacobi is conceptually the same thing extended to blocks, and that's often a first choice for a preconditioner in parallel because it's very simple to understand. Then there are SOR, additive Schwarz, incomplete factorizations, multigrid, field splitting, approximate inverses, substructuring, and matrix-free approaches. Again, we don't have time to go through the algorithmic theory of all of these, but I encourage you to listen carefully when Rob Falgout, Sherry Li, and others speak later this afternoon; they'll talk in much more depth about what the algorithms are and how important it is to make good choices for algorithms and implementations. I'd just like to emphasize that the so-called matrix-free approach here means that you as an application can provide preconditioner functionality that adheres to our object interfaces but is implemented in some specific way that exploits problem-specific information. You can then more easily provide what makes sense for your problem, while keeping the flexibility to experiment with and compare against the range of other functionality in the library.

We don't have an opportunity to explain multiphysics-type preconditioners in detail, but since this is one of the newer aspects of functionality, and very important as we scale up toward extreme scale, I wanted to emphasize our field-split type of preconditioner. It's intended to make it easier to tackle problems where you don't necessarily want to treat your whole global problem as one, but instead want to exploit the fact that you're bringing together multiple parts that come from different physics or different aspects of modeling. We support a variety of approaches here; there's more specific information online that provides details, and also some examples that explain how to use it. Part of what's important to emphasize is that codes can compose different preconditioners, including block preconditioning and multigrid preconditioning as needed, and this custom combination of algorithms can happen at runtime by activating command-line options; the code does not depend on specific details of the matrix format. So there's a lot of flexibility, as sketched below.
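As a sketch of that flexibility, here is one way to select a preconditioner in code while leaving it overridable at runtime; the options shown in the comments are standard PETSc command-line options:

```c
PC pc;

ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
ierr = PCSetType(pc, PCBJACOBI);CHKERRQ(ierr); /* block Jacobi: a simple parallel first choice */
/* The choice can equally be composed entirely at runtime, e.g.:
     -pc_type asm                (additive Schwarz)
     -pc_type fieldsplit         (block preconditioning by physics fields)
     -pc_fieldsplit_type schur   (Schur-complement splitting)              */
```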
Now, built on top of the linear solvers we have nonlinear solvers. This is a list of the methods supported in PETSc; in addition to the methods listed here, you can also define a shell nonlinear solver, which can be whatever you as an application developer want to implement. SNES is our nonlinear solver component. Runtime options can be provided by the application to choose the type of solver and the convergence tolerances, and you can activate various kinds of monitoring so that you can watch convergence, view the details of the algorithm, and so on. This same kind of approach also applies to the lower levels we just described, that is, to the preconditioners, the Krylov methods, and the other algorithmic components; they all work in effectively the same way.

Newton methods are a workhorse nonlinear solver, and the main thing to understand, from the perspective of this presentation, is that the user code has a nonlinear function, and you as an application are responsible for evaluating that nonlinear function; our code for the nonlinear solver then handles everything else. Going back to our picture of the flow of control of an application, this is a case where the nonlinear solver needs to interact with the user code for function and possibly Jacobian evaluation, so we have callback functions that handle that. When using a nonlinear solver, you call SNESSetFunction to indicate the code that will do the function evaluation, and likewise for the Jacobian evaluation, and you can provide whatever information you need through a context variable. The PETSc software never sees the application data in that context; it's just a pointer that's passed through, so that you as an application can do whatever you need.

Here are more specific details on how the user-provided function actually looks on your end. It takes as input the current solution vector; its output is the residual, or nonlinear function; and again there's the user context that passes application-specific information. Likewise, the Jacobian routine that you as an application provide has a similar setup: as input you take the current solution vector, and then we indicate the data structures used to store the matrix and, optionally, a different matrix to use for preconditioning. It's not required to use a separate matrix, but some applications choose to. Here you can also indicate whether the nonzero sparsity pattern of the Jacobian matrix will stay the same throughout multiple nonlinear iterations or will change, because we can get better performance by exploiting things that remain the same across iterations.

As alternatives to the application computing a Jacobian, there's a built-in capability for a sparse finite-difference approximation using so-called coloring: it makes multiple calls to your function evaluation code in order to build up the Jacobian information. That's a great way to get started if you want to use a method that requires derivatives but you don't have derivative code handy, because writing it, as any of you who have done it know, is very tedious, labor-intensive, and error-prone. There's also support for automatic differentiation tools, which can generate code, based on your function, to compute the derivatives.
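Here is a hedged sketch of the callback pattern just described; AppCtx is a hypothetical user-defined struct, and the residual computation itself is elided:

```c
/* User-provided residual: PETSc passes in the current iterate x
   and expects the nonlinear function value in f. */
typedef struct { PetscReal param; } AppCtx;  /* hypothetical application context */

PetscErrorCode FormFunction(SNES snes, Vec x, Vec f, void *ctx)
{
  AppCtx *user = (AppCtx*)ctx;
  (void)user; /* silence unused-variable warning in this stub */
  /* ... compute f(x) here using application data in user ... */
  return 0;
}

/* In the application's setup code: */
ierr = SNESCreate(PETSC_COMM_WORLD, &snes);CHKERRQ(ierr);
ierr = SNESSetFunction(snes, r, FormFunction, &user);CHKERRQ(ierr); /* r: residual work vector */
ierr = SNESSetFromOptions(snes);CHKERRQ(ierr); /* e.g. -snes_monitor, or -snes_fd for an FD Jacobian */
ierr = SNESSolve(snes, NULL, x);CHKERRQ(ierr); /* x holds the initial guess, then the solution */
```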
Now, layered on top of the linear and nonlinear solvers are the time integrators. This slide shows the various ODE forms supported. Most generally, in the top equation, we have a function G of time, the unknown x, and its time derivative, set equal to another function F of time and the unknown: G(t, x, x') = F(t, x). In a simpler form, shown in the third equation, we consider a mass matrix M, with M x' = f(t, x), and if the mass matrix is just the identity, we have the simplest form, shown in the bottom equation: x' = f(t, x). The user who is solving an ODE then provides callback functions for time integration using the three commands below. Basically, you provide a right-hand-side function to evaluate F; that's the part where you would generally be using an explicit-type method, the part you don't necessarily need to treat implicitly. The G part, on the left-hand side of that first equation, is the part of the function for which you want an implicit solver, where for some reason you need an implicit algorithm. So you can provide both F and G, and then optionally also the Jacobian of the function G.

I alluded to explicit and implicit methods; Carol Woodward will talk about this in much greater detail, but basically, explicit methods for time integration are relatively easy and accurate, but the challenge is that they must accurately resolve all time scales, so they're not good for stiff problems. On the other hand, implicit methods are much more robust and can handle stiff problems, but they're harder to implement and they require linear solves. The combination that blends these two together is called implicit-explicit (IMEX) methods, and they enable you to treat the different parts of your problem that are amenable to explicit and implicit approaches accordingly. These are often good choices for multiphysics applications, where you have different kinds of behaviors in different regions of your problem. The interface I showed here is set up to make it easy for you to do that in your application code. This slide shows just a few examples of the time-stepping methods we have; in your PDF file, if you click on the green highlighted part at the bottom, you'll see the full list of options. There's a variety of Runge-Kutta methods, which are really the workhorse of much of what we do, and then a variety of other choices; we won't go into the details due to time constraints.
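A minimal sketch of driving the time integrator for the simplest form x' = f(t, x), assuming a user context and solution vector as in the earlier sketches:

```c
/* Right-hand-side callback for the form x' = f(t, x). */
PetscErrorCode RHSFunction(TS ts, PetscReal t, Vec x, Vec f, void *ctx)
{
  /* ... evaluate f(t, x) into f ... */
  return 0;
}

ierr = TSCreate(PETSC_COMM_WORLD, &ts);CHKERRQ(ierr);
ierr = TSSetRHSFunction(ts, NULL, RHSFunction, &user);CHKERRQ(ierr);
ierr = TSSetFromOptions(ts);CHKERRQ(ierr); /* e.g. -ts_type rk, or -ts_type arkimex for IMEX */
ierr = TSSolve(ts, x);CHKERRQ(ierr);       /* x: initial condition on entry, final state on exit */
```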
Part of what's important to understand is that you want to think about what you're doing holistically. This diagram shows that in practice you have many choices at all layers of the algorithms you're using for a problem. We advocate that applications interface to PETSc at the highest algorithmic level that makes sense for the problem; in particular, in many cases interfacing at the time-stepping level is the best choice, and then underneath, PETSc can automatically do some smart things to pull together other aspects of algorithms and data structures, while you as an application still have the ability to customize choices at runtime for those lower layers as well. It's important to keep in mind that experimenting with algorithms is very critical: there's no way to get optimality without an interplay between the physics of what you're modeling and what the algorithms are set up to handle well. We can do that by using flexible programming; it's important not to hardwire algorithms and data structures, but to set yourself up so that you can make choices as you learn more about your problem and as you evolve it to tackle harder physics and other aspects of functionality. Another key point is that it's very important to work with your real code, with all the gory details of what you're modeling. It's also useful to experiment with simplified models, but we really are set up to handle things that are much more complicated, and the code is going to run much better if you experiment with those choices in the context of your full simulation. As Satish will discuss in a few minutes, it's important to profile your code as you're building it up and working with it, and within PETSc there are capabilities to automatically build up that profiling and use it very effectively.

We've already talked about how PETSc is structured so that an application code just calls it as a library: it doesn't seize control of your main program, and it doesn't control the output; it propagates everything up to your main program, and it also propagates errors from underlying packages. And you indicate the communicator on which you want to call the library. If you have an existing code, you might find it useful to incorporate existing data structures and algorithms that you already have, so that you could, say, compare those with some of the others provided in the library. That is all possible using the various capabilities for so-called shell matrices and shell preconditioners, which enable you to wrap your existing functionality and then call it through the PETSc interface. For example, if you have a matrix data structure you want to use, you can call it through the higher-level nonlinear solvers by using this approach. We can also directly use memory that you've allocated for vectors and matrices, by calling the commands shown here; for example, VecCreateMPIWithArray will not allocate fresh array space but will use the array that you provide to it.

PETSc provides a uniform interface across multiple languages, and again you have a choice about the level of abstraction you use when interfacing to the software; higher is generally better from our perspective. In that vein, it's great to take advantage of other software packages out there that already provide higher levels of abstraction, and here's a list of some functionality provided by external software teams that leverage PETSc: finite element modeling, finite volumes, eigenvalues, wave propagation, micromagnetics, and so on. There are a lot of choices, and it's worth taking a look at what's out there; from those you still have access to all the algorithmic customizations we just talked about.

One aspect that I'll talk about in more detail next is numerical optimization, within the Toolkit for Advanced Optimization, or TAO. Previously this had been an external package; it's now part of PETSc and is distributed with it, and the developers of that functionality are Todd Munson, Jason Sarich, and Stefan Wild. The TAO package focuses on nonlinear optimization problems; it can optionally support variable bounds or complementarity constraints, and some support is also provided for PDE-constrained applications. Basically, TAO provides a suite of iterative nonlinear optimization algorithms: generally, each iteration computes a search direction, and then function values and gradients are calculated until various conditions are met for convergence. Newton-type methods, quasi-Newton methods, conjugate gradient methods, and derivative-free methods are all provided, and in different circumstances various choices make sense. One area that has received a lot of attention lately is derivative-free methods, which are a good choice if you have a very expensive function evaluation. For example, if your code for a function evaluation takes a hundred hours to run, you probably don't want to compute gradients (first derivatives) and Hessians (second derivatives), because the time would be prohibitive; in that case the derivative-free option may be a good choice to consider. Here's a list of the various solvers available in TAO, and the manual pages for TAO, linked there, have a lot more information.
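A hedged sketch of the TAO calling sequence, using the routine names from PETSc releases of that era (some have since been renamed); FormFunctionGradient is a hypothetical user callback returning the objective value and gradient:

```c
/* Hypothetical user callback: evaluate the objective and its gradient at x. */
PetscErrorCode FormFunctionGradient(Tao tao, Vec x, PetscReal *fval, Vec grad, void *ctx)
{
  /* ... compute the objective value into *fval and the gradient into grad ... */
  return 0;
}

ierr = TaoCreate(PETSC_COMM_WORLD, &tao);CHKERRQ(ierr);
ierr = TaoSetType(tao, TAOLMVM);CHKERRQ(ierr);  /* limited-memory quasi-Newton */
ierr = TaoSetInitialVector(tao, x);CHKERRQ(ierr);
ierr = TaoSetObjectiveAndGradientRoutine(tao, FormFunctionGradient, &user);CHKERRQ(ierr);
ierr = TaoSetFromOptions(tao);CHKERRQ(ierr);    /* e.g. -tao_type pounders (derivative-free) */
ierr = TaoSolve(tao);CHKERRQ(ierr);
```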
Now I'm just going to briefly touch on distributed arrays, which represent our topological abstractions. We talked earlier about building up matrices and vectors, and we need some way for those linear algebra objects to talk to meshes or grids. A DM is the part of PETSc that handles that; it provides a way of interfacing between the so-called grid and the vector, and it supports parallel data layout, refinement and coarsening, transfer of ghost values, and matrix memory preallocation. Briefly, we have support for a few different approaches. DMDA is the support for Cartesian meshes: these are logically regular meshes that are very nice for simple finite-difference problems, and some of the examples you'll see today use them. DMPlex is more general topology support that can handle any dimension; it's more complicated but more flexible. DMComposite enables us to pull together multiple DMs, which is important for multiphysics support. Then we also have DMNetwork, which is useful for discrete networks such as power grids or circuits, and DMMOAB, which is an interface to the MOAB unstructured mesh library that Vijay will talk about this afternoon. We don't have time to go through these in detail, but we'd be delighted to chat with you offline, or you may want to go to the manual pages as indicated here.

Basically, the distributed array, which is our logically regular support, helps handle the transition between the parallel layout and the local layout that's used for evaluating, say, nonlinear functions, and it can also help with refining meshes and hierarchies. We looked at this diagram before, which shows what we mean by local nodes, given in red, and ghost nodes, given in blue. I won't go through the details of how the distributed array works, but on your own time you can take a look at the slides. The main point is that the distributed array helps orchestrate the transition between the local view seen by each process and the global view, which is used by our vector, matrix, and solver objects.
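As a sketch of that DMDA workflow (a 64x64 grid, one degree of freedom per node, and a width-1 star stencil are arbitrary choices here), a global vector holds the non-overlapping parallel layout while a local vector includes the ghost nodes:

```c
DM  da;
Vec xglobal, xlocal;

/* 2D Cartesian grid: 64x64 nodes, 1 dof per node, star stencil of width 1 */
ierr = DMDACreate2d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                    DMDA_STENCIL_STAR, 64, 64, PETSC_DECIDE, PETSC_DECIDE,
                    1, 1, NULL, NULL, &da);CHKERRQ(ierr);
ierr = DMSetFromOptions(da);CHKERRQ(ierr);
ierr = DMSetUp(da);CHKERRQ(ierr);

ierr = DMCreateGlobalVector(da, &xglobal);CHKERRQ(ierr); /* parallel layout, no ghosts */
ierr = DMCreateLocalVector(da, &xlocal);CHKERRQ(ierr);   /* local layout including ghost nodes */

/* Fill the ghosted local representation from the global vector */
ierr = DMGlobalToLocalBegin(da, xglobal, INSERT_VALUES, xlocal);CHKERRQ(ierr);
ierr = DMGlobalToLocalEnd(da, xglobal, INSERT_VALUES, xlocal);CHKERRQ(ierr);
```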
Next we'll talk about performance. We've gone to great effort to put together lots of functionality for solvers that can be composed in different ways; how do you determine how well you're doing? I expect that in earlier parts of this program you've already talked about scalability, and these diagrams show what looks good for strong scalability as well as weak scalability; depending on your goals, you want to be considering both. One thing to keep in mind is this quote from Bill Gropp, who I know spoke earlier this week; Bill is a founder of our PETSc project, and he pointed out that the easiest way to make software scalable is to make it sequentially inefficient. Of course none of us wants that; we want efficient software. In order to function efficiently and understand what your program is doing, you need a performance model, one that takes into account memory bandwidth, latency, various algorithmic parameters, and so on, so that you fully understand where to focus your time when you're trying to do better with your code. We don't have time today to go into the details, but there are some slides online at the PETSc website if you're interested in learning more about performance modeling, and I know some of our other speakers in later sessions will address it. I just want to make clear that it's important for you as an application person to understand what your code should be doing, so that as you begin to profile and understand performance, you know whether you're doing reasonably well or whether you have opportunities to improve.

OK, so now we're going to switch gears, and Satish will talk about some of the mechanics of debugging and profiling.

[Satish Balay:] So, Lois covered most of the mathematical aspects of the library, but when you're developing your application you usually also have to deal with software development, and PETSc has some built-in tools to help with that. Normally we expect users to develop software on their laptops or workstations, and then, once the code is developed and debugged, to run on the big machines; some of the tools we have inside PETSc are useful when you're developing on your laptop, and some on the big machines. I'm sure some of the topics of debugging have already been covered, but this is a glimpse of a basic feature of PETSc that helps you with starting a debugger. As Lois indicated, PETSc supports a lot of command-line options to switch between various functionality, and one of the features we provide is an option for starting in the debugger. With sequential debugging it's usually easy to just run the debugger on your application, but the challenge is when you're running in parallel and trying to debug; that's not so easy unless you have access to a parallel debugger. PETSc provides you a way of approximating a parallel debugger, in the sense that it spawns a debugger session for each and every process you might be running in parallel, or you can choose a subset of the processes you want to debug. Say, for example, you're running on 64 nodes and you want to figure out what's happening on nodes 2 and 3; you can use the -debugger_nodes option with those ranks, and then the debugger will be spawned on only those two nodes. Usually the debuggers come up in windows on xterms so that you can control them, and we support various debuggers: gdb, lldb, and so forth. Sometimes on clusters it's not easy to debug without a parallel debugger, but as long as you can forward an X display you can still do it, because our default debuggers spawn through xterms.

PETSc also has a lot of error checks all over the code, so in many cases when an error occurs within the library, the PETSc error handler is called; usually that's handled by the function PetscError. If you're running the code in a debugger, you can just put a breakpoint in the PetscError routine and then see exactly where the error happened; if not, PetscError is used to print a stack trace on the screen so that you get an idea of where the error occurred. With respect to memory, we keep track of most of the memory allocated within the library using the functions PetscMalloc and PetscFree, which are wrappers over the basic system malloc and free. We also have some extra macros that can check for memory corruption by checking the bounds of the memory allocated with PetscMalloc and PetscFree; the CHKMEMQ macro can be used by a user throughout the code to check whether there is corruption at any given place. But with better tools available now, like valgrind, much of this functionality is not needed; we strongly recommend that our users use valgrind when trying to figure out any memory corruption issues.
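A small sketch of these error-checking and memory-checking conventions inside a user routine:

```c
PetscErrorCode MyRoutine(Vec x)
{
  PetscReal      nrm;
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = VecNorm(x, NORM_2, &nrm);CHKERRQ(ierr); /* on error: print a stack trace, propagate up */
  CHKMEMQ; /* verify the guard regions of PetscMalloc'd memory at this point */
  PetscFunctionReturn(0);
}
```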
Once the code is valgrind-clean, you can run it on the big machines; usually valgrind is available on a Linux or Mac machine, but not on the big machines.

PETSc also collects performance statistics for most of the components, and we provide a simple interface for accessing them: the -log_view option provides that. It gives you information about the timing of various routines; the flops executed in those routines (usually a manually computed count, using the function PetscLogFlops); memory usage, to some extent, because all memory allocated through PetscMalloc and PetscFree is kept track of; and also the MPI message sizes and the number of calls to MPI communication primitives. So the profile provides a glimpse of the application's behavior in all these aspects, so that one can look at it and spot any abnormality that you don't expect happening inside the application code. Users can also create their own events for the user part of the code, to get a summary of those statistics: we provide the PetscLogEventBegin and PetscLogEventEnd functions, so that you can wrap your own routines and have them listed in the summary. You can also split the application into multiple stages so that you can profile each stage separately; for example, when you're using multiple solvers for different parts of the code, you may want separate summaries for each of the solvers so that you can look at them individually.

This is an example printout from one of our applications, and it gives you a glimpse of the kind of data -log_view shows: a summary of the number of objects created, the flops, the time the application used, a performance metric (flops per second), the number of messages, the MPI message lengths, and the number of reductions in each of the stages. And this is a more detailed view of performance for a given application; it shows a lot of data, but I'd just like to point you to a few basic things. For each stage and function, it gives a summary of the number of times the function is called, the time taken, and the percentage of time in that stage. For example, if you look at one of the routines, SNESSolve, about halfway down, there are four calls to it, and it's taking maybe fifty percent of the runtime for the whole application; that means the other fifty percent of the time is taken by parts of the code that are not the linear or nonlinear solver. So when you run this and look at the statistics, if you see that the nonlinear solver is taking only a fraction of the time, it means you should be looking at optimizing something else before trying to optimize the solver; it gives you a basic idea of what to do next. It also lists the flops at the very end, and then some idea about the communication in the different stages. One of the fields distinguishes "global" and "stage": global gives you the totals for the whole application, and if you use stages, then for each particular stage it gives you the statistics within it, including the time and the percentages of flops, messages, reductions, and message lengths.
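A hedged sketch of registering a user event and stage and logging manually counted flops; the event and stage names are arbitrary, and n stands for the size of the user's loop:

```c
PetscLogEvent USER_EVENT;
PetscLogStage stage;
PetscClassId  classid;

ierr = PetscClassIdRegister("User class", &classid);CHKERRQ(ierr);
ierr = PetscLogEventRegister("UserAssembly", classid, &USER_EVENT);CHKERRQ(ierr);
ierr = PetscLogStageRegister("Solve phase", &stage);CHKERRQ(ierr);

ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
ierr = PetscLogEventBegin(USER_EVENT, 0, 0, 0, 0);CHKERRQ(ierr);
/* ... user computation; count its arithmetic manually ... */
ierr = PetscLogFlops(2.0*n);CHKERRQ(ierr); /* e.g. n multiplies and n adds */
ierr = PetscLogEventEnd(USER_EVENT, 0, 0, 0, 0);CHKERRQ(ierr);
ierr = PetscLogStagePop();CHKERRQ(ierr);
```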
Sometimes when you're running in parallel the computation routines may take more time, and sometimes the communication routines may take more, especially when you're running on a big number of nodes; you might see the cost of the communication routines go up, and then the percentage of time in those routines goes up. Some examples of routines that involve communication are VecDot, VecMDot, and VecNorm, which have reductions in them, and those can show up in the log summary. When you see that, you may have to figure out whether there's something else you should do, if that's the primary cost of the whole simulation. When you have these numbers, you have a much better idea of where to look for the bottlenecks.

As a conclusion for the tutorial: we can help you with solving algebraic and DAE problems; we hope that PETSc provides an efficient way of getting your code running in parallel; and we provide support, help, and discussions through our mailing lists. There's the petsc-users mailing list, which you can use, and petsc-maint is a way of contacting us with bugs; if you have encountered any bugs or documentation issues, you can send us those and we'll try to fix them as best we can. If you have extensions to PETSc, we gladly accept contributions; a lot of our users send us fixes as patches or pull requests, and new features as pull requests, and we gladly accept those. This is a link to the website where most of the documentation and other information is available. As mentioned, our users list is active; you can subscribe to it and monitor what kinds of discussions are going on, and we have searchable archives where previous discussions are available.

[Audience:] This is a question for you, or anybody in the audience: do you have debugging requests where people want reproducible results to help them debug?

[Satish:] Regarding reproducible results, what we tend to do is this: the -log_view output I showed has a lot more information than just the performance statistics; it shows the compiler options and so forth, and a snapshot of the PETSc configuration used for that run, so it's a snapshot that hopefully provides a way of replicating those results in the future.

[Audience:] I was referring to the fact that when things happen out of order, you get different floating-point errors.

[Lois:] Oh yes, there's definitely interest in that. Mike Heroux, who's a leader in that community, will speak during dinner on Monday about the importance of reproducibility, so we're certainly working towards that within our library, and also in collaboration with other folks and other projects. An important topic.

[Audience:] If I'm incorporating PETSc in my existing MPI code, how will it connect with my existing communicator?

[Lois:] That's a great question, and one that in a general sense we can answer here, though we encourage you to talk one-on-one with Satish or me later. As we pointed out, you as the application developer have the ability to call the PETSc library with whatever communicator you want to use, so you could use it for your whole application or for part of it; that's up to you. You provide the communicator when you create the objects, that is, when you create the vectors, matrices, solvers, and so on, so there's a lot of flexibility.

[Audience:] In the last slides you showed that you can do profiling and provide the flop count and the timing for the program; but if it's a user-provided function, for example a shell function where we're doing our own evaluation, how is that counted?
[Satish:] Then you as the application person would need to provide the flop counts, if it's not something written within the library itself, and there's an interface to support you providing that information for the user parts of the code. There's a function called PetscLogFlops: you count the arithmetic in your loops manually and then call this function to log it; that number is registered within the logging infrastructure, and the log summary picks the information up from there.

[Lois:] The profiling is intended to provide support in a way that makes it easy to understand how the various objects fit together, and that is really the focus of this aspect of profiling. It also makes sense to use complementary profiling capabilities developed elsewhere in the community for other aspects of profiling; this is really intended to give you a view of what's happening in the different objects within the overall code.
Info
Channel: Argonne National Laboratory Training
Keywords: ALCF, Argonne Leadership Computing Facility, ATPESC, Argonne Training Program on Extreme-Scale Computing, Argonne National Laboratory, ANL, supercomputing, high-performance computing, leadership-class computing, DOE LCF, DOE leadership computing, HPC, exascale computing, scientific computing, Department of Energy National Laboratories, 2016 ATPESC
Id: Knk4BqNXCao
Length: 62min 37sec (3757 seconds)
Published: Wed Sep 14 2016