Chapel: Productive, Multiresolution Parallel Programming | Brad Chamberlain, Cray, Inc.

Captions
So Rusty said most of the things I normally say on this slide, so maybe I'll just say: I don't know if any of you are Monty Python fans, but this is sort of time for something completely different from what you've been hearing about the last couple of days. I'll be telling you about Chapel. The subtitle of my talk here is "productive, multiresolution parallel programming," which maybe doesn't mean a lot to you now, but hopefully by the end of the talk it will. I'll also mention that this is about an hour-long talk, and you should feel free to interrupt me with questions, because I tend to fill every minute, and if you don't break in while I'm talking you may just not have time at the end. There's also a hands-on session this evening, and as I understand it, Chapel will be one of the three technologies you can try tonight, so there'll be another chance to dig into it a little more deeply than we can in this hour.

So with that, we'll get started. This is a slide my lawyers make me put in, which says you can't predict the stock price based on anything you're going to hear today, which is probably true.

So why are we doing Chapel? Chapel is a new language we're developing, and our motivation for Chapel is this question: can we design a language for HPC specifically, but maybe also more generally for people wanting to do parallel programming, that is as productive as Python, as fast as Fortran, as portable as C, as scalable as MPI, and then the last one here, "as fun as..."? Remember when you first started programming and you thought, wow, this is really fun? Many of us in HPC don't feel that way on a day-to-day basis; it's more like something we have to do, and wouldn't it be great if programming in HPC could be as rewarding as those first times you were writing programs? Our answer to those questions is that we believe you can create such a language. I usually put up some fake titles for my talk as well, and one of my fake titles is "Chapel: putting the 'whee!' back in HPC," trying to make HPC something exciting and fun, not just something you have to slog through as a programmer.

So then the question is: if I think we can have such languages, why don't we have them today? A reasonable guess is that maybe there are technical challenges, that maybe this is really, really hard. I think there are technical challenges, but I don't think that's the showstopper; I don't think that's why we don't have such languages today. I think the real reason is that, as a community, the HPC community has had a distinct lack of long-term efforts, of sufficient resources to develop such languages, of the community willpower to develop them, of opportunities for co-design between language developers and end users to create something that would make sense for both parties, and of patience. In parallel computing we're by nature an impatient people: we want things to run fast and we want them now, so when that new machine comes out we just want to run on it, and we're not necessarily willing to invest for a long time. That's why I think we don't have a language as attractive as the one I tried to illustrate on the previous slide, and in a nutshell, Chapel is our attempt to reverse this trend. And that takes me to my second joke title for the talk, which is "putting the 'we,' as in us, back in HPC."
What I mean there is that I think a lot of people, particularly those earlier in their careers, which I think describes a lot of you, imagine that HPC is this vast community and there's nothing we can do to change it, that things like MPI or CUDA are handed down from on high. While we are a large community, and if you go to Supercomputing you'll see that firsthand, the fact of the matter is we're not so large that it isn't possible for a group of people the size of the people in this room to make a really profound difference. For us to create a language like Chapel and have it be successful, it's necessary for more than just my team to be interested in it, to care about it, and to put things into it. So what I encourage people is: if you like what you see today, don't sit back and say "that's great, when are you going to be done with it?" Think about it more as "what can I be doing to move this open-source effort ahead?", whether that's telling more people about it, kicking the tires as a user and giving us feedback on what works or doesn't work well, or contributing back to the code.

All right, so let's get into the actual content. What is Chapel? I've already alluded to the fact that it's a parallel programming language designed for productivity, and there are a bunch of one-bullet descriptions of it: it's extensible, it's portable, it's open source, it's collaborative, and it's a work in progress. Time permitting, I've got some slides at the end that go into each of those bullets in a bit more detail, but that gives you a sense of what we're trying to build. The two main goals for Chapel are, first, to support general parallel programming, which I think of as: if you have some parallel algorithm in mind and some parallel hardware, you ought to be able to use Chapel to write that algorithm and run it on that hardware, and if not, we've failed at this goal. The second goal, as we've already discussed, is to make parallel programming far more productive.

Now, this word "productive," or "productivity," is a really loaded term, because if I asked each one of you what it meant I'd probably get a variety of different answers. So I'll tell you the way I tend to break down the answers I hear when I ask people what productivity means to them. If I talk to recent graduates, people coming out of a bachelor's or even a graduate degree, and ask what productivity means to them as far as programming languages go, often I'll get a response like "something like what I learned in school," you know, Python or MATLAB or Java, depending on their background. If you talk to seasoned HPC programmers, the answer you often get is "productivity is that sugary stuff that I don't need, because I was born to suffer": it's my job in life to do whatever it takes to get the performance, and I'm used to using MPI plus OpenMP plus CUDA or whatever. That's obviously a tongue-in-cheek answer; what they're usually really saying is "I need full control over my program so that I can ensure performance; I need to get every flop out of this and I'm not willing to give anything up." And this relates to a misconception I think we have, that if you introduce higher-level features or more programmable features, you're necessarily giving up performance. While that has often been the case, and object-oriented programming is a good example of it, I think well-designed abstractions don't necessarily force you to give up performance.
So I don't think these two things are intrinsically at odds with one another. And then finally, if we talk to computational scientists, say physicists or chemists, and ask what they want, often what they say is "I understand parallelism and I'm happy to write parallel computations; I just don't want to have to wrestle with a lot of architecture-specific details. I don't want to have to rewrite my code when a new system comes online or a new processor type comes out." And that's understandable: they want to focus on their science, not on computer science. So the Chapel team's answer to what productivity means to us is a combination of these three things: we want to design a language that lets the computational scientist express what they want at the science level, without taking away the finer-grained control that an HPC programmer would want or need, and implemented in a language that's attractive enough that a recent graduate would find it appealing. Over the course of the talk you'll see the language, and you can tell me whether or not I'm succeeding based on where you fit in this spectrum.

Before I start describing Chapel, I like to start with a little more motivation, and I'm going to give you pretty much the easiest parallel programming problem there is. This is a simple benchmark called STREAM Triad: we're just going to multiply a scalar alpha, shown at the bottom of the slide, by a vector C, add the result to another vector B, and assign it to a third vector A. This is clearly a trivial, embarrassingly parallel problem. If we wanted to parallelize it, we could chunk the vectors up, have each thread or task do a sub-chunk of the vectors, and we'd have a parallel program. That's my shared-memory view, because there's one alpha shared between all my tasks. In a distributed-memory world we might chunk it up and replicate the alpha so that everyone has their own copy; that's my cartoon for distributed memory. And of course the world we're living in today is typically a hybrid of these things, where we have distributed nodes, each of which has shared memory, so you have both distributed- and shared-memory parallelism.

All right, so again, this is about the simplest program you could imagine: it's got parallelism, it's got some locality awareness (we want all those vectors to be distributed in a similar way), so it should be really trivial to write. And it's not too bad. This is that code in C plus MPI. The computation I wanted to do, the vector scale-add, is the little green loop down in the bottom right; the red code is the MPI code, and there's not that much of it because the problem is embarrassingly parallel, so it's basically just setup and teardown; and the rest is just C boilerplate in black. So again, not too bad. If I then wanted to do the hybrid version and use OpenMP for that hybrid computing, I'd add this blue code, which marks these loops as parallelizable, and I'd get multithreading used to implement those loops. Again, not too bad and not too hard, as you'd expect. But I think the really unfortunate thing here is that for a completely trivial program, where we want to talk about parallelism and we want to talk about locality, these two programming models require us to do that using completely different abstractions, completely different concepts, completely different syntax. And that's unfortunate.
And then if we throw GPUs into the mix and write a CUDA version, the CUDA version is over here in purple: again, completely different concepts, syntax, and abstractions. And it only gets worse as we go from this very simplest computation to the actual real science you might want to do. This is what people often refer to as the alphabet soup of HPC, where you're mixing and matching these different notations together. My claim here is that HPC as a community, and we as programmers, suffer from having too many distinct notations for talking about the two key things: parallelism, which is what should run simultaneously, and locality, which is where it should run.

A fair question to ask is, well, how did we get to this state? I think it relates back to that impatience I mentioned before. In HPC we tend to build these fast systems and then we want to program them, and we approach them from a bottom-up perspective, which is completely reasonable: we say, this system has these capabilities, what software do I need to access those capabilities? Then I can run on the machine and get the performance I need, and then we sort of stop. We never keep going bottom-up until we get to high-level abstractions, so when it comes to things like portability, generality, and programmability, we're usually willing to throw some of that away at some point. That's not to say we don't have any portability, but again, why are we not using MPI on GPUs? Well, it wasn't designed for that.

So we end up with this table, where in the left column I have different types of hardware parallelism you might want to target (cross-node parallelism, intra-node parallelism, vector parallelism, and so on), in the middle I have some of the programming models that in practice we use to access that hardware, typically different models for different pieces of hardware, and on the right I've got different units of software parallelism: is our unit of parallelism the executable, as in an SPMD model, or is it an iteration of a loop? The point is that if you want to target multiple types of hardware parallelism, or implement multiple styles of software parallelism, you typically find yourself mixing and matching these programming models. That works, but it seems unfortunate to me. There are benefits to this approach: we get a lot of control, decent generality (if the machine can do it, we can do it), and these models are typically reasonably easy to implement; that's not to say they're trivial, but because they're lower level, closer to the machine, there's not as much software to be written to get them working. The downsides, though, are that the user ends up managing a lot of detail, and the code tends to be fairly brittle to changes, whether those are changes in the algorithm (say, going from a 2D to a 3D algorithm) or changes in the architecture.

So if we go back to this slide where I said HPC suffers from too many different ways of talking about parallelism and locality: if I were completely honest with you, I'd say let's just stop, but of course, being a language person, I'm going to throw in one more language. Let's use Chapel. The key is that I'm not saying let's use Chapel and all these other things; I'm saying let's just try using Chapel. So this is the STREAM Triad code in Chapel, and I haven't left anything out here; this is the full program.
You'll see it's much shorter than the other codes we were looking at, and that's just by virtue of the fact that it's a more modern, higher-level language. But the other interesting thing about it is that this one code can be run serially, in shared memory, in distributed memory, in that hybrid shared/distributed mode, or on an accelerator; it's designed in a way that's very independent of how it's mapped to the architecture and what kind of architecture it runs on. The key is this dmapped clause, from which I've elided one expression (it really is just one expression). This clause says how to map the computation down to the system, and if I want to change the system or change the mapping, I can change that one clause, and all of the science, which in this case is the vector multiply-add, remains independent of that change; it just follows along. Over the course of my talk I'm going to build up to this, and by the end you'll understand all these concepts, but for now I'm just going to throw it at you and move on. The philosophy is that with a top-down language design like Chapel's, where we say we want to talk about parallelism, we want to talk about locality, we want to talk about how to map those to the machine, and then we'll worry about the details of mapping to all the different architecture types, we think we can come up with a much better language, one that teases these important details apart into the separate camps where they ought to be: the algorithm person can work on the algorithm, the HPC person can work on the HPC things, and the compiler and runtime can work on the things they can handle, and hopefully we'll all be happy. That's the vision.

All right, so I've just given you some motivation for Chapel. Next I'm going to give you a quick survey of Chapel concepts, a brief tour of the language. In an hour I can't give you a full, detailed description of the language (a full day is a better amount of time for that, or even two days), but for now I'll give you a flavor, and hopefully you'll know enough to be dangerous with Chapel as you leave the room.

Before I actually start showing you features, let me define one of the terms that was in my title: "multiresolution," which we don't often use in programming-language contexts. What we mean by this is that in the past there have been other high-level languages, and the problem many of them have had is that when the abstractions work for you, you're happy, and when they don't, you're stranded far away from the machine with no recourse. The thing we did with Chapel was to say: we want these high-level features, but we also want a way to get down close to the machine when necessary. So if those high-level abstractions fail you, either because they tie your hands in a way you don't like or because you're not getting the performance you want, you can drop down to lower levels and get closer to the machine as needed. The idea was to build a language with multiple tiers, some higher level and some lower level, in which you can move between levels from one function or statement to the next. And not only that, but the highest levels of the language are actually implemented in terms of the lower levels, which guarantees that they all interoperate and cooperate well, so you as a user can write those highest-level things yourself; you could write your own array distribution, for example, and that's something we'll build up to over the course of the talk.
So I'm going to use this conceptual stack of features as my roadmap as I walk you through the language, and I'll start with the lower three tiers, which we call "lower-level Chapel," although it's still quite high level compared to, say, C or pthreads. Specifically, I'll start with the base language. You can think of the base language as what you'd be left with if you took Chapel and ripped out every feature related to parallelism and every feature related to scalability. Think of it as the serial C or Java or Fortran on which Chapel is based, except that rather than basing it on any of those languages, we started from a blank slate, although we did take a lot of influence from other languages, of course.

I'm going to show you an example and point out some features as I go. This example implements a Fibonacci iterator and then loops over it, printing out the numbers. Over here on the left you can see the Fibonacci iterator itself. These are what we call CLU-style iterators (CLU was one of the first languages to have them); if you're a Python programmer you probably think of it more as a Python generator. If you're not familiar with either of those concepts, it's a lot like a normal function or procedure, except that rather than returning a single time, you've got this yield statement, which hands a value back to the call site but then logically continues executing. So when I invoke it in a loop, as over here on the right (for f in fib(n)), I'm basically telling this iterator "give me n values," and when the iterator falls out, when it returns, I exit the loop. In the black console you can see the output, which is the first few Fibonacci numbers.

So we think these iterators are really nice; we think every language should have them, and we've got them. Another thing you'll see on the right is this thing called a configuration constant. A lot of our declaration styles can have this "config" put on the front, and it gives you automatic command-line parsing for that variable. Here you can see I'm initializing the variable to 10 in the code, so I get 10 Fibonacci numbers when I run it, but because I put that config there, when I run the program (as you can see in the blue box at the top) I can override that default and say, let's make n equal to a million, and then I'll get a million Fibonacci numbers out. The goal is that hopefully you'll never have to write command-line argument parsing again, or at least only in very extreme situations.

Another thing that is hopefully obvious is that I haven't declared the types of my variables, my arguments, or the return type of my iterator. You can do all of those things (I could say n is an integer, current is an integer, next is an integer), but we also allow the compiler to infer them. It's very important to point out that, unlike a scripting language such as Python or MATLAB, the compiler is still figuring out statically what the types of all these expressions are, so we're not paying any execution-time cost for dynamic typing or for determining the types, and we don't run into any of the safety issues that dynamic typing would give us, either.
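For reference, here is a minimal sketch of the kind of code being described, an iterator plus a config constant with all types left to inference; the exact loop body is a reconstruction rather than a copy of the slide:

    // A Fibonacci iterator: yield hands a value back to the call site,
    // then the iterator logically continues executing.
    iter fib(n) {
      var current = 0,
          next = 1;
      for 1..n {
        yield current;
        current += next;
        current <=> next;   // swap operator
      }
    }

    // A configuration constant: the default of 10 can be overridden on
    // the command line, e.g.  ./fib --n=1000000
    config const n = 10;

    for f in fib(n) do
      writeln(f);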
It's just a convenience mechanism: if I'm passing an integer into this thing, the compiler should be able to figure out that it takes an integer, so why do I need to tell it? That's one side of the argument; the other side, of course, is that specifying types helps with documentation and keeps your interfaces firm. You can declare types if you want to; here I'm just showing that you can also leave them out. When you do leave them out, the compiler basically reasons: I see that n is initialized with 10, and I know 10 is an integer, therefore n is an integer; I see n being passed to fib, which has an argument named n, so that formal argument must be an integer; I can then flow through and see that current and next are integers; and since I'm yielding current, this thing yields integers, so f back on the right is an integer. It just follows the flow of your program to figure out what the types are. Again, you could also assert all of those types in the code, or some of them and not others. In my slides I usually leave the types out, both to show off the inference and because it takes up less space.

The next thing I'm going to do is change my invocation of fib a little bit: I'm going to throw in the zip keyword. This is a pretty important keyword in Chapel; it says "iterate over multiple things simultaneously, such that the corresponding iterations line up." So here I'm iterating over a range, 0..#n, and my Fibonacci iterator. Oops, and I forgot to update the console output; I'm sorry, I was editing my slides too late at night. The console output should have changed to say "fib #0 is 0, fib #1 is 1," and so on; I made a cut-and-paste error, apologies. Anyway, that's zippered iteration across multiple things.

You also saw ranges as I went; they're useful for representing regular sequences of integers. I've got 1..n over on the left; on the right, that 0..#n is an instance of the count operator: what I'm actually saying is start from zero, count toward infinity, but then give me only the first n elements of that range. So it's a cute way of writing 0 through n-1, basically. You also see we've got tuples: the result of a zippered iteration is a tuple of values, and I'm capturing that here in the tuple of variables (i, f). We also use tuples to represent multidimensional array indices and to return multiple values from procedures.
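Putting the zip keyword, the count operator, and tuple destructuring together, a small sketch (assuming the fib iterator from above, and showing the output the slide should have shown):

    // Zippered iteration: walk 0..#n (the first n indices, i.e. 0 through n-1)
    // and the fib iterator in lockstep; each iteration yields a tuple,
    // destructured here into (i, f).
    for (i, f) in zip(0..#n, fib(n)) do
      writeln("fib #", i, " is ", f);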
So that's a short example of some of the base-language features. There are a ton of others I can't get into today due to time: interoperability with C, and with MPI in particular; object-oriented programming; overloading; where clauses; yada yada yada; hopefully everything you'd like in a modern programming language.

All right, so that's the base language. Moving on, let's look at some of the parallelism. We have two styles of parallelism in Chapel: lower-level task parallelism, which I'll talk about next, and higher-level data parallelism. For us, task parallelism is saying "create a task to do this, create another task to do that," while data parallelism is more like "for all the elements in my array," or "for all the indices in this space, do something."

For task parallelism, there are three different ways to create tasks, and we'll see two of them today. The simplest is just to tack this begin keyword onto the front of a statement, and that says "create a task to execute that statement, while the original task continues executing." So here I'm saying create a task to print "hello world"; the original task keeps running and prints "goodbye." I haven't done anything to coordinate or synchronize between these tasks, so they could execute in either order, and I may see "hello world" then "goodbye," or "goodbye" then "hello world." One thing I won't see is interleaving, like "h-e-g-o-l-l," where the messages get mixed together, and that's because our writeln routine is basically thread-safe, task-safe.

So that's the simplest way to create tasks, and it gives you a really unstructured way to create what we call fire-and-forget tasks: just spin things off to go run. We also have more structured ways of creating tasks. One we use a lot is the coforall. It's a lot like the serial for loop we saw before, but what a coforall says is "create a distinct task for every iteration of the loop." So if I have four iterations, I end up with four tasks, each of which executes one iteration of the loop body; here I'm just using it to print another little hello-world message, "hello from task 3 of 4" kind of thing. One thing about the coforall, unlike the begin, is that because it's structured, when you get to the end of the coforall loop you wait until all the tasks you created from that loop complete before you go on. Again, because I haven't coordinated between these tasks, their output may come out in any order (2 shows up before 0 here, for example), but that "all tasks done" message won't print until all the tasks have printed their messages, because of the implicit join at the end of a coforall.

So those are two of the three ways of creating tasks; the third is basically a compound statement, and in the interest of time I'm skipping past it. One other thing I want to say about task parallelism: in my simple examples I just create tasks that print, but of course in real programs tasks often need to coordinate with one another. The two ways tasks coordinate in Chapel are atomic variables, which are a lot like the atomics in C and C++, and sync variables, which are maybe a little more unusual. Sync variables are basically normal variables that store a full/empty bit along with their state; a sync integer, say, stores an integer value plus this full/empty bit. Reads on sync variables block until that full/empty bit is full, and then leave it empty; writes block until it's empty, and leave it full. So it gives you a way of doing producer/consumer-style synchronization between tasks, and it makes it really easy to write a bounded buffer, for example. That's how our tasks coordinate with one another and share data. This slide is a laundry list of other task-parallel concepts: cobegins are the third way of creating tasks that I mentioned, single variables are a variant of those full/empty variables, and there are a few statements used to suppress parallelism or to synchronize between tasks. So that's the lower-level way to create parallelism in Chapel.
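A small sketch of those three pieces: a fire-and-forget begin, a coforall with its implicit join, and a sync variable used as a producer/consumer handshake. The explicit writeEF/readFE method calls are one way to express the full/empty operations; treat the details as illustrative rather than as the slide's exact code:

    config const numTasks = 4;

    // Fire-and-forget: the two messages may print in either order,
    // but writeln is task-safe, so they won't interleave mid-line.
    begin writeln("hello world");
    writeln("goodbye");

    // One task per iteration, with an implicit join at the end of the loop.
    coforall tid in 1..numTasks do
      writeln("Hello from task ", tid, " of ", numTasks);
    writeln("All tasks done");   // prints only after all numTasks messages

    // A sync variable stores a full/empty bit along with its value:
    // writes block until it's empty and leave it full; reads block
    // until it's full and leave it empty.
    var result: sync int;               // starts out empty
    begin result.writeEF(42);           // producer task fills it
    writeln("got ", result.readFE());   // consumer blocks until it's full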
OK, so next we're going to talk about locality control. Locality is all about where tasks should run on the machine and where data should be stored on the machine, and you can imagine that in general you're going to want some sort of affinity between your tasks and your data; locality is all about that. The key feature for locality in Chapel is a type we call the locale, which is admittedly a little circular. For all intents and purposes, think of a locale as a compute node: you're running on some large machine, and each compute node is a locale. Locales support thinking about "here" versus "there": if two things are on the same locale, it's fairly cheap for them to coordinate or communicate with one another; if they're on different locales, it's more expensive, because you're going across a network. That's the whole point of the locale concept.

When you run a Chapel program, you specify the number of locales on the command line; here I've shown the long and short forms of saying "I'd like to run on eight locales," and what that means is: go out, give me eight compute nodes, and get my program spun up and running on those eight compute nodes. Within the text of your program there are a few built-in variables you can use to refer to the locales you're running on: numLocales, which is just an integer saying how many locales you're running on, and, more importantly, this Locales array, an array of the locale type with a one-to-one correspondence between the compute nodes you're running on and the abstract locale type built into the language. So you have a first-class way of referring to the machine resources your program is running on within the text of the program itself, and we'll see why in a few more slides. One other thing you need to know on this slide: when you start running your Chapel program, we're not in an SPMD model like MPI, or UPC or Coarray Fortran if you've used those. A Chapel program starts logically with your main procedure running as a single task on locale 0, and if you want to use other locales, you have to use language concepts to spread your computation out to them; we'll see those in just a second.

So with these locales, what can you do? One thing is you can introspect about the machine you're running on: given a locale value, you can ask things like how much memory it has, how many compute cores it has, or what its ID or name is. Anything you want to know about the machine you're running on, you'd do through this locale interface. The other thing we use locales for is to move computation around. The on-clause is the primary way of migrating computation across the machine. My program here starts running, conceptually, on locale 0, so that first writeln happens on locale 0. Then you see this on-clause that says "on Locales[1]": I'm indexing into that Locales array, going to the next locale, and that migrates my task over to that locale logically, so the next writeln is printed from locale 1. When I leave the context of that on-clause, I pop back to the original locale, so the third writeln takes place on locale 0. So this is a way of moving computation around a distributed-memory machine. It's a silly, artificial example; in practice you wouldn't normally say "run on locale 13," and of course there's brittleness in doing that, because if you run on fewer locales than that, you'll get an out-of-bounds error on the array access.
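A sketch of those locale queries and the on-clause; the particular queries shown (here.id, here.name, here.numPUs, here.maxTaskPar) are illustrative choices, and the last on-clause uses numLocales-1 rather than a hard-coded index to avoid the brittleness just mentioned:

    // Run with, e.g.:  ./hello -nl 8     (request eight locales)
    writeln("running on ", numLocales, " locales");

    // Introspect on each locale in turn.
    for loc in Locales do
      on loc do
        writeln("locale ", here.id, " (", here.name, "): ",
                here.numPUs(), " cores, up to ", here.maxTaskPar, " tasks");

    // An on-clause migrates the current (single) task; this is serial code.
    writeln("starting on locale ", here.id);    // locale 0
    on Locales[numLocales-1] do
      writeln("now on locale ", here.id);       // the last locale
    writeln("back on locale ", here.id);        // locale 0 again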
So typically what you're going to do instead is use a more data-driven style of on-clause. If you put any expression after the on-clause, it says "go run wherever this expression is." For example, if I index into an array, A[i,j], the on-clause says: wherever that element A[i,j] lives, go over there and run this big computation; it's expensive, but I want to run it near that element. Or if you're searching a graph or a tree, you might say "go wherever the left child of my node is, go over to that locale and continue searching there." By using these data-driven on-clauses you become independent of the number of locales: as long as you've spread your data out intelligently across the locales (and we'll see how you do that as we go on), you're basically saying "just go wherever that variable is; that's where I want this task to be."

Something I hope is obvious, but I think often isn't, because we don't see it in many of the programming models we typically use in HPC, is that in Chapel, parallelism and locality are completely separate concepts in the language. This coforall loop, like the one I showed you earlier, is a parallel construct: it creates parallelism, it creates tasks, but it's a completely local program; nothing about it says to run anywhere other than here, so by default all those tasks run on the same locale as the original task. Similarly, code that uses on-clauses, like what we just saw, moves a task around the machine but is a completely serial program; there's no parallelism, it's just "print something here, then go over there and print something, then go over there and print something." And of course the idea is that you can mix these things together: I can write a coforall and then use an on-clause inside it, so that each task goes to a different node, and then we have both parallelism and distribution at the same time. I think this is key, because if you think about it, parallelism (what should run simultaneously) and locality (where things should execute) really are orthogonal concepts, and I think it's unfortunate that the SPMD model has put us in a world where the only way of talking about parallelism is to create another image of your program, which is also your unit of locality; the two things are just bound together, unless you start mixing in OpenMP or pthreads, but then you're back to mixing models together, which is what we're trying to avoid. So in Chapel we said these are two separate things; let's use two different sets of language features to address them.

All right, so I mentioned something about "go where the data is," and somehow the data has to get somewhere else, so let's talk a little bit about that. As you declare variables in Chapel, they're allocated within the memory of the locale on which your task is running. So if my program starts running on locale 0 and I declare an integer i, it's allocated in locale 0's memory, as you can see in the picture at the bottom. Then, if I use an on-clause to go over to locale 1 and declare another variable j, it's allocated in locale 1's memory, because of course that's where the task is running now.
Then I can use this coforall-on idiom, "coforall loc in Locales, on loc": create a task per locale and move each task to its respective locale. Now I've essentially created an SPMD-style loop within my Chapel program. If I then say "give me a variable k," each of those tasks sees this declaration and allocates its own k, so we end up with a copy of k per locale, where each task refers to its own copy, because that's the only one it can see, lexically speaking. Within this loop I can do things like k = 2*i + j, and of course I'm still within the coforall, so each task executes this on its own locale. The point here is that, looking at locale 3 for example, its task is going to run k = 2*i + j, and of course i and j are remote, but it's OK to access them even though they're not on your locale. Chapel is part of the family of what are called PGAS languages; I don't think anyone has lectured on those yet this week, but maybe you've heard about them back home. The idea is that if you can name a variable, if you can refer to it through lexical scoping, then you can access it, whether it's local or remote, and it's the compiler and the runtime that implement that communication for you. In this case, because locale 3, and in fact all of the locales, refer to i and j, the compiler and runtime have to make sure that copies of i and j are brought into locale 3's memory so we can actually do that operation. In practice that's either done in a demand-driven way, by going and getting the value at the time it's needed, or it can be done more optimally; in this case, i and j would actually be forwarded as we created those tasks and spread them out across the locales, sending copies of i and j along with them to avoid the communication back. But the main point is: if you can see something in your lexical scope (which, if you're not a big programming-languages person, just means if you can see it by looking up the scopes of your program, like you normally would in C), then you can refer to it, regardless of whether it's local to your locale or not. That's both a great convenience, because you can access anything you can name, and a potential big performance problem, because if you're not careful, you could constantly be referring to things that are remote and racking up a lot of communication.

Before I go on, there are two other things I want to say about locality. One is that there's a built-in keyword called "here," which evaluates to the locale on which the current task is running. That's a way, if you lose track of where your task is running and want to ask "where am I on the machine?", to find out which locale you're on. The other is that, given any variable, you can ask which locale that variable lives on just by applying the .locale method to it, and using that you can say things like "if here == j.locale, then do one thing; otherwise, do something else."
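A sketch of the coforall-on idiom and of the lexical-scoping rule just described, with the here and .locale queries thrown in; the k = 2*i + j line comes from the slide, while the surrounding scaffolding (the variable values, the modulo guard for single-locale runs) is mine:

    var i = 2;                      // allocated on locale 0, where main starts

    on Locales[1 % numLocales] {    // hop to locale 1 (or stay put if only one)
      var j = 3;                    // allocated in this locale's memory
      writeln("i lives on locale ", i.locale.id,
              ", j lives on locale ", j.locale.id);

      // SPMD-style idiom: one task per locale, each moved to its locale.
      coforall loc in Locales do
        on loc {
          var k = 2*i + j;          // i and j may be remote; the compiler and
                                    // runtime handle that communication
          writeln("locale ", here.id, ": k = ", k);
        }
    }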
All right. So I mentioned that you don't see the communication in Chapel programs, what we call implicit communication: you just refer to things and the communication happens, and again, that's a double-edged sword. The nice thing about Chapel is that the semantic model is very explicit about where data is placed and where tasks execute; we don't move things around on you magically under the covers, because we don't really trust compilers and runtimes to do that very well. So if you understand the rules I've given you, you should know exactly where all your data is and exactly where all your tasks are, and I think that's an important property in a programming language. The second thing, of course, is that if you don't want to reason through it all, you can use those execution-time queries I mentioned to figure out where a task or a piece of data is. And the third thing is that I think tools are an important part of the story as well. We have a tool called chplvis, for example, where you can bracket a section of code and see things like "what communication am I doing here?" or "what tasks am I creating?" So if you don't want to reason through it with the semantic model, or there just seems to be a big bottleneck and you don't understand what's going on, tools like this can help you figure it out.

All right, so that's your introduction to lower-level Chapel: the base language, the task parallelism, and the locality features. Let me just pause and see if there are any questions, because you're so quiet you're making me nervous, although I see a lot of eyes on me, which is encouraging. Yes? ... The question was, can you have an array of synchronization variables, and the answer is yes. Synchronization variables can be composed into arrays, as array elements, or as fields of classes or records, and so on. In fact, I mentioned briefly that they're a great tool for bounded-buffer types of problems; the way we actually do that is to create an array of synchronization variables, and all those error cases you normally have to deal with manually in a bounded buffer, like "is my producer so far ahead that it's wrapping around on itself?" or "is my consumer trying to consume things that aren't produced yet?", the full/empty bits make all of that just go away. It works really, really well, so an array of synchronization variables is actually a very common idiom. Other questions? Yes? ... You mean a variable with the same name on two different locales? OK, so let's go back up to the slide that did that. Here, for example, you could argue that I have five variables named k, so do we get confused about them? This is a really common question; in fact, I think I get asked it almost every year, which maybe means I'm not explaining it well. The key thing to realize is this: if you don't think about the picture too hard, if you just looked at the code, and in fact if you even ignored the constructs you're less familiar with, thinking of the coforall as a for loop and ignoring the on-clause, then normally, in a C code, you wouldn't ask "how do I refer to the other iteration's copy of k?" You'd just look at it and say, well, every iteration has its own copy of k, and I can only refer to the one I can see in my lexical scope, end of story. It's the same thing here: even though we have parallelism, with multiple tasks executing at the same time, each one only knows about one k, the k in its lexical scope. The only way I could know about your k would be for me to be able to see it, and the way that might happen is, for example, the way I can see i and j: they're in my lexical scope even though they happen to be somewhere else.
So the only way you'd be able to see another k is if there were two declarations of k in your lexical scoping, but if that happened, the normal shadowing rules apply, and it's only the innermost one you'd see. So it really isn't a problem, and more than that, I'd say it's completely intuitive: what you would expect to happen happens, and you don't run into those kinds of challenges. Was there a question over here as well? ... "Are multiple cores being used?" So I think where you're going is: how are these coforalls actually implemented, how are they mapped down to the system? What the language defines is tasks, these units of parallel computation, and typically we're mapping those down to threads. There are actually a number of options available to you. One model, for example, is that each task runs on its own pthread, so if you have as many tasks as there are cores, and your operating system spreads out the pthreads well, you'd end up using all the cores locally. Another model we use is user-level tasking, where we switch between tasks within a pthread; that has lower overhead, and we can also get better locality benefits. In that model we're still running on POSIX threads, but we're doing user-level multiplexing of the tasks across those threads. Those are different options available in the Chapel runtime; the language itself says very little about how tasks are mapped to the machine. So to get the full story, if you really want to understand exactly how these tasks get down to the machine, you have to think about which of these tasking implementations you're using and what semantics it guarantees. Yes? ... So we have a custom compiler for Chapel; it actually compiles down to C, and then we have runtime libraries to provide things like the tasking and the communication. We've architected it so that, for example, if you had your own tasking library that you thought did a much better job than pthreads or the ones we're using, you could plug your own tasking library into the runtime, and it basically has to answer things like "how do I create a new task?" and "how do I synchronize between tasks?" Sorry, if which compiler introduced new features? ... OK, so the question was: if the back-end compiler, like the Intel compiler, introduces new features, will we benefit? We typically do source-to-source compilation, so we generate C code, and to the extent that the back-end compiler can optimize our generated C code, we benefit from those things. We also help the back-end compiler sometimes; we often emit into our generated code things like "we know this loop is vectorizable, please vectorize it for us."

All right, so next we're going to pop up to the higher-level features of Chapel: data parallelism and these domain maps, which I'll define as we go. For data parallelism I'm again going to work by example, and I'm going to go back to that STREAM Triad computation I showed you at the beginning, so that now you know enough to see all of its features. The first thing I'm declaring here is something we call a domain. A domain is kind of a unique feature in Chapel; it's basically a first-class language concept that represents an index set, and here it represents the indices 1 through m.
I've drawn a picture here that makes it look like I've basically declared an array, but it's crucial to understand that this isn't really an array; it's just the indices you might use to create an array, or to drive a loop iteration, or something like that. One of the things we use these domains for is to declare arrays. Here I'm saying give me three arrays, A, B, and C; the square brackets say "this is an array," ProblemSpace says "create these arrays so that they span all of the indices defined by this domain," so 1 through m, and real says every element of the array should be a real floating-point value. So I've taken that domain and created three arrays that share its index set.

Now we can use the key control feature in the data-parallel part of the language, which is the forall loop. A forall loop, like a coforall, is parallel, but unlike a coforall it doesn't literally create a task for every iteration. What a forall says is: create some tasks, do this execution in parallel, and use an appropriate amount of parallelism, where "appropriate" is typically proportional to the number of cores you're executing on. So if you're on a four-core system, use four tasks to do this loop; I'll get more into that in a bit. You see I'm also using zippered iteration here: do a parallel zippered iteration across these three arrays, doing the scale-add across the elements. There's one other way I could write this, which is slightly nicer: I can use whole-array operations in Chapel and just say A = B + alpha * C, and this is semantically equivalent to that zippered forall; it's just an arguably nicer way of writing it.

So that gives you a lot of the core data-parallel features in the language: domains, arrays, and forall loops. There are a bunch of others that I never have time to get through; the data-parallel section is one of the bigger parts of the language, but just to give you a quick survey of what's there: in this example I'm only using simple one-dimensional arrays, but Chapel has a really rich set of array types, including multidimensional arrays, strided arrays, sparse arrays, and associative arrays, which give you a hash-table or dictionary-like concept. I won't get into many of those today, but it's important to know that this rich array-computation fabric is out there. We have array slicing, which is a way of referring to a sub-array using ranges or domains; here, for example, I'm saying give me a subset of A's elements, either defined by these ranges I've inlined there, or maybe by a domain I've declared called elementsOfInterest that stores all of the indices I care about. Just to give you a sense of how rich this can be, say elementsOfInterest were one of those sparse domains and A were a dense array: you could set up a sparse set of indices and say "let me refer to just the elements of the array that those indices correspond to," all with a really compact expression. We have promotion, which is the idea of taking a function that was designed for scalar arguments and passing array arguments in, which gives you the equivalent of doing a forall loop over that function call. And then we have reductions and scans; I think you've heard about reductions in probably every programming model you've heard about this week, and we've got them as well. You can also write user-defined reductions, so if you're coming from a MapReduce kind of world, you can write your own reduce operators. So that's data parallelism in a nutshell.
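Pulling the data-parallel pieces together, a local (not yet distributed) Triad-style sketch; the value of alpha and the array initializations are placeholders:

    config const m = 1000,
                 alpha = 3.0;

    const ProblemSpace = {1..m};        // a domain: a first-class index set
    var A, B, C: [ProblemSpace] real;   // three arrays sharing that index set

    B = 2.0;                            // whole-array (promoted) assignments
    C = 1.0;

    // Data-parallel zippered iteration: "some appropriate number of tasks,"
    // typically proportional to the number of cores.
    forall (a, b, c) in zip(A, B, C) do
      a = b + alpha * c;

    A = B + alpha * C;                  // the equivalent whole-array form

    writeln("sum of A = ", + reduce A); // one of the built-in reductions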
Again, that was a quick tour, but we'll actually see more data parallelism in this next section. Let's talk about domain maps. Domain maps are all about how we map these domains, arrays, and data-parallel computations down to the system. Given the STREAM Triad as I've shown it so far, it's a completely fair question to ask how this actually runs on a system, and the answer is that I haven't said anything about how this domain is implemented, so like all other variables in Chapel it defaults to being allocated on the current locale my task is running on. What that means is that those arrays are all local to my locale, and the forall loop uses only resources local to that locale. If I'm running on a four-core compute node, for example, I end up with these arrays in local memory, and when I hit that forall I create four tasks, each of which does a quarter of the work.

Now, this gets back to the teaser I gave at the beginning of the talk. I can throw this dmapped clause onto my domain, and it says how to map that domain, and its arrays, and loops over it, down to the architecture. Here I'm filling in that expression I left out earlier, and I'm saying: map it down in a cyclic manner, using a start index of 1. What that does is take the entire 1D index space, which I've drawn at the top of the slide, take the start index I passed it, and start dealing indices out round-robin, starting with locale 0 at that index. So what I get is a cyclic partitioning of the entire 1D index space, which implies a cyclic partitioning of my domain, which implies a cyclic partitioning of the arrays declared over that domain, which implies a cyclic partitioning of the work when I do the forall loop at the bottom. And I should mention, even though I don't have a good way of drawing it here, that each compute node now owns a fraction of the work, and it's not only going to do its fraction of the computation, it's also going to use multicore parallelism to get shared-memory parallelism within each compute node, so this is hybrid parallelism, both across compute nodes and within them. Now, if I decide that cyclic really isn't the right thing for my problem, as it actually isn't here, because you really would like to take advantage of some locality, I can swap in a different domain map; in this case, the Block domain map. The Block domain map is characterized by a bounding box, and it partitions that bounding box across the locales; that implies a partitioning of the domain, which implies a partitioning of the arrays and of the loop, and again I end up with a hybrid where I have distributed-memory parallelism at a coarse grain, and each of those locales also uses fine-grained parallelism within itself to handle the local elements it owns. So this takes us back to the beginning of the talk, where I said: look at this nice short code; if I just change that one clause, I can end up with very different implementations of it. Now you've seen an example of that, and you've seen enough of the language to have a sense of how it works.
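And the distributed variant, with the dmapped clause filled in. This uses the Block/Cyclic syntax from Chapel releases of roughly this talk's vintage (the distribution modules have evolved since), so treat it as a sketch:

    use BlockDist, CyclicDist;

    config const m = 1000000,
                 alpha = 3.0;

    // Swap which of these two lines is commented out to change how the
    // indices (and hence the arrays and the forall work) map to locales;
    // nothing else in the program needs to change.
    const ProblemSpace = {1..m} dmapped Block(boundingBox={1..m});
    //const ProblemSpace = {1..m} dmapped Cyclic(startIdx=1);

    var A, B, C: [ProblemSpace] real;

    B = 2.0;
    C = 1.0;

    A = B + alpha * C;   // coarse-grained parallelism across locales,
                         // multicore parallelism within each locale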
Just to build on this very simple example: we have a number of different array and domain types, and each of these can be distributed across locales. Up here in the upper right I've got a sparse domain and sparse array; you could use some sort of recursive bisection to distribute that across locales, for example, and computationally you'd operate on it just like you would any other array.

So let me show an example in a real code. This is LULESH, one of the DOE proxy applications and one of the early proxy applications we studied in Chapel. This is the Chapel version of it, and as you can see it's amazingly elegant and clear, and you probably understand every line of this code now... OK, I'm obviously joking. What I can say at a high level is that it's a reasonably compact code, about a fourth the size of the C plus MPI plus OpenMP reference version, and in fact ours is a little more capable in some ways. But here's the more important thing: our code supports really drastic decisions about things like "do you want to use a structured or an unstructured mesh for this computation?", "do you want to run this locally or in a distributed manner?", and "do you want to use sparse or dense arrays to represent the materials?", and all of those choices are implemented using the very small number of yellow lines of code I've highlighted here. That's because of these domain maps: we can make these very important decisions about data structures and how to map them to the architecture in a very small number of lines of code, and all the rest of the code is basically physics. We talked earlier about application scientists wanting to just get their science done and not mess around with the machine a lot; to me this is a good indication of how we think Chapel helps. By restricting the decisions about how to map to the machine to a small amount of the code, we keep a lot of the algorithm independent of those decisions, and again, domain maps are the key.

If these domain maps aren't clear to you, essentially what they are is recipes. Conceptually, in our minds, we have this high-level, global view of a computation, like "for all elements in these vectors, do the multiply-add-assign," and the domain maps say how to take that high-level computation and map it down to the distributed memories and the multiple cores we actually have on real systems. There are three key things to know about domain maps. One is that when you download Chapel, there's a library of domain maps that comes with it, things like the Block and the Cyclic you've seen here. The second is that end users like yourselves can actually write your own domain maps in Chapel; that's not to say it's easy, but it's possible. So if, for example, your computation wants a very specific distribution, you've developed some heuristic way of distributing your data and we haven't anticipated it, it's not in our library, you could write your own domain map, dmap your arrays using your distribution, and in fact you could then contribute that domain map back to the code base so others could benefit from it as well. The third thing to know is that all of the standard domain maps we provide are written using the same framework you would use as an end user, and we've done this to eat our own dog food and make sure we don't set up a performance cliff where built-in things work well and user-defined things work terribly.
We've written them that way to eat our own dog food and make sure we don't set up a performance cliff where built-in things work well and user-defined things work terribly. You could argue we've made everything work terribly, but we've basically been working to make everything work better and better over time.

All right, so that's domain maps. I'll just mention in passing that there are two other thematically similar features in the language. You can define your own parallel iterators, which make statements about how a forall is mapped down to the architecture; if our defaults don't make sense for your computation, you can define your own ways of implementing forall loops: how many tasks to create, where those tasks should run, how to distribute the iterations across the tasks. You can also define your own locale models, which are basically abstract representations of the target machine: what is the architecture I'm mapping to, and how do I map tasks, memory, and communication down to it? So if you developed a new architecture that we'd never seen before and you weren't willing to work with us on it, you could define your own locale model and basically get Chapel running on it, again by writing Chapel code, without going in and modifying the compiler. That is, in a sense, what I meant at the beginning when I said Chapel is extensible: you can write your own array implementations, your own parallel iterators, your own architectural models, and we think this is crucial for a language to be future-proof. The problem we keep having is that architectures change, and then we have to develop new programming models, or our programming models have to change. With Chapel we've tried to design something where, again, you express the parallelism and locality top-down, say what matters and how it maps to the machine, and then let people map to the machine, to any machine, any way they want. That's the vision here.

So, to summarize the language: I think HPC programmers, you and me, deserve better programming models. I think higher-level programming models like Chapel can really help insulate algorithms from parallel implementation details, as we saw in the LULESH example, and yet in a way that doesn't abdicate control. You're not just saying "trust the compiler or the runtime to do magical things"; you can still reason about every step of the way, because it's all built within Chapel, on these same building blocks. As a result, we think Chapel can greatly improve productivity, both for current and emerging HPC architectures and also for people outside the HPC mainstream: hobbyist programmers, data analysts, anybody who cares about parallel programming at scale.

And Chapel is portable. A lot of people, when they hear about Chapel, assume it's Cray-specific. Cray definitely is leading the implementation and the design, but we're doing it in a way that is designed to be very, very portable. For the implementation to run you need a C and C++ compiler, a UNIX-style environment, POSIX threads, and some way of communicating, which could be RDMA, MPI, or UDP: basically things you've got on almost any system you've ever run on. As a result, Chapel can run on laptops and workstations, commodity clusters, the cloud, Cray systems, those from our competitors, and modern processors like Intel Xeon Phi and GPUs. It's open source, all of the development is being done on GitHub, it's licensed as Apache 2.0, and there are instructions for downloading and installing.
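As a tiny sketch of that portability point (my own illustration, not a slide from the talk): the same source file runs on one locale, say a laptop, or on many, with the number of locales chosen at launch time (for a multi-locale build, typically with a flag such as -nl 4).

    // Each locale prints a line; `here` is the locale the current task is on.
    coforall loc in Locales do on loc do
      writeln("Hello from locale ", here.id, " of ", numLocales,
              " (", here.name, ")");

On a single-locale build this simply prints one line; nothing in the code itself changes between the two settings.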
And this is a picture of the Chapel team at Cray: there are currently 14 full-time employees working on Chapel, and this summer we've also got three visitors. It's a collaborative effort, so we have a number of colleagues in academia, national labs, and industry who are also working on Chapel-related projects, and it's a work in progress. If you had, say, one hour to devote to Chapel after this, and you're coming from a user perspective, check out the keynote from our workshop a couple of months ago. There's an astrophysicist at Yale who has been looking at Chapel over a number of years, and in this past year we've gone from being "interesting" to something he can actually use for his science. He gave a great keynote about the value he sees in Chapel and how he thinks it will help his research going forward; again, that's on our YouTube channel.

Here are some resources for after today. We've got our main project page at chapel.cray.com and our GitHub repo, and we've also got Facebook and Twitter feeds if you use either of those. If you want to read one 30-page summary of Chapel, or give it to someone else to read, more or less what you've heard in this talk plus a little bit more, I don't know if anyone else has mentioned this book; it's a really great book that Pavan edited, it came out last year, and the Chapel chapter in it is my recommended starting point as a reading. It's also available online if you're not interested in buying the book. If you don't have time to read 30 pages, or you've got a manager who doesn't, here are some blog articles that give just a flavor in about a thousand words. And this is a list of our mailing lists. All these slides are available online. So maybe I have time for one question to wrap up. Yeah, back here: tools.

The question was, what's the state of tools for Chapel? The unfortunate reality is we don't have very many today. The chplvis tool that I showed is the main one that's Chapel-specific. Because we generate C code, if you're a little bit brave you can use standard C tools on the generated code; how easy or hard that is depends on what kind of Chapel code you're writing and how brave you are, because our compiler messes things up a lot along the way. So we're in this interesting chicken-and-egg situation: tools people say "I'd like to develop tools for Chapel, but I want to know it's going to be successful," and users say "I want to use Chapel, but I want tools for it," and we're kind of in a deadlock. We're trying to figure out how to get out of that deadlock, and chplvis is our first step toward offering a first tool, but I'd love to see lots more tools here, particularly debuggers.

[Audience question, roughly: can a Chapel program communicate with other programs, for example using MPI?] Yeah, so I mentioned very briefly that we have interoperability, and we think interoperability is key to the success of any new language, because if you can't work with old code you're creating an island for yourself. So we interoperate with C a lot, and we wrap a lot of libraries that way.
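As a minimal sketch of that C interoperability (my own example, not one from the talk): you declare an external C routine and then call it like any Chapel procedure. The choice of hypot from the C math library is just illustrative, and the module providing the C types has been renamed across releases (SysCTypes in older ones, CTypes in newer ones).

    use CTypes;   // c_double, c_int, etc.; older releases call this SysCTypes

    // Declare a routine from the C math library; it may need the math library
    // at link time on some systems.
    extern proc hypot(x: c_double, y: c_double): c_double;

    writeln(hypot(3.0, 4.0));   // expect 5.0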
Just in the last few months we've also started interoperating with MPI, and there are two ways you can do this. You can write a Chapel node code, like a shared-memory code, and use MPI between the nodes, just as you would normally use, say, C and OpenMP plus MPI; Chapel replaces the C plus OpenMP piece. But you can also run Chapel in a mode where it runs across the multiple compute nodes, or locales, so you're in this distributed-memory Chapel world: you can use the PGAS address space and refer to things wherever they are, but you can also pass messages between your locales. I think this will be useful particularly while we're still working on performance issues, which we are; I didn't say much about that. If there's the 10% of your code whose performance you really care about, you can imagine relying on Chapel's nice global address space everywhere else, and then in that code locking it down and saying "I'm only going to refer to local things, I'm going to do all the message passing myself," to make sure the compiler doesn't introduce any unnecessary communication. The Yale user I mentioned, Nikhil, actually developed this MPI mode, and he's been using it in his own work to do this sort of 10% performance tuning, 90% productivity kind of thing.
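Here is a minimal sketch (my own illustration, not Nikhil's code) of that "90% global view, 10% hand-tuned" pattern: one task per locale, each working only on the indices it owns, with a local block asserting that the hot loop involves no communication. someLocalKernel is a hypothetical stand-in for the real per-element work.

    use BlockDist;

    const D = {1..1000000} dmapped Block(boundingBox={1..1000000});
    var A: [D] real;

    // hypothetical stand-in for the per-element physics
    proc someLocalKernel(i: int): real {
      return 0.5 * i;
    }

    coforall loc in Locales do on loc {
      // Restrict attention to the indices this locale owns...
      const myInds = A.localSubdomain();
      // ...and assert that nothing inside communicates, so no remote
      // references sneak into the hot loop.
      local {
        forall i in myInds do
          A[i] = someLocalKernel(i);
      }
    }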
Info
Channel: Argonne National Laboratory Training
Views: 2,094
Rating: 5 out of 5
Keywords: ALCF, Argonne Leadership Computing Facility, ATPESC, Argonne Training Program on Extreme-Scale Computing, Argonne National Laboratory, ANL, supercomputing, high-performance computing, leadership-class computing, DOE LCF, DOE leadership computing, HPC, exascale computing, scientific computing, Department of Energy National Laboratories, 2016 ATPESC, Chapel, Brad Chamberlain
Id: 0DjIdRJIqRY
Length: 56min 47sec (3407 seconds)
Published: Wed Sep 14 2016