An introduction to Message Passing Interface via MPI4Py

Captions
All right, this lecture is going to be an introduction to the Message Passing Interface, or MPI. We've already talked about it a little in the introduction to parallel computing: it's the de facto standard for distributed-memory parallel computing. The standard MPI application programming interfaces out there are typically written in C/C++ and Fortran, and these include implementations such as OpenMPI; MPICH is another one; and there's also MVAPICH, which is more used on architectures that have an InfiniBand interconnect. All three of these APIs should be implemented in such a way that they conform to the MPI-2 standard, so it shouldn't matter which one you use. You might get slightly different performance, but the application programming interface in each of those languages should be basically identical. We also have an implementation that's accessible in Python through the module MPI4Py.

The programming paradigm we use in message passing is that an MPI program is launched as separate processes, or tasks, each with its own address space, and this requires partitioning the data across tasks. Typically we have some data, we break it up into chunks, and then we send those chunks out to all the different processors to be operated on. The data is explicitly moved from task to task; that is, we explicitly call methods that tell the data to go from one process to another, or from one process to all the other processes. This is the idea behind message passing, and there are two classes of it: point-to-point communications and collective communications. We've already done a little exercise in class with send and receive; those are point-to-point communications.

As I mentioned, MPI4Py is a module for Python that gives us access to the MPI application programming interface in almost one-to-one correspondence with the C++ interface. So if you learn it in Python, where it should be syntactically easier, and later on you need to move to C++ due to performance needs, the transition should be very straightforward. Another nice thing is that we can pass very general Python objects as messages. Of course, as with all things Python, what you may lose in performance you gain in a much shorter development time.

We begin each MPI program by establishing a communicator object, which we typically instantiate as MPI_COMM_WORLD (in C or C++ this would be defined in a header file). MPI_COMM_WORLD is defined as all the processors of your job. You call this at the beginning of your program, and it assigns a unique rank, a unique process ID, to each of the tasks that were instantiated. So if you launched MPI on two processors, this would establish two ranks; if you launched it on four, it would establish four ranks. Ranks, again, are just process IDs. They start their index from zero, as all things in C do, so 0, 1, 2, 3 would be the ranks for four processors, and they're assigned by the system when the MPI object is instantiated.

So here's a very simple hello-world program using MPI4Py. We assign the MPI.COMM_WORLD object to a variable comm, and comm has a rank attribute associated with it, so if we want to know the ID of each process that was initialized, we can use comm.rank to return it.
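Here's a minimal sketch of that hello-world program (the file name hello.py is my assumption, not something stated in the video):

# hello.py -- each MPI process prints its own rank
from mpi4py import MPI

comm = MPI.COMM_WORLD  # communicator spanning all processes in the job

# comm.rank is the unique process ID assigned by the system, starting at 0
print(f"Hello, World! I am process {comm.rank}")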
We've already seen this in class, but I'll go ahead and run it quickly from the command line. The way we run this is mpiexec -n 2 python hello.py, that is, mpiexec with the number of processes, say 2, and then the Python program. If we run that, you can see it runs the hello-world program, returning the rank index at each process. We could also run it on four processors, and there you go. Notice the ranks aren't printed in order; the processes run independently, and the order in which they get printed to the screen is somewhat random.

In point-to-point communication, which is what we've already taken a short look at in class, we're explicitly passing one set of data at one rank to another rank. In this example the first few lines just initialize the program, and then we set up an array v that's filled with the rank ID. v is going to be a NumPy array of length 500 whose entries are the value of the rank itself, of type float: rank 0 gets an array of 500 zeros, rank 1 gets an array of 500 ones, and so on. Then we explicitly pass those along: rank 0 sends to rank 1, and every higher rank receives from the rank below it and sends to the rank above it. If I'm rank 2, I receive from rank 1 and send to rank 3. We're just passing the data along, which is a somewhat contrived program, but it's very similar to what we did in class, and at the end each process prints its rank and what it received. We run it with the same command, except this time the name of the file is send_receive.py. In that case I ran it with four processes, so rank 1 reports it received the zeros, rank 2 the ones, and so on; again they're printed to the screen somewhat randomly, so you can't make much of the ordering. Typically we wouldn't have all the ranks print on top of one another like this; we'd only have one of the ranks print the finalized results. But that's how the send-and-receive program works.
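Here's a minimal sketch of that send-and-receive program under the same assumptions (I've guarded the last rank so it doesn't try to send past the end of the line):

# send_receive.py -- pass an array down the line of ranks
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.rank
size = comm.size

# Each rank fills an array of 500 floats with its own rank ID
v = rank * np.ones(500, dtype=np.float64)

# Every rank above 0 receives from the rank below it...
if rank > 0:
    received = np.empty(500, dtype=np.float64)
    comm.Recv(received, source=rank - 1)
    print(f"Rank {rank} received an array of {received[0]}'s")

# ...and every rank except the last sends to the rank above it
if rank < size - 1:
    comm.Send(v, dest=rank + 1)

Running it with mpiexec -n 4 python send_receive.py should show each rank reporting the values it received from its neighbor.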
There are several different types of collective communication. The simplest is broadcast: in a broadcast communication we take a single thing and distribute it to all the processes. In a scatter communication we take a single thing, decompose it into multiple pieces, and send one piece out to each process; for example, a list of length 16 could be broken into four lists of four, with each of those sent out to an individual process. The gather command is the complement of that: we take each of the individual things at each process and gather them into a single list, the exact reverse of the scatter command.

A parallel reduction performs some operation in parallel. If we wanted to add up the four numbers 1, 3, 5, and 7, one way would be to add 1 and 3 in parallel with 5 and 7, and then add the two results together: 1 + 3 = 4, 5 + 7 = 12, and 4 + 12 = 16. That example uses an addition operation, but you could also do a product, a max or min, or write your own operator. These are all used with the function reduce; we just tell it which operation to perform.

As an example of a scatter command, and this is the typical format for an MPI program, we have one rank, typically rank 0, acting as a leader: in this case it creates a list and sends pieces of that list out to all the different processes. If we run the scatter program, you can see that we create a random grid; since I initialized it on four processors, I get a 4-by-4 grid, and each row of that grid is scattered out to one of the processes, which then just prints the result to the screen. Pretty simple program, but that's the idea behind scatter.

Gather, as I mentioned, is the complementary command to scatter. In this case we start the program the same way: we initialize and scatter, then print the array that each process received, square that array on each process, and gather the results back at the root, which has been assigned as rank 0 here (you can specify where you want to gather to), and print the result. We'll run it on only two processors so it's a little easier to see what's going on, but we could run it on four or eight or twelve; it doesn't matter. In this case the entries 0, 1, 2, 3 are scattered out, and each process prints what it got: "I got this array [0, 1]", "I got this array [2, 3]". Then the entries are squared: 0 squared is 0 and 1 squared is 1, while 2 squared is 4 and 3 squared is 9. You can see they're collected into one list at the end, which is printed back on rank 0.

Broadcast is slightly less interesting: we just send one item out to all the processes. We do the same exact thing as before, except at the end rank 0 creates a buffer containing the string "done" and broadcasts that single string to all the other processes, which print it. You get the exact same results as the previous scatter-gather run, except at the end you have "done" printed to the screen twice, and if we change this to run on four processes you'd see the same thing there: a 4-by-4 array that gets scattered, squared, gathered, and then "done" at the end. A combined sketch is given below.
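Here's a sketch combining the scatter, gather, and broadcast steps just described. I've used a grid of consecutive integers rather than a random one so the printed numbers match the squaring demo, and the file name scatter_gather.py is my assumption:

# scatter_gather.py -- scatter rows, square them, gather at the root,
# then broadcast a "done" message to every rank
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.rank
size = comm.size
root = 0

# The leader (rank 0) creates the full grid and splits it into rows
if rank == root:
    grid = np.arange(size * size, dtype=np.float64).reshape(size, size)
    rows = [grid[i, :] for i in range(size)]
else:
    rows = None

# Each rank gets one row of the grid
row = comm.scatter(rows, root=root)
print(f"Rank {rank}: I got this array {row}")

# Square locally, then collect the squared rows back at the root
squares = comm.gather(row**2, root=root)
if rank == root:
    print(f"Gathered at rank {root}: {squares}")

# Broadcast a single string from the root out to every process
message = comm.bcast("done" if rank == root else None, root=root)
print(message)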
Finally, we'll give an example of the reduce algorithm. The initial part is exactly the same: we scatter and square. But this time, instead of doing a simple gather, we do a reduction, such that after the elements are squared they aren't just brought back into one list; they're summed. If we sent out a 4-by-4 and got back a 4-by-4 before, this time we only get back four entries, because the entries down each column of the array are summed. To give a demonstration: the 4-by-4 at the top, with entries 0, 1, 2, 3, 4, 5, 6, 7, all the way up to 15, is what was created. Each row of that was sent out to a processor, so 0, 1, 2, 3 and 4, 5, 6, 7 and 8, 9, 10, 11 and 12, 13, 14, 15 were each received at a process. They were then squared (I'm not printing out the result of that squaring), and then a parallel reduction was done such that the corresponding entries from each process were added up. You can verify this: if you square 0, 4, 8, and 12 and add them all up, you get 0 + 16 + 64 + 144 = 224, and the same goes here: if you square 1, 5, 9, and 13 and add them up, you get 1 + 25 + 81 + 169 = 276. Maybe it's a little easier to see what's happening on just two processors: if I square 0, which is 0, and square 2, which is 4, and add them up, I get 4; if I square 1 I get 1, and if I square 3 I get 9, and adding them up I get 10. So that's the answer there. A sketch of this program is given below.

I'd encourage you to go out and play with these programs; you can basically copy and paste them right into a Python file and be able to run them. There are just a couple of references: one is from the previous lecture, and there's also the MPI4Py documentation out there. So this was a very short introduction to MPI in general, but also more specifically to using the MPI4Py module.
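Here's a sketch of that reduce program under the same assumptions (the file name reduce_sum.py is mine); it scatters the rows, squares them, and sums the squares element-wise across ranks with MPI.SUM:

# reduce_sum.py -- scatter rows, square them, then sum the squared
# rows element-wise across all ranks with a parallel reduction
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.rank
size = comm.size
root = 0

# Rank 0 creates the array 0, 1, ..., size*size - 1 and splits it into rows
if rank == root:
    grid = np.arange(size * size, dtype=np.float64).reshape(size, size)
    rows = [grid[i, :] for i in range(size)]
else:
    rows = None

row = comm.scatter(rows, root=root)
print(f"Rank {rank} received {row}")

# Corresponding entries of the squared rows are added up, with the
# result landing only on the root rank
total = np.zeros_like(row) if rank == root else None
comm.Reduce(row**2, total, op=MPI.SUM, root=root)

if rank == root:
    print(f"Sum of squares down each column: {total}")

On four processes this prints [224. 276. 336. 404.] at the root, matching the arithmetic above.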
Info
Channel: John Foster
Views: 12,674
Rating: 4.7872338 out of 5
Keywords: UTSA_ME5013, UNIX, Linux, MPI, MPI4Py
Id: Udn9wmmb9YY
Length: 17min 2sec (1022 seconds)
Published: Mon Nov 19 2012