Henry2: Basic HPC Workshop: Parallel Jobs

Captions
Welcome to part four of our Introduction to HPC video workshop. At this point you should have already completed part one of the tutorial, along with watching the video on the acceptable use policy. You should also have completed part two of the tutorial on storage and file transfer, along with the video on navigating the different types of HPC storage. Finally, you should have completed part three of the tutorial on running jobs. In this final episode of the basic HPC workshop, we will cover parallel jobs.

There are a lot of terms used when describing parallel computing, so first we'll spend some time defining them. We'll define jobs, tasks, and threads. We'll define hardware terms like nodes, processors, and cores, and I'll show you how to find hardware specs for the different nodes on the cluster. Then we'll define software terms; these are the terms used when writing or running parallel programs. This will include additional discussion of tasks and threads, and it will include the definitions of serial jobs, shared memory jobs, and distributed memory jobs. Then, finally, we will discuss and run parallel jobs.

In the tutorial on running applications, we went over a basic batch script to submit a job. One of the required elements of the script was bsub -n, where we specify the number of cores. Sometimes it's more accurate to say that n is for tasks rather than cores: a job is a set of tasks, and you usually ask for one core per task. Our sample script did a single task: it created a PDF file with some temperatures in it. This was a very small example, and if that's all you ever needed to do, you probably wouldn't be watching an HPC tutorial. You probably want to submit a job that either does a giant task or a giant number of little tasks.

Before we go any further, let's define jobs and tasks. A job consists of one or more tasks. A serial job can process only one task at a time; it performs the tasks in serial. A parallel job can process multiple tasks at a time; it performs tasks in parallel. A processing element is the smallest computing device capable of executing a task. A serial job can only make use of one processing element, while a parallel job can make use of multiple processing elements.

In the past, a CPU was the processing element for a computer. CPU stands for central processing unit, and CPUs were commonly referred to as processors. Later, multiple CPUs were incorporated into a node. Now individual CPUs have multiple cores, and each of those cores is a processing element. CPUs with multiple cores are called processors. So what we have now is that a node contains one or more processors, and each processor has multiple cores.

I want to point out that you don't need to be on the HPC cluster to use multiple cores; in fact, you are probably using multiple cores right now. I'm going to open up a browser and type "cheapest laptop" and see what we can find. $74: what can we get for $74? It says dual-core, so this cheapest laptop has two cores. If we look at something a bit more pricey, here's a computer with multiple cores; this one can be configured with eight or sixteen or twenty-eight cores. Notice that when you buy one of these, you can also specify how much memory you want, along with a bunch of other things.

Here I am logged in to Henry2, and I'm going to look at the different types of nodes available. lshosts shows the list of hosts, and host is another name for node. lshosts shows the name of the node, the processor model, the number of cores on the node, and the maximum memory on the node. If we count how many lines are in that output, it shows that there are 980 nodes.
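A minimal sketch of that kind of counting (the grep pattern for GPU nodes is an assumption; how GPU nodes are labeled in the lshosts output varies by site):

    lshosts | less                   # node name, processor model, cores, max memory
    lshosts | wc -l                  # line count approximates the node count (one header line)
    lshosts | grep -i gpu | wc -l    # assumed pattern for counting GPU nodes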
We can parse through the output to list the nodes with GPUs, and if we count the lines, there are currently 32 nodes with attached GPUs. There are many nodes that have the same model of processor: there are 4 nodes that have an E5-2690, but there are 74 nodes that have this other type of processor. Usually, if two nodes have the same type of processor, they also have the same amount of memory, but that is not always the case. For example, nodes with this type of processor might have over 500 gigabytes of memory, or they might have about 130 gigabytes.

There are other commands you can use to find out more about the nodes on the cluster. You can check OS and kernel information with uname. You can check the version of Linux by looking in the centos-release file. Type lscpu to find information about the node you are on: this node has two processors, each of the processors has eight cores, and that means there are 16 cores total on the node. You can see similar information by looking at the cpuinfo file. lscpu and cpuinfo don't have the memory information, so for that you can look at meminfo. The command to look at GPU info is nvidia-smi, but there are no GPUs on the login node.

Let's do an interactive login to a GPU node. bsub -Is is for an interactive session; we're going to ask for one core and 5 minutes of time, and we have to specify the GPU queue for a GPU node. This next part has to be added to make sure the GPU is engaged, and tcsh opens a shell. Our job was submitted, we're waiting for a node, and here we are. Type nvidia-smi, and here's the information for the GPU. To leave the interactive session, type exit.
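The inspection commands and the interactive GPU request might look like the following; the queue name is an assumption, and the exact option that engages the GPU is site-specific:

    uname -a                   # OS and kernel information
    cat /etc/centos-release    # Linux distribution version
    lscpu                      # processors and cores on this node
    cat /proc/cpuinfo          # per-core details
    cat /proc/meminfo          # memory sizes
    nvidia-smi                 # GPU information (no GPUs on the login node)

    # interactive session: 1 core, 5 minutes, assumed queue name "gpu";
    # add your site's GPU option so the GPU is engaged, then open tcsh
    bsub -Is -n 1 -W 5 -q gpu tcsh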
In our documentation for Henry2, we refer to a node as having multiple processors, each having multiple cores. The software definitions can be confusing too, and might vary depending on who you talk to. We are going over these definitions because they are the terms used in all of the Henry2 documentation. A job consists of everything you write in a script and submit to LSF. When you run a job, you are doing tasks; you could do one or several tasks, and the tasks can be related, but they don't have to be. If the work to be done by a task can be broken up into smaller pieces, those pieces are called threads. Tasks are sometimes called processes, and those terms can be used interchangeably.

At the end of the day, the thing we want you to understand is how to request resources for your job. You have to request the proper resources to run your job, and that means you have to specify one core for each thing that you're doing. If your job is only doing one thing, request a single core. If your job is broken into simultaneous tasks, then request one core per task, and if those tasks can be further broken down into threads, then request one core per thread.

Here is a diagram of a single node. The node has multiple cores, it has memory that is shared by all the cores, and it has a small amount of disk space. There are almost a thousand nodes on the cluster, and the nodes communicate with each other through a network. You submit a job with LSF, and if the job is a parallel job, it can be broken up into tasks or threads; you need to reserve one core for each task or thread. The network is the hardware that allows two nodes to communicate, but the software that allows this communication is called MPI, or Message Passing Interface. If you, or whoever wrote your application, did not program the code to use MPI, your code cannot communicate over the network. Without MPI, the cores on one node cannot access the memory of another node, and that means all of the tasks must be confined to one node, where they all have access to the same memory: they share the memory. Parallel jobs that can only make use of one node are called shared memory jobs. If your code was written with MPI, the tasks can be distributed across multiple nodes, because the memory used by the application can be distributed across nodes. Parallel jobs that can use multiple nodes are called distributed memory jobs.

Now that we've covered all those definitions, it's time to finally run some parallel jobs. First we'll review the sample batch script used in the previous tutorial, then we'll discuss the modifications to the batch script that will allow you to submit basic shared or distributed memory jobs. This was the sample batch script used in the previous tutorial, and it contains everything you need to run a simple job. To run the examples in this tutorial, we're only going to add two more items to this list. The first optional argument is bsub -x, which specifies exclusive use of the node. The second optional argument is the bsub -R span specifier, which instructs LSF how to distribute the requested cores over the nodes.

Let's go over the first optional specifier, bsub -x. When should you use -x? There is no hard and fast rule, so please just keep the following in mind. The resources are shared, and unless you specify otherwise, you could be sharing a node with others. Remember from watching the video on the acceptable use policy that the AUP means play nice with others: do not oversubscribe a node by using a disproportionate amount of cores, memory, or disk compared to what you request in your batch script. On the other hand, you don't want to undersubscribe a node either; requesting more resources than necessary is not an efficient use of resources, and it unnecessarily increases queue waits for all users.

The next optional specifier is bsub -R span. The span specifier tells LSF how to arrange the cores over the nodes. One option is to specify that all the cores must be on a single node, with hosts=1. The other option is to specify how many cores per node, with ptile. If you don't care how your tasks are arranged, and the arrangement does not affect the performance of your code, you should go ahead and leave this out.
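In a batch script, those two optional items are single lines (standard LSF syntax; the ptile value is just an illustration):

    #BSUB -x                    # exclusive use of the node
    #BSUB -R "span[hosts=1]"    # all requested cores on a single node
    #BSUB -R "span[ptile=2]"    # or: exactly two cores per node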
To review: for a serial job, request 1 core, and if your code is memory intensive, request exclusive use of the node. For a shared memory job, request the number of cores that will be used by the application, and make sure all of those cores are confined to one node.

Let's go back to the requesting-cores part. If you wrote the application you are running, then you probably know how many cores will be used. But what if you didn't write it? What if you downloaded it from git or installed it with conda, and you've never used it before? The short answer is that you need to read the documentation. In general, an application will use a default number of threads: the default might be one, it might be some other fixed number like 4 or 8, or it might be all detectable cores on the machine. When looking at documentation for shared memory jobs, the parallel parts are almost always described as threads. If your application automatically spawns threads according to how many cores are on the node, then request the minimum number of desired cores and request exclusive use of the node.

In this next exercise on running a shared memory job, you will be running an application that you didn't write yourself. The code is a version of hello world: each thread of the program says hello and reports the name of the node it is running on. The documentation states that the code is shared memory and will spawn threads equal to the number specified in the environment variable OMP_NUM_THREADS. To run the application, first copy it to your scratch directory, take a look at the included sample batch script, and then submit the job. Try changing the number of threads used and rerun. Confirm the threading behavior using an interactive login; make sure you are on a compute node before running anything, and use htop and top to help you look at what the code is doing.

Here I am on Henry2, in my scratch directory, in the guide folder that I created in previous exercises. Copy the parallel examples directory to this location and cd into it. Look at the script submit_shared.csh: it requests eight cores and exclusive use, and it specifies that all of the cores should be placed on one node. OMP_NUM_THREADS is the environment variable that controls the threading behavior, and we have set it to four.
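Putting the pieces together, submit_shared.csh plausibly looks something like this; the wall-clock limit and output file names are assumptions, while the directives themselves are standard LSF:

    #!/bin/tcsh
    #BSUB -n 8                  # request eight cores
    #BSUB -x                    # exclusive use of the node
    #BSUB -R "span[hosts=1]"    # all cores on one node
    #BSUB -W 10                 # wall-clock limit in minutes (assumed)
    #BSUB -o out.%J             # stdout file (assumed naming)
    #BSUB -e err.%J             # stderr file (assumed naming)

    setenv OMP_NUM_THREADS 4    # the code spawns this many threads
    ./hello_shared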
Let's go ahead and submit the job. The job does do some calculations, and it takes between 30 and 60 seconds to run, so I'm going to clear the screen. Now our job is done. The error file from LSF is empty, and here's the output. The output is what we expected: it says hello, with the thread number and the host, and all the hosts are the same since we requested hosts=1. The application only used four cores, but we requested eight. If we hit space to go down and look at the resource usage summary, we can see that the average memory used was 2.32 megabytes, which is not very much. We did not need -x because of memory for this application, so let's modify the submit script and try again. I'm going to leave the number of cores requested at eight, but I'm going to use more of them, so I'll change OMP_NUM_THREADS to eight. I'll comment out the -x, save it, and resubmit the job. The job is running, and now it's complete. Let's check the output again; I'll look at the latest LSF output. This time it used all eight cores that we requested.

It looks like we know what the code is doing from these initial tests, but let's confirm it with an interactive session: bsub -Is, requesting one core, and -x. We might use more than one core; that's why we'll use -x. You don't need the span when you're only requesting one core, but I'm going to put it in, in case you want to resubmit this with more than one core and need to make sure all the cores are on one node. When you use -R with a span on a command line like this, you need to put the quotes in. Let's ask for ten minutes and open a shell. Now we're on a compute node; let's confirm by echoing the hostname. Let's do lscpu, and that shows that the node we're on has two processors, and each processor has four cores, so there are eight cores on the node that we requested.

To run the code, first we need to set the environment variable OMP_NUM_THREADS, and we'll set it to four. We're going to run the code and put it in the background. The standard out comes right to the screen even though we put it in the background, so hit enter and type htop. htop shows that I am running four instances of the program hello_shared, so it looks like I am fully utilizing four cores on this node. We're going to do that one more time, and I'll show you top. Let's run it and put it in the background; you notice htop moved, it was dynamic. If you want a time slice of what's going on, run top with -n 1 for one iteration, and you have to add the capital -H to get all the threads. There we go; everything looks good. Let's exit the interactive session.

When running a distributed memory job, request n according to the number of MPI tasks, or processes. MPI jobs need to set the MPI environment, and they need to use mpirun. That means you should be loading a module that has MPI, and you need to add mpirun before calling the application. Note that with LSF, the value of n is passed to mpirun; do not use an argument to mpirun unless there is a reason for it to be different from n. You also may want to specify the span, which means specifying the number of tasks to put on each node. There are two reasons why you might want to specify ptile, and in both cases you should also include -x. The first reason to limit the number of MPI tasks per node is that each task takes a lot of memory; if you are using ptile because you are dealing with memory limitations, you don't want other jobs competing for that memory. The second reason could be that you are running a hybrid code, where MPI tasks may spawn threads. If the MPI tasks spawn threads, you need to use -x so that LSF does not assign additional jobs to those nodes.

In this next exercise, you will be running a distributed memory job. The code is the same hello world that you ran in the previous exercise, except that it has been parallelized with MPI. Each thread of the program says hello and reports both the name of the node it is running on and which MPI task it was spawned from. Using the same directory as the shared memory example, open the sample LSF script and look at the parameters. Submit the job, and look at the LSF monitoring and LSF output. Change the parameters and run again. Finally, remove mpirun from the script and see what happens.

Here I am on Henry2, still in the parallel directory, and I'm going to look at submit_mpi.csh. We're asking for six MPI tasks, and there should only be two tasks per node, so three nodes are going to be used. We're asking for exclusive use of the node because this code is going to spawn four threads for every task; that means for every one of these tasks, four threads will be spawned, so there will be 24 threads running simultaneously. We have to load an MPI module, and we have to use mpirun in front of the executable.
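Based on that description, submit_mpi.csh is plausibly along these lines; the module name, the threads-per-task variable, the executable name, and the wall-clock limit are assumptions:

    #!/bin/tcsh
    #BSUB -n 6                  # six MPI tasks
    #BSUB -R "span[ptile=2]"    # two tasks per node, so three nodes
    #BSUB -x                    # exclusive: each task also spawns threads
    #BSUB -W 10                 # wall-clock limit in minutes (assumed)
    #BSUB -o out.%J
    #BSUB -e err.%J

    module load openmpi         # assumed MPI module name
    setenv OMP_NUM_THREADS 4    # four threads per MPI task (assumed variable)
    mpirun ./hello_mpi          # LSF passes n to mpirun; no extra count needed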
Let's run it. While the code is running, let's do bjobs -l; that's for long form, and it gives more information. I want to look at this part: it says that it reserved six cores, and there are three nodes, with two cores assigned to each node. Let's look at the output. The error file is empty. Let's look at the output file, and here we have lots of hellos. There should be three nodes that were used, and indeed only three distinct node names are shown, so it did correctly reserve 3 nodes. There should be 6 MPI tasks, and here we go, 1 through 6: there are 6 tasks. Each task should have spawned 4 threads, and here we go, 1 through 4. So it looks like it worked.

Let's modify the submission script and try again: nano submit_mpi.csh. Six tasks: let's change it to eight. ptile=2: let's change it to one. OK, if we're asking for 8 and we only put one here, that means we need 8 nodes, and this is sort of a waste of nodes, but let's just make the thread count bigger: 8. So now we should reserve 8 nodes, there's going to be 1 MPI task on each node, and each should spawn 8 threads, so we should be running 64 threads. I specified ptile=1 and 8 threads per node, but what if I get a node with 32 cores? Then I'm being very wasteful. So let's go back to the HPC website and see if we can specify an eight-core node, so we're not wasting resources.

Go to the HPC website, click Documentation, then Running Jobs, and go to the generic template for batch scripts. Scroll down and hit Select Resource Type. We want to select a resource type of a node with 8 cores, so let's go to LSF Resources for more information. Scroll down to Resource by Processor Type. Each node has two of the same model of processor, but the resource type description is in terms of processors, so DC is for dual-core, which means two cores on the processor and four cores on the node. What we want is quad-core, because we want to make sure there are eight cores on the node, and QC is for quad-core. I am going to copy this specification and put it in our script. It doesn't matter where you put it; I'm going to go ahead and put it here. It needs to start with #BSUB, since all the commands for LSF start with #BSUB. I'm going to paste it: capital R for resource. Then go over and change the resource type to the quad-core processor. Quad-core processor means four cores on the processor and eight cores on the node. We're requesting eight MPI tasks, one task per node, and eight threads per task, so there should be 64 total threads. Let's save it and run it.

Let's look at the output; I'm going to use less. The error file is empty, so that's good. There are eight MPI tasks and eight threads per task, and if we go and count the lines, there are eight different nodes used and 64 separate threads.

Let's go back to the instructions. It says to edit the submit script to remove mpirun, resubmit, and see what happens. So we'll open the submission script again, and this time we're going to pretend that we forgot mpirun: we'll just get rid of it and run it. There is no error. Let's look at the output file: we asked for eight MPI tasks, but it only gave us one. There were eight threads spawned, but there's some bug in the code, because it thinks it's saying hello from thread five. All of these threads were on the same node, but we still reserved eight nodes, so we were only using one node out of the eight that we asked for.

That was a lot, so go ahead and pause the video and try this exercise a couple more times: change the parameters, see what you get, and go back to Henry2 and look at the resource types. Congratulations on finishing the basic HPC workshop. For more information, always check back with our website: check the documentation on running jobs, and click the support links for learning. Scroll down to the recommended resources on learning more about high-performance computing. To be notified of new videos, please subscribe to the OIT HPC YouTube channel. To find more information about a specific video, click on the video and scroll down to Show More.
Info
Channel: OIT HPC
Views: 1,758
Rating: 5 out of 5
Keywords: hpc, henry2, parallel jobs, parallel computing, lsf, hpc workshop, ncsu, shared memory, distributed memory, mpi
Id: 0kGMP3JJkSw
Length: 29min 39sec (1779 seconds)
Published: Tue Mar 31 2020