Serial vs Parallel Jobs

Running your jobs in series means that every task is executed one after the other (serially). You can take better advantage of the cluster by running your jobs in parallel: many more tasks execute at once (simultaneously), and you get results faster.

Serial Programs

Serial programs are fine for trivial applications with simple calculations that don't require much processing power or speed. If you need to run computationally heavy, time-demanding programs that require far more processing speed and power, consider going parallel to achieve results faster.

[Image: introduction-to-serial]

The image above depicts how a task is serially executed by a single processor.

Parallel Programs

Parallel computing involves two or more processors solving a single problem. The image below depicts how multiple tasks are executed by multiple CPUs. The more CPUs you add, the faster the tasks can be completed.

Note that a program without any form of parallelization or parallel API implementation can't take advantage of multiple CPUs.

[Image: introduction-to-parallel]
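As a minimal illustration (separate from the cluster examples that follow), a plain Python loop uses one CPU no matter how many are allocated, while code written against a parallel API, here Python's built-in multiprocessing module, can spread the same work over several:

# minimal sketch: the list comprehension runs on one CPU regardless of
# allocation; the Pool spreads the same work across up to four processes
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    serial = [square(x) for x in range(8)]   # serial: one CPU
    with Pool(processes=4) as pool:          # parallel: up to four CPUs
        parallel = pool.map(square, range(8))
    assert serial == parallel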

Explanation of ntasks

The --ntasks option specifies the number of tasks in a job or job step.

When used in your SBATCH script, it specifies how many instances of your program are to be executed. It’s useful if you have independent tasks that you want to run in parallel within the same SBATCH script or if your program supports communication across computers.

Examples are shown below:

  • Example 1

    #!/bin/bash
    
    #SBATCH --job-name=TestJob
    #SBATCH --ntasks=1
    #SBATCH --time=00:01:00
    
    srun echo "Hello!"

    Output

    Hello!

    Explanation

    --ntasks=1 requests one process, i.e., one task to carry out. Because only one task is specified in the job steps section, srun echo "Hello!" is executed once.

  • Example 2

    #!/bin/bash
    
    #SBATCH --job-name=TestJob
    #SBATCH --ntasks=2
    #SBATCH --time=00:01:00
    
    srun echo "Hello!"

    Output

    Hello!
    Hello!

    Explanation

    --ntasks=2 requests two processes, i.e., two tasks to carry out. Because two tasks are specified in the job steps section, srun echo "Hello!" is executed twice.

The examples above were actually executed sequentially. However, the main purpose of --ntasks is to allow job steps to run in parallel. Consider the next example.

  • Example 3

    #!/bin/bash
    
    #SBATCH --job-name=TestJob
    #SBATCH --ntasks=3
    #SBATCH --time=00:05:00
    #SBATCH --cpus-per-task=1
    
    srun --ntasks 1 echo Hello 1!
    srun --ntasks 1 echo Hello 2!
    srun --ntasks 1 echo Hello 3!

    Output

    Hello 1!
    Hello 2!
    Hello 3!

    Next, investigate the job by running the sacct command with the start and end fields added to its format.

    $ sacct --format=jobid,jobname,partition,ncpus,start,end,state

    Output

           JobID    JobName  Partition      NCPUS               Start                 End      State
    ------------ ---------- ---------- ---------- ------------------- ------------------- ----------
    221540          TestJob     normal          3 2020-09-11T04:58:01 2020-09-11T04:58:02  COMPLETED
    221540.batch      batch                     3 2020-09-11T04:58:01 2020-09-11T04:58:02  COMPLETED
    221540.exte+     extern                     3 2020-09-11T04:58:01 2020-09-11T04:58:02  COMPLETED
    221540.0           echo                     1 2020-09-11T04:58:02 2020-09-11T04:58:02  COMPLETED
    221540.1           echo                     1 2020-09-11T04:58:02 2020-09-11T04:58:02  COMPLETED
    221540.2           echo                     1 2020-09-11T04:58:02 2020-09-11T04:58:02  COMPLETED

    Explanation

    If you look closely at the start and end times in the sacct output above, you will notice that the job steps (221540.0, 221540.1, and 221540.2) all started at the same time, 04:58:02, and ended at the same time as well, although end times can vary with the workload of a given job step.

This means that the job steps were executed in parallel. With the help of the srun command, job steps can be broken down further by telling Slurm to treat each step as an individual process. Each process is then assigned 1 CPU, per the resource request section (--cpus-per-task=1).

The Srun Command

srun is a multipurpose command: within a submission script it creates job steps, and it can also launch processes interactively from the terminal. For instance, if you have a parallel MPI program, srun takes care of creating all the MPI processes.

When used within a batch script, srun executes each job step it prefixes on the compute nodes. When used interactively, the output is printed directly to your terminal.

The srun command can be used in two ways:

  • Within job scripts

  • Interactively

Jobs can be run interactively using the srun command. Consider the example below, which uses srun to execute a python program interactively.

Program Execution with Srun

  1. Log in to Discovery, create the file program.py in your home directory, paste in the code below, and save.

    Python script (program.py)

    txt = "Running jobs interactively with srun command"
    print(txt)
  2. Load the python module. You can run the module spider python command to choose from the list of all python versions available on Discovery.

    module load python/3.8.0-gcc-9.2.0-o7ug55w
  3. Once the python module is loaded, the python script can be executed on the compute nodes using the srun command.

    srun -n 1 --time=00:10:00 --partition=normal python program.py

    The -n flag (the short form of --ntasks) specifies the number of tasks to run, the --time flag sets the duration, and the --partition flag selects which partition (normal, backfill, epscor, etc.) to run your job in.

    Output

    Running jobs interactively with srun command

    You should see the output above printed to the console after the srun command executes successfully.
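srun can also drop you into an interactive shell on a compute node, which is handy for testing commands before putting them into a batch script. A minimal sketch (the partition name and time limit here are assumptions; adjust them to your site):

$ srun --ntasks=1 --time=00:30:00 --partition=normal --pty bash

The --pty flag attaches a pseudo-terminal, so the shell behaves like a normal login shell on the allocated node; type exit to release the allocation.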

Sequential Execution

Example - Batch script

#!/bin/bash

#SBATCH --ntasks=1

srun sleep 10
srun sleep 20
srun hostname

The above script runs the Linux sleep command for 10 seconds, runs sleep again for 20 seconds, and then prints the hostname of the compute node that executed the job. To check the job's statistics, run the sacct command.

$ sacct -j 227 --format=JobID,Start,End,Elapsed,NCPUS

Output

       JobID               Start                 End    Elapsed      NCPUS
------------ ------------------- ------------------- ---------- ----------
227          2019-12-02T23:16:06 2019-12-02T23:16:37   00:00:31          4
227.batch    2019-12-02T23:16:06 2019-12-02T23:16:37   00:00:31          4
227.0        2019-12-02T23:16:07 2019-12-02T23:16:17   00:00:10          1
227.1        2019-12-02T23:16:17 2019-12-02T23:16:37   00:00:20          1
227.2        2019-12-02T23:16:37 2019-12-02T23:16:37   00:00:00          1

Explanation

In the above example, there are 3 job steps, and the statistics show that the first job step had to finish before the rest commenced: the first step finished at 23:16:17, after which the second step ran, followed by the third. This means the job steps were executed sequentially. The srun command in this context runs your program as many times as specified by --ntasks; for example, with --ntasks=2, every command in the job steps would be executed twice.

Parallel Execution

Example - Batch script

#!/bin/bash

#SBATCH --ntasks=3
#SBATCH --cpus-per-task=2
#SBATCH --time 0-00:10:00

# Execute job steps
srun --ntasks 1 --exact sleep 10 &
srun --ntasks 1 --exact sleep 20 &
srun --ntasks 1 --exact hostname &
wait

The above script starts a 10-second Linux sleep, a 20-second Linux sleep, and a hostname command in the background, then waits for all of them to finish. To check the statistics of the job, run the sacct command.

$ sacct -j 1901568 --format=JobID,JobName,Partition,NCPUS,Start,End,State

Output

JobID           JobName  Partition      NCPUS               Start                 End      State
------------ ---------- ---------- ---------- ------------------- ------------------- ----------
1901568      subBash.sh     normal          6 2022-07-29T01:00:45 2022-07-29T01:01:05  COMPLETED
1901568.bat+      batch                     6 2022-07-29T01:00:45 2022-07-29T01:01:05  COMPLETED
1901568.ext+     extern                     6 2022-07-29T01:00:45 2022-07-29T01:01:05  COMPLETED
1901568.0         sleep                     2 2022-07-29T01:00:45 2022-07-29T01:01:05  COMPLETED
1901568.1         sleep                     2 2022-07-29T01:00:45 2022-07-29T01:00:55  COMPLETED
1901568.2      hostname                     2 2022-07-29T01:00:45 2022-07-29T01:00:45  COMPLETED

Explanation

In the above example, there are 3 job steps, and the statistics show that all of them (1901568.0, 1901568.1, 1901568.2) started executing at the same time (01:00:45) but finished at different times. This means the job steps were executed simultaneously.

The ampersand (&) at the end of each srun command runs that job step in the background, removing srun's normal blocking behavior so the next step can launch immediately. When backgrounding steps this way, the wait command is vital: it makes the script pause until all background tasks have finished. Without it, the batch script would reach its end as soon as the steps were launched, and Slurm would end the job, cancelling tasks that were still running even though their sibling tasks had completed successfully.

Note that the total number of tasks in the above job script is 3, and each job step runs only once (srun --ntasks 1). The script requests 2 CPUs for each task (#SBATCH --cpus-per-task=2) because individual srun commands won't share CPU cores: by default, Slurm allocates whole cores, and on a hyper-threaded node each core exposes 2 CPUs, so requesting 2 CPUs per task gives each job step a full core of its own.

If you request only one CPU per task, the srun commands may not run simultaneously. For example, submit the following job script:

#!/bin/bash

#SBATCH --ntasks=3
#SBATCH --cpus-per-task=1
#SBATCH --time 0-00:10:00

# Execute job steps
srun --ntasks 1 --exact sleep 10 &
srun --ntasks 1 --exact sleep 20 &
srun --ntasks 1 --exact hostname &
wait

To check the statistics of the job, run the sacct command.

$ sacct -j 1901575 --format=JobID,JobName,Partition,NCPUS,Start,End,State

Output

JobID           JobName  Partition      NCPUS               Start                 End      State
------------ ---------- ---------- ---------- ------------------- ------------------- ----------
1901575      subBash.sh     normal          3 2022-07-29T02:48:12 2022-07-29T02:49:38  COMPLETED
1901575.bat+      batch                     3 2022-07-29T02:48:12 2022-07-29T02:49:38  COMPLETED
1901575.ext+     extern                     3 2022-07-29T02:48:12 2022-07-29T02:49:38  COMPLETED
1901575.0      hostname                     1 2022-07-29T02:48:12 2022-07-29T02:48:12  COMPLETED
1901575.1         sleep                     1 2022-07-29T02:48:12 2022-07-29T02:48:22  COMPLETED
1901575.3         sleep                     1 2022-07-29T02:49:17 2022-07-29T02:49:37  COMPLETED

The above output shows that two tasks (1901575.0 and 1901575.1) started at the same time (02:48:12), but the third task started only after the first two had completed. This is because you requested three tasks (--ntasks=3) with one CPU per task (--cpus-per-task=1). Hyper-threading is enabled by default, so Slurm allocates whole cores, and the three requested CPUs fit in two physical cores. Each srun command occupies a core of its own, so only two job steps can run at once; the third waits until one of the busy cores is freed.
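To confirm how many hardware threads per core a node exposes, you can inspect its definition with scontrol (the node name below is only an illustration; substitute one from your cluster):

$ scontrol show node discovery-c34 | grep -oE "(CoresPerSocket|ThreadsPerCore)=[0-9]+"

A ThreadsPerCore value of 2 indicates that hyper-threading is enabled, which is what makes --cpus-per-task=2 correspond to one physical core here.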

There are two ways to overcome this issue and allow all srun commands to run in parallel:

  1. Request 2 CPUs per task (--cpus-per-task=2), as shown previously.

  2. Disable Hyperthreading as shown in the following script:

    #!/bin/bash
    
    #SBATCH --ntasks=3
    #SBATCH --cpus-per-task=1
    #SBATCH --time 0-00:10:00
    #SBATCH --hint=nomultithread
    
    # Execute job steps
    srun --ntasks 1 --exact sleep 10 &
    srun --ntasks 1 --exact sleep 20 &
    srun --ntasks 1 --exact hostname &
    wait

Note the --exact flag used with each srun command (e.g., srun --ntasks 1 --exact sleep 10 &); it restricts each job step to only the resources requested for that step.

Summary

srun in a submission script is used to create job steps and launch processes. If you have a parallel MPI program, srun takes care of creating all the MPI processes. Prefixing your job steps with srun causes them to be executed on the compute nodes. The --ntasks flag of the srun command is analogous to the --ntasks option in the #SBATCH directives.

Passing Multiple Arguments

The example below shows how to pass different arguments to your program and execute multiple instances of it in parallel.

Example

Create a python script named pyScript.py with the following content:

#!/bin/python3

import sys
import platform
from datetime import datetime
from time import sleep


current_time = datetime.now()
dt_format = current_time.strftime("%H:%M:%S")


print('Hello From "{}" on host "{}" at {}.'.format(sys.argv[1],platform.node(), dt_format))

The above python program gets the current time, the hostname, and an argument from the user, then prints Hello From "argument" on host "hostname" at the current time.

To run three different instances of the python program, you need to create a job script containing three job steps (srun commands) with the required arguments, as shown in the following script:

#!/bin/bash

#SBATCH --job-name pytest
#SBATCH --output pytest.out
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=2
#SBATCH --mem-per-cpu=500M
#SBATCH --partition=interactive
#SBATCH --time=0-00:8:00

# job steps
srun --ntasks=1 --exact python pyScript.py "1" &
srun --ntasks=1 --exact python pyScript.py "2" &
srun --ntasks=1 --exact python pyScript.py "3" &
wait

In the resource request section of the batch script, 3 tasks were requested for 8 minutes, with 2 CPUs per task and 500 MB of RAM per CPU.

In the job steps section, the steps are compartmentalized by specifying how Slurm should treat each one (the number of processes per step). The --exact flag gives each step access to only the resources requested for that step. The python command is run against the script with a different argument each time ("1", "2", and "3").
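Equivalently, the three job steps can be generated with a shell loop; this sketch behaves the same as the three explicit lines above:

# launch one job step per argument, then wait for all of them
for i in 1 2 3; do
  srun --ntasks=1 --exact python pyScript.py "$i" &
done
wait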

Once the job is completed, you can check the output file.

cat pytest.out
Hello From "2" on host "discovery-c34" at 04:40:59.
Hello From "1" on host "discovery-c34" at 04:40:59.
Hello From "3" on host "discovery-c34" at 04:40:59.

The output file shows that the three steps started at the same time (in parallel).

To check the job statistics, run the sacct command.

$ sacct -j 1901587 --format=JobID,JobName,Partition,NCPUS,Start,End,State

Stats Output

JobID           JobName  Partition      NCPUS               Start                 End      State
------------ ---------- ---------- ---------- ------------------- ------------------- ----------
1901587          pytest interacti+          6 2022-07-29T04:40:58 2022-07-29T04:40:59  COMPLETED
1901587.bat+      batch                     6 2022-07-29T04:40:58 2022-07-29T04:40:59  COMPLETED
1901587.ext+     extern                     6 2022-07-29T04:40:58 2022-07-29T04:40:59  COMPLETED
1901587.0        python                     2 2022-07-29T04:40:58 2022-07-29T04:40:59  COMPLETED
1901587.1        python                     2 2022-07-29T04:40:58 2022-07-29T04:40:59  COMPLETED
1901587.2        python                     2 2022-07-29T04:40:58 2022-07-29T04:40:59  COMPLETED

Explanation

Taking a closer look at the start and end times of each job step, you can infer that all tasks ran independently and in parallel: they all started at the same time. You will also notice that the order of the job steps in the output differs from the order in which they were specified: 1, 2, 3 in our batch script versus 2, 1, 3 in our output.

Using --multi-prog

--multi-prog runs a job with a different program and different arguments for each task. In this mode, the executable given to srun is actually a configuration file specifying the executable and arguments for each task. The example in the Passing Multiple Arguments section can be rewritten by passing the --multi-prog flag to the srun command and pointing it at an external file that describes the tasks to execute. This can also be used to run different executables at the same time. Consider the example below.

  • Slurm Script 'script.sh'

    #!/bin/bash

    #SBATCH --job-name pytest
    #SBATCH --output pytest.out
    #SBATCH --ntasks=3
    #SBATCH --cpus-per-task=2
    #SBATCH --mem-per-cpu=500M
    #SBATCH --partition=interactive
    #SBATCH --time=0-00:8:00

    srun --ntasks=3 -l --preserve-env --multi-prog ./file.conf

  • File 'file.conf'

    0 python pyScript.py "1"
    1 python pyScript.py "2"
    2 python pyScript.py "3"

This file lists the tasks to be run. Each line begins with a task number, starting from zero, followed by the executable and its arguments; the executables can differ from task to task.

Output

2: Hello From ""3"" on host "discovery-c34" at 04:54:25.
0: Hello From ""1"" on host "discovery-c34" at 04:54:25.
1: Hello From ""2"" on host "discovery-c34" at 04:54:25.

Explanation

In the resource request section of the batch script, 3 tasks were requested for 8 minutes, with 2 CPUs per task and 500 MB of RAM per CPU.

The srun command tells Slurm to run the programs specified in the external file and to treat each line of that file as an individual task. Make sure that the total number of tasks passed as a flag (--ntasks=3) equals the number of lines in file.conf and matches the total number of tasks in the resource request section (--ntasks=3). The -l flag tells srun to prepend the task number to each line of output, as shown above.
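The task-number column of the configuration file also accepts comma-separated lists and ranges, and srun expands %t to the task's number. A hypothetical variant of file.conf:

# hypothetical config mixing executables and using a rank range
0    hostname
1-2  echo task %t

Here task 0 runs hostname, while tasks 1 and 2 each run echo with their own task number.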

Running MPI Jobs

MPI is a standard library used to send messages between multiple processes. These processes can be located on the same system (a single multi-core machine) or on a collection of distributed servers, and MPI provides efficient inter-process communication between them. MPI is a set of function calls and libraries that implement distributed execution of a program. Distributed doesn't necessarily mean many machines; you can run multiple MPI processes on a laptop.
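As a minimal sketch of such message passing (separate from the rank-reporting example later in this section; it needs at least two ranks, e.g. srun -n 2), one process can send an integer to another:

/* minimal sketch: rank 1 sends one integer to rank 0 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  int rank, value = 0;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 1) {
      value = 42;
      /* send one int with tag 0 to rank 0 */
      MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
  } else if (rank == 0) {
      /* block until the message from rank 1 arrives */
      MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      printf("rank 0 received %d from rank 1\n", value);
  }

  MPI_Finalize();
  return 0;
}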

MPI Variations

Library     Description
----------  ----------------------------------------------
MPI         Message passing library standard
OpenMPI     Open-source implementation of the MPI library
OpenMP      Compiler add-on

There are other implementations of MPI, such as MVAPICH, MPICH, and Intel MPI.

MPI Benefits

  • Thread-based parallelism

  • Better performance on large shared-memory nodes

  • Uses the fastest available interconnect

  • No need to recompile your program on every cluster

  • Portable and easy to use

The C program below starts 3 processes, each of which can communicate with the others using MPI.

Example

  • C program 'program.c'

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>   /* for gethostname() */

    #include <mpi.h>

    int main(int argc, char *argv[])
    {
      int tid, nthreads;
      char *cpu_name;

      /* initialize the MPI execution environment */
      MPI_Init(&argc, &argv);

      /* rank (ID) of this process within the communicator */
      MPI_Comm_rank(MPI_COMM_WORLD, &tid);

      /* total number of processes in the communicator */
      MPI_Comm_size(MPI_COMM_WORLD, &nthreads);

      cpu_name = (char *)calloc(80, sizeof(char));

      gethostname(cpu_name, 80);

      printf("hello MPI user: from process = %i on machine=%s, of NCPU=%i processes\n",
             tid, cpu_name, nthreads);

      free(cpu_name);
      MPI_Finalize();
      return 0;
    }

  • Batch script 'script.sh'

    #!/bin/bash

    #SBATCH --job-name=MpiJob
    #SBATCH --output=MpiJob.out
    #SBATCH --time=00:10:00
    #SBATCH --ntasks=3
    #SBATCH --cpus-per-task=1
    #SBATCH --mem-per-cpu=100M

    module load mpi/openmpi-x86_64

    mpicc -o program program.c

    srun -n 3 --mpi=pmi2 ./program

Output

  hello MPI user: from process = 0 on machine=discovery-c6, of NCPU=3 processes
  hello MPI user: from process = 2 on machine=discovery-c6, of NCPU=3 processes
  hello MPI user: from process = 1 on machine=discovery-c6, of NCPU=3 processes

Explanation

In the job steps, the C program is first compiled with the MPI compiler wrapper, and then the srun command executes the program. The -n flag is the number of MPI processes to run and corresponds to --ntasks; if -n isn't specified, srun defaults to the job's --ntasks=3 processes.

The script executed 3 processes, and the distinct process numbers show that each MPI process reported its rank. All processes executed on the discovery-c6 node. Adjusting the value of the -n flag adjusts the number of processes; if it's set to 1, only one print statement is shown: hello MPI user: from process = 0 on machine=discovery-c6, of NCPU=1 processes.

Running OpenMp Jobs

OpenMP (Open Multi-Processing) is a compiler add-on. It targets shared-memory systems, i.e., systems in which the processors share main memory, and is based on a threading approach.

It launches a single process, which in turn creates as many threads as the user directs, as appropriate for the task at hand.

OpenMP is a set of compiler hints and function calls that let you run sections of your code in parallel on a shared-memory parallel computer. In OpenMP, all threads share memory and data.

In the example below, the C program starts 4 threads in parallel.

Example

  • C program 'program.c'

    #include <stdio.h>
    #include <omp.h>

    int main(int argc, char *argv[])
    {
      #pragma omp parallel
      {
        int NCPU, tid, NPR, NTHR;

        /* get the total number of CPUs/cores available for OpenMP */
        NCPU = omp_get_num_procs();

        /* get the current thread ID in the parallel region */
        tid = omp_get_thread_num();

        /* get the total number of threads available in this parallel region */
        NPR = omp_get_num_threads();

        /* get the total number of threads requested */
        NTHR = omp_get_max_threads();

        /* only execute this on the master thread! */
        if (tid == 0) {
            printf("%i : NCPU\t= %i\n", tid, NCPU);
            printf("%i : NTHR\t= %i\n", tid, NTHR);
            printf("%i : NPR\t= %i\n", tid, NPR);
        }
        printf("%i : hello user! I am thread %i out of %i\n", tid, tid, NPR);
      }
      return 0;
    }

  • Batch script 'script.sh'

    #!/bin/bash

    #SBATCH --job-name OmpJob       ## Job name
    #SBATCH --output Omp.out        ## Output file
    #SBATCH --time 00:30:00         ## Amount of time needed DD-HH:MM:SS
    #SBATCH --cpus-per-task 4       ## Number of threads
    #SBATCH --mem-per-cpu 100M      ## Memory per CPU

    module load openmpi

    gcc -fopenmp program.c -o program

    export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

    srun --mpi=pmi2 ./program

Output

3 : hello user! I am thread 3 out of 4
2 : hello user! I am thread 2 out of 4
1 : hello user! I am thread 1 out of 4
0 : NCPU        = 4
0 : NTHR        = 4
0 : NPR         = 4
0 : hello user! I am thread 0 out of 4

Explanation

By default, Slurm doesn't know which cores to assign to which process it runs. For threaded applications, you need to make sure that all the cores you request are on the same node; the OpenMP script above does this with --cpus-per-task, which keeps all the requested cores on one node and tells Slurm that a single process gets all the cores requested for threading.

The OMP_NUM_THREADS environment variable specifies the number of threads to use in parallel regions. By adjusting its value, you can adjust the number of execution threads; here it's set to the number of CPUs allocated to the task ($SLURM_CPUS_PER_TASK).
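For example, running the compiled program with a smaller thread count (a quick sketch, assuming the program was already built as above) prints only two "hello user!" lines, one per thread:

$ export OMP_NUM_THREADS=2
$ ./program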

stdout                                      Description
------------------------------------------  -----------------------------------------
3 : hello user! I am thread 3 out of 4      Thread ID
2 : hello user! I am thread 2 out of 4      Thread ID
1 : hello user! I am thread 1 out of 4      Thread ID
0 : NCPU = 4                                Total number of CPUs available for OpenMP
0 : NTHR = 4                                Total number of threads requested
0 : NPR = 4                                 Threads available in the parallel region
0 : hello user! I am thread 0 out of 4      Thread ID