How Slurm Works

A Slurm script has four main parts, all of which are required for your job to be processed successfully.

Breakdown of Bash Script

  1. Shebang The shebang line tells the shell (which interprets UNIX commands) to interpret and run the Slurm script using bash (the Bourne-again shell).

    This line should always be added at the very top of your SBATCH/Slurm script.

    #!/bin/bash
  2. Resource Request This section specifies the resources the job needs in order to run on the compute nodes. It tells Slurm the name of the job, the output filename, the amount of RAM, the number of CPUs, nodes, and tasks, the time limit, and other parameters to use when processing the job.

    These #SBATCH lines are also known as SBATCH directives. Each must begin with a pound sign (#) followed by SBATCH in uppercase, as shown below.

    #SBATCH --job-name=TestJob
    #SBATCH --output=TestJob.out
    #SBATCH --time=0-00:10:00
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=1
    #SBATCH --mem-per-cpu=500M
  3. Dependencies This section loads all the software that your project needs to run its scripts. For example, if you’re working on a Python project, you need the Python module to interpret and run your code. See the Module Environments and Commands page for more details about using modules on Discovery.

    module load python
  4. Job Steps Here, specify the list of tasks to be carried out.

    srun echo "Start process"
    srun hostname
    srun sleep 30
    srun echo "End process"

Putting it all together

Please note that the lines with the double pound signs (##) are comments when used in batch scripts.

#!/bin/bash
## Shebang (must stay on the very first line of the file)

## Resource Request
#SBATCH --job-name=TestJob
#SBATCH --output=TestJob.out
#SBATCH --time=0-00:10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=500M

## Job Steps
srun echo "Start process"
srun hostname
srun sleep 30
srun echo "End process"

The script above requests 1 node, 1 CPU, 500 MB of memory per CPU, and 10 minutes of wall time for the tasks (job steps). Note that all the job steps that begin with the srun command execute sequentially, as one task on one CPU.

The first job step runs the Linux echo command and outputs Start process. The second job step prints the hostname of the compute node that executed the job. The third job step runs the Linux sleep command for 30 seconds. The final job step echoes End process. Note that these job steps execute sequentially, not in parallel.
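To try the walkthrough end to end, the following sketch writes a 10-minute version of the job to a file, checks its syntax, and submits it. The file name testjob.sh is an assumption chosen for illustration, and the sbatch and squeue calls run only if Slurm is installed on the machine.

```shell
#!/bin/bash
# Write the example job script to a file (testjob.sh is a hypothetical name).
job_script="testjob.sh"
cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH --job-name=TestJob
#SBATCH --output=TestJob.out
#SBATCH --time=0-00:10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=500M

srun echo "Start process"
srun hostname
srun sleep 30
srun echo "End process"
EOF

# Check the script for shell syntax errors without running it.
bash -n "$job_script" && echo "syntax OK"

# Submit only when Slurm is installed; otherwise just report it.
if command -v sbatch >/dev/null 2>&1; then
    sbatch "$job_script"     # prints the assigned job ID
    squeue -u "$USER"        # lists your pending and running jobs
else
    echo "sbatch not found; run this on a cluster login node"
fi
```

Once the job finishes, the output of the four job steps should appear in TestJob.out, in the order the steps were listed.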

It’s important to set a limit on the total run time of the job allocation. This helps the Slurm manager handle prioritization and queuing efficiently. The script above is very simple and finishes in about 30 seconds, so a short run-time limit keeps Slurm from treating it as a job that requires a lot of time to execute.

It’s important to keep all #SBATCH lines together and at the top of the script. Don’t place any bash code or variable assignments before the last #SBATCH line.
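The reason for this rule is that sbatch stops reading #SBATCH directives at the first line that is neither a comment nor blank, so a directive placed after a command is silently ignored. The sketch below flags such lines; check_sbatch_order and the file name misplaced.sh are hypothetical names used only for illustration, not part of Slurm.

```shell
#!/bin/bash
# check_sbatch_order is a hypothetical helper, not part of Slurm.
# It reports #SBATCH lines that appear after the first executable command.
check_sbatch_order() {
    awk '
        /^#SBATCH/ {
            if (seen_cmd) { printf "line %d ignored: %s\n", NR, $0; bad = 1 }
            next
        }
        /^[ \t]*#/ { next }    # other comments and the shebang
        /^[ \t]*$/ { next }    # blank lines
        { seen_cmd = 1 }       # anything else counts as a command
        END { exit (bad + 0) }
    ' "$1"
}

# Demo: the --time directive comes after a command, so it would be ignored.
printf '%s\n' '#!/bin/bash' '#SBATCH --ntasks=1' 'module load python' \
    '#SBATCH --time=0-00:10:00' > misplaced.sh
check_sbatch_order misplaced.sh || echo "fix the directive order"
```

The helper returns a non-zero status when it finds a misplaced directive, so it can be used as a quick check before submitting.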

#SBATCH Directives/Flags

Here are the directives that will be useful for your job submission.

Your script should begin with the shebang line #!/bin/bash

The sbatch command submits a job script for later execution. The script defines the queue, time, notifications, name, code, and set-up. SBATCH scripts are unique in how they’re read. In shell scripts, any line that starts with # is considered a comment.

Any comment that starts with the word #SBATCH in all caps is treated as a command by Slurm. To comment out a Slurm command, put a second # at the beginning of the line.
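As a quick illustration of the difference between an active and a commented-out directive, the sketch below builds a small sample script and lists each kind with grep. The file name demo.sh and its contents are assumptions made for the example.

```shell
#!/bin/bash
# Build a small sample script (demo.sh is a hypothetical file name).
cat > demo.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=demo
##SBATCH --partition=gpu
#SBATCH --time=0-00:10:00
hostname
EOF

# Lines that start with exactly "#SBATCH" are read by Slurm.
echo "Active directives:"
grep '^#SBATCH' demo.sh

# A second leading "#" turns the directive into an ordinary comment.
echo "Disabled directives:"
grep '^##SBATCH' demo.sh
```

Here the job name and time limit are active, while the partition request is disabled and Slurm falls back to its default.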

#!/bin/bash

#SBATCH --job-name myJobName        ## The name that will show up in the queue
#SBATCH --output myJobName-%j.out   ## Filename of the output; default is slurm-[jobID].out
#SBATCH --partition normal          ## The partition to run in; default = normal
#SBATCH --nodes 1                   ## Number of nodes to use; default = 1
#SBATCH --ntasks 3                  ## Number of tasks (analyses) to run; default = 1
#SBATCH --cpus-per-task 16          ## The number of threads the code will use; default = 1
#SBATCH --mem-per-cpu 700M          ## Memory per allocated CPU
#SBATCH --time 0-00:10:00           ## Time for analysis (day-hour:min:sec)
#SBATCH --mail-user yourID@nmsu.edu ## Your email address
#SBATCH --mail-type BEGIN           ## Slurm will email you when your job starts
#SBATCH --mail-type END             ## Slurm will email you when your job ends
#SBATCH --mail-type FAIL            ## Slurm will email you when your job fails
#SBATCH --get-user-env              ## Passes along environmental settings

The "cpus-per-task" value times the "ntasks" value must not exceed the number of threads available on the requested nodes. For example, 1 node has a maximum of 48 threads, so "cpus-per-task" times "ntasks" must be less than or equal to 48; otherwise, you will get back an error.
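The arithmetic behind this limit can be checked before submitting. The following is a minimal sketch; fits_on_node is a hypothetical helper, and the default of 48 threads per node is the figure quoted above, which may differ on other clusters.

```shell
#!/bin/bash
# fits_on_node is a hypothetical helper, not a Slurm command.
# Usage: fits_on_node NTASKS CPUS_PER_TASK [THREADS_PER_NODE]
fits_on_node() {
    local ntasks=$1 cpus_per_task=$2 threads_per_node=${3:-48}
    local total=$(( ntasks * cpus_per_task ))
    if (( total <= threads_per_node )); then
        echo "OK: $total of $threads_per_node threads"
    else
        echo "Too many: $total exceeds $threads_per_node threads"
        return 1
    fi
}

fits_on_node 3 16    # the header above: 3 tasks x 16 threads = 48
fits_on_node 4 16 || echo "reduce --ntasks or --cpus-per-task"
```

The first call fits exactly (48 of 48 threads); the second asks for 64 threads and is reported as too large for a single node.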

--job-name

Specifies a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just sbatch if the script is read on sbatch’s standard input.

--output

Instructs Slurm to connect the batch script’s standard output directly to the named file. If not specified, the default filename is slurm-jobID.out, where jobID is the job’s ID number.
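The %j placeholder used in the example header is replaced by the job ID when Slurm creates the output file. The sketch below only mimics that substitution so you can preview the resulting name; expand_output_name is a hypothetical helper and the job ID 12345 is made up.

```shell
#!/bin/bash
# expand_output_name is a hypothetical helper that imitates how the
# %j placeholder is filled in; Slurm does this itself at run time.
expand_output_name() {
    # $1 = filename pattern, $2 = job ID
    printf '%s\n' "$1" | sed "s/%j/$2/g"
}

expand_output_name "myJobName-%j.out" 12345   # prints myJobName-12345.out
expand_output_name "slurm-%j.out" 12345       # prints slurm-12345.out
```

Including %j in the name keeps repeated runs of the same script from overwriting each other’s output.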

--partition

Requests a specific partition for the resource allocation (gpu, interactive, normal). If not specified, the default partition is normal.

--nodes

Requests the number of nodes to assign to the job. If this parameter isn’t specified, the default behavior is to assign enough nodes to satisfy the requirements of the --ntasks and --cpus-per-task options. However, suppose you specify one node (--nodes 1) and 32 tasks (--ntasks 32) in your job script. Your job then requires 32 CPUs on a single node. If no single node has that many CPUs, your job will fail with a resource error because you restricted it to one node. Therefore, it’s advisable to leave the --nodes directive out of your job submission scripts.

--ntasks

This option advises the Slurm controller that job steps run within the allocation will launch at most this number of tasks, and asks it to provide sufficient resources. The default is 1 task per node, but note that the --cpus-per-task option will change this default.

--cpus-per-task

Advises the Slurm controller that ensuing job steps will require the given number of processors per task. Without this option, the controller will just try to assign one processor per task. For instance, consider an application that has 4 tasks, each requiring 3 processors. If the HPC cluster consists of quad-processor nodes and you simply ask for 12 processors, the controller might give you only 3 nodes. However, with the --cpus-per-task=3 option, the controller knows that each task requires 3 processors on the same node, and it will grant an allocation of 4 nodes, one for each of the 4 tasks.
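The node counts in this example follow from simple integer arithmetic, sketched below. The variable names are chosen for illustration, and the calculation is a simplified model of the packing, not Slurm’s actual scheduling logic.

```shell
#!/bin/bash
# Numbers from the example above: 4 tasks, 3 processors each,
# on nodes with 4 processors.
tasks=4
cpus_per_task=3
cpus_per_node=4

# A flat request for 12 processors packs into ceil(12 / 4) = 3 nodes,
# which would split at least one task across two nodes.
flat_nodes=$(( (tasks * cpus_per_task + cpus_per_node - 1) / cpus_per_node ))

# With --cpus-per-task=3 each task must fit whole on one node, so a node
# holds floor(4 / 3) = 1 task and one node per task is needed.
tasks_per_node=$(( cpus_per_node / cpus_per_task ))
grouped_nodes=$(( (tasks + tasks_per_node - 1) / tasks_per_node ))

echo "flat request of 12 processors: $flat_nodes nodes"
echo "with --cpus-per-task=3:        $grouped_nodes nodes"
```

The difference of one node is the cost of keeping each task’s 3 processors together on the same machine.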

--mem-per-cpu

This is the minimum memory required per allocated CPU. Note: It’s highly recommended that you specify --mem-per-cpu. If you don’t, the default setting of 500MB will be assigned per CPU.
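Because the value applies to every allocated CPU, the memory a job asks for in total grows with the task and CPU counts. The sketch below estimates that total in megabytes; total_mem_mb is a hypothetical helper, and it assumes the number of allocated CPUs equals ntasks times cpus-per-task.

```shell
#!/bin/bash
# total_mem_mb is a hypothetical helper, not a Slurm command.
# Estimated memory (MB) = mem-per-cpu x ntasks x cpus-per-task.
total_mem_mb() {
    local mem_per_cpu_mb=$1 ntasks=$2 cpus_per_task=$3
    echo $(( mem_per_cpu_mb * ntasks * cpus_per_task ))
}

total_mem_mb 500 1 1     # default 500MB, 1 task, 1 CPU: prints 500
total_mem_mb 700 3 16    # 700MB on 3 x 16 CPUs: prints 33600
```

A modest per-CPU value can therefore add up to tens of gigabytes once many CPUs are requested.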

--time

Sets a limit on the total run time of the job allocation. If the requested time limit exceeds the partition’s time limit, the job will be left in a PENDING state (possibly indefinitely). The default time limit is the partition’s default time limit. A time limit of zero requests that no time limit be imposed. The acceptable time format is days-hours:minutes:seconds. Note: You must specify a time in your script. Jobs that don’t specify a time are given a default time of 1 minute, after which the job is killed. This change was made to implement the new backfill scheduling algorithm, and it doesn’t affect partition wall time.
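The days field is easy to misread, so it can help to convert a limit to seconds before submitting. The sketch below does this for the full days-hours:minutes:seconds form only; to_seconds is a hypothetical helper and does not handle the shorter formats Slurm also accepts.

```shell
#!/bin/bash
# to_seconds is a hypothetical helper that converts the
# days-hours:minutes:seconds form into a number of seconds.
to_seconds() {
    local days hms h m s
    days=${1%%-*}    # text before the dash
    hms=${1#*-}      # text after the dash
    IFS=: read -r h m s <<< "$hms"
    # 10# forces base 10 so values such as 08 are not read as octal
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

to_seconds 0-00:10:00    # 10 minutes: prints 600
to_seconds 1-00:10:00    # 1 day and 10 minutes: prints 87000
```

Note how a single digit in the days field changes a 10-minute request into one of more than a day.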

--mail-user

Defines the user who will receive email notification of state changes as defined by --mail-type.

--mail-type

Notifies the user by email when certain event types occur. Valid type values include BEGIN, END, and FAIL. The user to be notified is indicated with --mail-user.
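Slurm also accepts several event types in a single comma-separated --mail-type value, so the three notification lines from the example header can be written more compactly. This fragment shows only the directive lines, not a complete script, and the address is a placeholder.

```shell
#SBATCH --mail-user=yourID@nmsu.edu    ## placeholder address
#SBATCH --mail-type=BEGIN,END,FAIL     ## email on job start, end, and failure
```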