Job Management

This page explains how to manage your Slurm jobs, how to find detailed information about a job such as its memory usage and CPU allocation, and how to use job statistics to troubleshoot job failures.

Check Running/Queued Jobs

To check the running/queued jobs, type the squeue command.

squeue

Output

JOBID   PARTITION  NAME       USER     ST  TIME     NODES  NODELIST(REASON)
769059  interacti  ci-6023_   crushev  CG  0:18     1      discovery-c35
760164  normal     SpringTe   mccg     R   2:59:50  1      discovery-c2
760168  normal     test       andrew   PD  0:00     1      (Resources)
760170  backfill   molecular  kevin    PD  0:00     1      (Priority)
760172  interacti  ellipse_   crushev  PD  0:00     1      (ReqNodeNotAvail)

This shows all the jobs in the queue. To display only your jobs, run squeue -u <your-username>. The ST and NODELIST(REASON) fields can be used to track a job’s progress and to identify why a job failed or is still waiting for resources.

You can also run watch squeue for a live view of jobs entering and leaving the queue.
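
For scripted monitoring, the same information can be collected in a form that is easier to parse. The sketch below assumes squeue is invoked as squeue -h -u <your-username> -o "%i|%P|%t|%R" (no header; pipe-separated job ID, partition, compact state, and node list or reason). The helper names and the sample text are illustrative and not part of Slurm.

```python
from collections import Counter

# Sample text in the shape produced by: squeue -h -o "%i|%P|%t|%R"
# (rows adapted from the output above; the format string is an assumption)
SAMPLE = """\
769059|interactive|CG|discovery-c35
760164|normal|R|discovery-c2
760168|normal|PD|(Resources)
760170|backfill|PD|(Priority)
760172|interactive|PD|(ReqNodeNotAvail)
"""


def parse_squeue(text):
    """Turn pipe-separated squeue lines into a list of dictionaries."""
    jobs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        job_id, partition, state, where = line.split("|", 3)
        jobs.append({
            "id": job_id,
            "partition": partition,
            "state": state,
            "nodelist_reason": where,
        })
    return jobs


def count_by_state(jobs):
    """Count how many jobs are in each state code (R, PD, CG, ...)."""
    return Counter(job["state"] for job in jobs)


jobs = parse_squeue(SAMPLE)
print(count_by_state(jobs))
```

To work on live data, the sample text could be replaced with the standard output of subprocess.run(["squeue", "-h", "-u", user, "-o", "%i|%P|%t|%R"], capture_output=True, text=True).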

Job State

The ST column above gives the state of your job. Some important state codes which the users need to be aware of are:

State Code   Meaning
CD           Completed
CG           Completing
F            Failure
OOM          Out of Memory
PD           Pending
R            Running
TO           Timeout

CD means that the job completed successfully, whereas CG means that the job is in the process of finishing.
F means that the job terminated with a non-zero exit code or another failure condition.
OOM means that the job ran out of memory.
PD means that the job is waiting for resource allocation. Use the NodeList(Reason) field to find out why the job has not started.
R means that the job is running with the resources requested in the job script (the #SBATCH directives).
TO means that the job was terminated after reaching its time limit.
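
The compact codes in the ST column correspond to the longer state names that sacct prints later on this page (for example OUT_OF_MEMORY and TIMEOUT). A minimal lookup sketch, covering only the codes listed above (Slurm defines more), might look like this; the function name is hypothetical.

```python
# Compact squeue state codes mapped to (long state name, short meaning).
# Only the codes discussed above are included; Slurm has additional states.
STATE_CODES = {
    "CD": ("COMPLETED", "the job completed successfully"),
    "CG": ("COMPLETING", "the job is in the process of finishing"),
    "F": ("FAILED", "the job ended with a non-zero exit code or other failure"),
    "OOM": ("OUT_OF_MEMORY", "the job ran out of memory"),
    "PD": ("PENDING", "the job is waiting for resource allocation"),
    "R": ("RUNNING", "the job is running"),
    "TO": ("TIMEOUT", "the job was terminated after reaching its time limit"),
}


def describe_state(code):
    """Return a readable description of a compact state code."""
    if code not in STATE_CODES:
        return f"unknown state code: {code}"
    long_name, meaning = STATE_CODES[code]
    return f"{long_name}: {meaning}"


for code in ("PD", "TO", "XX"):
    print(code, "->", describe_state(code))
```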

NodeList(Reason)

NodeList(Reason) shows the nodes on which the job is currently running. For a job in the PD state, this field instead gives the reason why the job is pending.
(Resources) The job is waiting for resources to become available because the cluster is too busy to run it right now. The job will start once the requested resources become available.
(Priority) The job is queued behind higher-priority jobs.
(ReqNodeNotAvail) A node required for the job is not currently available. The node may be in use, reserved for another job, part of an advance reservation, or reserved for maintenance.
(QoSJobLimit) The job’s QoS has reached its maximum job count. This occurs only when you submit more jobs than the QoS limit allows. To find the maximum number of jobs that you can submit to a partition, refer to the QoS page.

The "sacct" command

The sacct command reports accounting information and statistics for a job, so it can be used to troubleshoot problems during job execution that led to a job failure.
By default, the sacct command displays JobID, JobName, Partition, Account, AllocCPUS, State, and ExitCode.
You can use the --format flag to choose which fields are displayed.

The --format flag accepts any of the fields listed below. To print the list of available fields, run the following command:

sacct -e

Output

Account             AdminComment        AllocCPUS           AllocNodes
AllocTRES           AssocID             AveCPU              AveCPUFreq
AveDiskRead         AveDiskWrite        AvePages            AveRSS
AveVMSize           BlockID             Cluster             Comment
Constraints         ConsumedEnergy      ConsumedEnergyRaw   CPUTime
CPUTimeRAW          DBIndex             DerivedExitCode     Elapsed
ElapsedRaw          Eligible            End                 ExitCode
Flags               GID                 Group               JobID
JobIDRaw            JobName             Layout              MaxDiskRead
MaxDiskReadNode     MaxDiskReadTask     MaxDiskWrite        MaxDiskWriteNode
MaxDiskWriteTask    MaxPages            MaxPagesNode        MaxPagesTask
MaxRSS              MaxRSSNode          MaxRSSTask          MaxVMSize
MaxVMSizeNode       MaxVMSizeTask       McsLabel            MinCPU
MinCPUNode          MinCPUTask          NCPUS               NNodes
NodeList            NTasks              Priority            Partition
QOS                 QOSRAW              Reason              ReqCPUFreq
ReqCPUFreqMin       ReqCPUFreqMax       ReqCPUFreqGov       ReqCPUS
ReqMem              ReqNodes            ReqTRES             Reservation
ReservationId       Reserved            ResvCPU             ResvCPURAW
Start               State               Submit              Suspended
SystemCPU           SystemComment       Timelimit           TimelimitRaw
TotalCPU            TRESUsageInAve      TRESUsageInMax      TRESUsageInMaxNode
TRESUsageInMaxTask  TRESUsageInMin      TRESUsageInMinNode  TRESUsageInMinTask
TRESUsageInTot      TRESUsageOutAve     TRESUsageOutMax     TRESUsageOutMaxNode
TRESUsageOutMaxTask TRESUsageOutMin     TRESUsageOutMinNode TRESUsageOutMinTask
TRESUsageOutTot     UID                 User                UserCPU
WCKey               WCKeyID             WorkDir

Important fields that can be used to diagnose job issues are:

Field      Description
JobID      ID of the job.
JobName    Name of the job.
AllocCPUS  Number of allocated CPUs. Equal to NCPUS.
ReqCPUS    Number of CPUs requested.
ReqMem     Minimum memory required for the job, in MB. A trailing c denotes memory per CPU and a trailing n denotes memory per node.
AveRSS     Average memory use of all tasks in the job.
MaxRSS     Maximum memory use of any task in the job.
Start      Start time of the job, in the same format as End.
End        Termination time of the job.
Elapsed    Time taken by the job.
State      State of the job.
ExitCode   Exit code returned by the job.

For more information about the sacct command and how to pass the --format flag with the fields above to retrieve job statistics, see the separate Sacct Command documentation.
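
When the statistics are consumed by a script rather than read by eye, sacct's --parsable2 option (pipe-separated output with a header row) is easier to handle than the padded columns. The following is a minimal sketch, not a definitive tool: the function names are hypothetical, and the sample text reuses values from the timeout example further down this page.

```python
import subprocess


def build_sacct_command(job_id, fields):
    """Build the argument list for a sacct call with pipe-separated output."""
    return ["sacct", "-j", str(job_id),
            "--format=" + ",".join(fields), "--parsable2"]


def parse_parsable2(text):
    """Parse --parsable2 output into one dictionary per row."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split("|")
    return [dict(zip(header, line.split("|"))) for line in lines[1:]]


def job_stats(job_id, fields):
    """Run sacct (requires a Slurm installation) and return parsed rows."""
    result = subprocess.run(build_sacct_command(job_id, fields),
                            capture_output=True, text=True, check=True)
    return parse_parsable2(result.stdout)


# Sample output, so the parser can be tried without a cluster.
SAMPLE = """\
JobID|JobName|Elapsed|State|ExitCode
822800|test|00:01:17|TIMEOUT|0:0
822800.batch|batch|00:01:19|CANCELLED|0:15
822800.0|echo|00:00:01|COMPLETED|0:0
822800.1|sleep|00:01:18|CANCELLED|0:15
"""

rows = parse_parsable2(SAMPLE)
not_completed = [row["JobID"] for row in rows if row["State"] != "COMPLETED"]
print(not_completed)
```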

Common Problems

Some common problems faced by Slurm users, along with tips for diagnosing them with the sacct command, are discussed below.

Out of Memory Issues

Jobs can fail if the memory that the job actually needs exceeds the memory requested for it. Consider the example below.

script.sh

#!/bin/bash

#SBATCH --job-name dataset-processing   ## name that will show up in the queue
#SBATCH --output result.out   ## filename of the output; a %j in the name expands to the job ID; default is slurm-[jobID].out
#SBATCH --ntasks=1  ## number of tasks (analyses) to run
#SBATCH --cpus-per-task=1  ## the number of threads allocated to each task
#SBATCH --mem-per-cpu=50M  # memory per CPU core
#SBATCH --partition=interactive  ## the partitions to run in (comma separated)
#SBATCH --time=0-01:00:00  ## time for analysis (day-hour:min:sec)

## Load modules
module load anaconda
conda activate my_env

#Run the program
srun python script.py data/Human_Activity_Recognition_Using_Smartphones_Data.csv

The script above requests 1 node, 1 CPU, 50 MB of memory per CPU, and 1 hour of wall time for a single task. The output of the job is stored in result.out, and the partition is set to interactive, which is ideal for debugging and short jobs.

After the resource request, the anaconda module is loaded using the module system on Discovery. Next, the custom anaconda environment my_env is activated; it contains all the packages that the Python script script.py needs. For a more detailed explanation of anaconda environments on Discovery, refer to the Anaconda Virtual Environments tutorials.

script.py

#Importing packages.
import sys
import os
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

dataset = sys.argv[1]

if dataset == "data/Human_Activity_Recognition_Using_Smartphones_Data.csv":
    print("-----Processing the Human Activity Recognition Dataset-----\n")
else:
    print("Invalid dataset name entered")
    sys.exit(1)

df = pd.read_csv(dataset)

print(df.head(5))

#Data statistics
#Printing the number of rows and columns
print(df.info())

print("The number of rows\n")
print(len(df))

print("The number of columns\n")
print(len(df.columns))

print("Dataframe shape\n")
print(df.shape)

#Check for any N/A values

print("Checking for any N/A values\n")
print(df.isna().values.any())

#Check for any Null values
print("Checking for any null values\n")
print(df.isnull().values.any())

Y = pd.DataFrame(data=df['Activity'])
X = df.drop(['Activity'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X,Y,random_state=1, test_size=0.2)


print('-----------------------------')
print('DecisionTree Test was Called. Wait...')

depths= list(range(1, 31))
trainAccuracy = []
testAccuracy = []

for i in depths:
    clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=i, random_state=1)
    clf_gini.fit(X_train, y_train.values.ravel())

    y_pred_gini = clf_gini.predict(X_test)
    y_train_pred_gini = clf_gini.predict(X_train)

    #Storing the metrics
    trainAccuracy.append(accuracy_score(y_train, y_train_pred_gini))
    testAccuracy.append(accuracy_score(y_test, y_pred_gini))

print(trainAccuracy)
print(testAccuracy)

The script.py script preprocesses the Human Activity Recognition dataset and trains a decision tree machine learning model. The dataset is located at data/Human_Activity_Recognition_Using_Smartphones_Data.csv, and the folder structure is shown below.

project
|_ script.sh
|_ script.py
|_ data
   |_ Human_Activity_Recognition_Using_Smartphones_Data.csv

Also note the size of the dataset: the file is 68M. The command below prints the file size in human-readable form (here in MB).

[username@discovery-l2 data]$ ls -lh
total 68M
-rw-r----- 1 <username> <username> 68M Jul 29 15:12 Human_Activity_Recognition_Using_Smartphones_Data.csv

The Python script takes one command-line argument, the relative path to the dataset file. If the argument is not equal to data/Human_Activity_Recognition_Using_Smartphones_Data.csv, the script prints an error message and exits. Otherwise, it preprocesses the dataset and runs the decision tree machine learning algorithm.

Submit the Job

Submit the above job and see how it runs. To submit the above job, run the following command.

sbatch script.sh
Submitted batch job 832679

Job Statistics

After the job exits from the queue, the below sacct command helps to report the job statistics.

sacct -j 832679 --format=jobid,jobname,reqcpus,reqmem,averss,maxrss,elapsed,state%20,exitcode --unit=M

Output

       JobID    JobName  ReqCPUS     ReqMem     AveRSS     MaxRSS    Elapsed                State ExitCode
------------ ---------- -------- ---------- ---------- ---------- ---------- -------------------- --------
832679       dataset-p+        1       50Mc                         00:00:33        OUT_OF_MEMORY    0:125
832679.batch      batch        1       50Mc      4.02M      4.02M   00:00:33        OUT_OF_MEMORY    0:125
832679.exte+     extern        1       50Mc          0          0   00:00:33        OUT_OF_MEMORY    0:125
832679.0         python        1       50Mc     70.65M     70.65M   00:00:32        OUT_OF_MEMORY    0:125

Explanation

The sacct output above shows that the job ran into an out-of-memory problem, as reported in the State field. The job script requested 50M of memory per CPU, but the dataset that the Python script loads is 68M on its own, which is already more than the memory requested for the job. That is why the job failed. The AveRSS and MaxRSS fields confirm this. AveRSS is the average memory (RAM) used by the process, and MaxRSS is the peak memory (RAM) used by the process. Slurm’s accounting mechanism collects these statistics and makes them available to users through the sacct command. The MaxRSS field reports 70.65M for the job step (832679.0), which is greater than the 50M requested for the job.
Hence, the AveRSS and MaxRSS fields are very handy for troubleshooting jobs that failed or were cancelled because of out-of-memory problems.
Slurm also writes the reason for the failure to the job’s output file, result.out, so this file can also be used to diagnose the problem.
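
The comparison described above (peak memory against requested memory) can be automated. The sketch below is a hedged illustration with hypothetical helper names: it assumes sizes carry a K, M, G, or T unit letter as in the output above, that a trailing c or n on ReqMem means per CPU or per node, and that a ReqMem value without a suffix is already a total. Comparing one step's MaxRSS with the job total is a simplification that fits this single-task job.

```python
import re

# Multipliers to megabytes for the unit letters that sacct prints.
UNIT_TO_MB = {"K": 1 / 1024, "M": 1.0, "G": 1024.0, "T": 1024.0 ** 2}


def size_to_mb(value):
    """Convert a size such as '70.65M' to megabytes ('' and '0' give 0.0)."""
    value = value.strip()
    if value in ("", "0"):
        return 0.0
    match = re.fullmatch(r"([0-9]+(?:\.[0-9]+)?)([KMGT])", value)
    if match is None:
        raise ValueError(f"unrecognised size: {value!r}")
    number, unit = match.groups()
    return float(number) * UNIT_TO_MB[unit]


def requested_mb(reqmem, cpus=1, nodes=1):
    """Total memory requested, given ReqMem such as '50Mc' or '4Gn'."""
    scope = reqmem[-1] if reqmem[-1] in "cn" else ""
    size = size_to_mb(reqmem[:-1] if scope else reqmem)
    if scope == "c":
        return size * cpus   # memory per CPU
    if scope == "n":
        return size * nodes  # memory per node
    return size  # assumption: no suffix means the value is a total


def exceeded_request(reqmem, maxrss, cpus=1, nodes=1):
    """True if the peak memory use is larger than the requested memory."""
    return size_to_mb(maxrss) > requested_mb(reqmem, cpus, nodes)


# Values from the sacct output above.
print(exceeded_request("50Mc", "70.65M"))  # step 832679.0
print(exceeded_request("50Mc", "4.02M"))   # step 832679.batch
```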

On opening the generated result.out file, it shows the following information.

vi result.out
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=832679.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: discovery-c34: task 0: Out Of Memory
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=832679.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.

The output above means that Slurm detected the job using more memory than it requested.

The --unit flag at the end of the sacct command (the full option name in the sacct documentation is --units) displays values in the specified unit. The possible values are --units=[KMGTP].

Solution

Increasing --mem-per-cpu to 100M or 150M should be enough for the Python script to process the dataset and for the job to complete successfully.
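
One way to arrive at such a figure is to add headroom to the observed MaxRSS and round up to a convenient step. The sketch below is only an illustration: the function name, the headroom factors, and the 50 MB step are assumptions, not site policy. Keep in mind that the MaxRSS of a job killed for running out of memory is a lower bound, since the process was stopped before it could grow further.

```python
import math


def suggest_mem_mb(max_rss_mb, headroom=1.4, step_mb=50):
    """Scale the observed peak by a headroom factor and round up to a step."""
    return int(math.ceil(max_rss_mb * headroom / step_mb) * step_mb)


# MaxRSS of step 832679.0 from the sacct output above, in MB.
observed = 70.65
print(suggest_mem_mb(observed))                # modest headroom
print(suggest_mem_mb(observed, headroom=2.0))  # more generous headroom
```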

Time Out Issues

Consider the Slurm script script.sh below.

#!/bin/bash

#SBATCH --job-name test   ## name that will show up in the queue
#SBATCH --output result.out   ## filename of the output; a %j in the name expands to the job ID; default is slurm-[jobID].out
#SBATCH --ntasks=1  ## number of tasks (analyses) to run
#SBATCH --cpus-per-task=1  ## the number of cpus allocated to each task
#SBATCH --mem-per-cpu=50M  # memory per CPU core
#SBATCH --partition=interactive  ## the partitions to run in (comma separated)
#SBATCH --time=0-00:01:00  ## time for analysis (day-hour:min:sec)

srun echo "Starting Process"
srun sleep 180
srun hostname
srun echo "Ending Process"

Explanation of Job Script

The script above requests 1 node, 1 CPU, 50 MB of memory per CPU, and 1 minute of wall time for the tasks (job steps). Each line that begins with srun is a job step, and the job steps execute sequentially, one after another, on the single allocated CPU. The output of the job is stored in result.out, which is created in the directory from which the job is submitted (here, the directory that contains script.sh).
The first job step runs the Linux echo command and prints Starting Process. The second job step runs the Linux sleep command for 180 seconds. The third job step prints the hostname of the compute node that executes the job. The final job step prints Ending Process. These job steps execute sequentially, not in parallel.
The key thing to note in this job script is that the wall time is set to 0-00:01:00, which equals 1 minute, whereas the second job step needs to run the sleep command for 180 seconds.
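
The mismatch between the time limit and the work to be done can be checked before submitting. The sketch below converts a --time value into seconds, assuming the formats accepted by sbatch (minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes, and days-hours:minutes:seconds); the function name is hypothetical.

```python
def slurm_time_to_seconds(spec):
    """Convert a Slurm time specification such as '0-00:01:00' to seconds."""
    days = 0
    if "-" in spec:
        # days-hours[:minutes[:seconds]]
        day_part, rest = spec.split("-", 1)
        days = int(day_part)
        parts = [int(p) for p in rest.split(":")]
        parts += [0] * (3 - len(parts))
        hours, minutes, seconds = parts
    else:
        parts = [int(p) for p in spec.split(":")]
        if len(parts) == 1:      # minutes
            hours, minutes, seconds = 0, parts[0], 0
        elif len(parts) == 2:    # minutes:seconds
            hours, minutes, seconds = 0, parts[0], parts[1]
        else:                    # hours:minutes:seconds
            hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


limit = slurm_time_to_seconds("0-00:01:00")  # the --time value in script.sh
needed = 180                                 # the sleep step alone
print(limit, needed, limit >= needed)
```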

Submit the Job

Submit the above job and see how it runs. To submit the above job, run the following command.

sbatch script.sh
Submitted batch job 822800

Watch Live Status of the Job

To watch the live status of the above job, run the following command:

watch squeue -u <discovery-username>

Squeue Command Output

Every 2.0s: squeue -u <discovery-username>                                                                                                                                                       Tue Aug 24 23:04:30 2021

             JOBID PARTITION     NAME     USER              ST       TIME  NODES NODELIST(REASON)
            822800 interacti     test  <discovery-username>  R       0:24      1 discovery-c34

To exit the live status of the watch squeue command, press Ctrl + C

After the job exits from the queue, run the below sacct command to check the status of the job.

sacct -j 822800 --format=jobid,jobname,elapsed,state,exitcode

Sacct Command Output

       JobID    JobName    Elapsed      State ExitCode
------------ ---------- ---------- ---------- --------
822800             test   00:01:17    TIMEOUT      0:0
822800.batch      batch   00:01:19  CANCELLED     0:15
822800.exte+     extern   00:01:17  COMPLETED      0:0
822800.0           echo   00:00:01  COMPLETED      0:0
822800.1          sleep   00:01:18  CANCELLED     0:15

Explanation

The output above shows the statistics of the job. Rows 1 and 2 are default entries: the first represents the job as a whole, and the second (822800.batch) accounts for the batch script. The fourth row is the first job step (822800.0), which was launched with srun, completed successfully, and took 00:00:01 (one second). However, the second job step (822800.1), the Linux sleep command, was cancelled because it was set to sleep for 180 seconds, which exceeds the 1-minute wall time specified for the job. Because the second job step could not finish in time, the job as a whole could not complete, so the State column of the first row (822800) reports TIMEOUT, which is the reason Slurm cancelled the job. The remaining two job steps (hostname and the final echo) never started, which is why they do not appear in the output.

Solution

To let all the job steps complete successfully, set the wall time of the job to #SBATCH --time=0-00:05:00, which equals 5 minutes and is sufficient for the job to complete.

Additional Insights

The generated result.out file can also be used to troubleshoot the job failure. Open the result.out file:

vi result.out
Starting Process
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** JOB 822800 ON discovery-c34 CANCELLED AT 2021-08-24T23:45:42 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 822800.1 ON discovery-c34 CANCELLED AT 2021-08-24T23:45:42 DUE TO TIME LIMIT ***

The first line is the output of the first job step (srun echo "Starting Process"). Because the second job step (822800.1), shown in the last line, ran out of time, Slurm cancelled it and gave the reason as DUE TO TIME LIMIT. As a result, the whole job (822800) was cancelled by Slurm.
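
Since both failure types on this page leave a recognisable message in the output file, a small scan of result.out can give a first diagnosis. The sketch below is an assumption-laden illustration: the patterns are taken from the two log excerpts shown on this page, and other Slurm versions or configurations may word the messages differently.

```python
import re

# Message fragments copied from the log excerpts on this page.
PATTERNS = {
    "out_of_memory": re.compile(r"oom-kill event|Out Of Memory", re.IGNORECASE),
    "time_limit": re.compile(r"DUE TO TIME LIMIT"),
}


def diagnose(log_text):
    """Return the names of all failure patterns found in the log text."""
    return [name for name, pattern in PATTERNS.items()
            if pattern.search(log_text)]


TIMEOUT_LOG = """\
Starting Process
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** JOB 822800 ON discovery-c34 CANCELLED AT 2021-08-24T23:45:42 DUE TO TIME LIMIT ***
"""

OOM_LOG = """\
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=832679.0 cgroup.
srun: error: discovery-c34: task 0: Out Of Memory
"""

print(diagnose(TIMEOUT_LOG))
print(diagnose(OOM_LOG))
```

For a real job, the text could be read with open("result.out").read() and passed to diagnose.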

The "scontrol" command

The scontrol command can also be used to get detailed information about a job that is still in the queue or has completed recently. For jobs that finished longer ago, use sacct instead.

Syntax

scontrol show job <job-id>

Output

   UserId=crushev(723778) GroupId=crushev(723778) MCS_label=N/A
   Priority=191096 Nice=0 Account=nmsu QOS=normal
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:02 TimeLimit=00:10:00 TimeMin=N/A
   SubmitTime=2021-08-13T01:43:43 EligibleTime=2021-08-13T01:43:43
   AccrueTime=2021-08-13T01:43:43
   StartTime=2021-08-13T01:43:43 EndTime=2021-08-13T01:43:45 Deadline=N/A
   PreemptEligibleTime=2021-08-13T01:43:43 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-13T01:43:43
   Partition=normal AllocNode:Sid=discovery-l2:177992
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=discovery-c3
   BatchHost=discovery-c3
   NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=1,mem=100M,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryCPU=100M MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/crushev/sample-python/script.sh
   WorkDir=/home/crushev/sample-python
   StdErr=/home/crushev/sample-python/maxFib.out
   StdIn=/dev/null
   StdOut=/home/crushev/sample-python/maxFib.out
   Power=
   NtasksPerTRES:0
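
The Key=Value layout of this output lends itself to simple parsing. The following sketch splits the text on whitespace and keeps every token that contains an equals sign; the function name is hypothetical, and the approach assumes that values contain no spaces, which holds for the output above but not necessarily for every job (for example a command with arguments).

```python
def parse_scontrol(text):
    """Collect Key=Value tokens from 'scontrol show job' output into a dict."""
    fields = {}
    for token in text.split():
        key, separator, value = token.partition("=")
        if separator:  # skip tokens without '=', such as 'NtasksPerTRES:0'
            fields[key] = value
    return fields


# A few lines copied from the output above.
SAMPLE = """\
   JobState=COMPLETED Reason=None Dependency=(null)
   RunTime=00:00:02 TimeLimit=00:10:00 TimeMin=N/A
   TRES=cpu=1,mem=100M,node=1,billing=1
   MinCPUsNode=1 MinMemoryCPU=100M MinTmpDiskNode=0
   NtasksPerTRES:0
"""

info = parse_scontrol(SAMPLE)
print(info["JobState"], info["RunTime"], info["TimeLimit"])
```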

The "seff" command(Slurm Job Efficiency Report)

This command reports the efficiency of jobs that have completed and left the queue. If you run it while the job is still in the R (Running) state, it may report incorrect information.
With the seff command, you can find the memory used, the percentage of allocated memory that was utilized, CPU information, and so on.

Syntax

seff <job-id>

Example

seff 769059

Output

Job ID: 769059
Cluster: discovery
User/Group: <user-name>/<group-name>
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 00:00:01
CPU Efficiency: 0.11% of 00:15:28 core-walltime
Job Wall-clock time: 00:00:58
Memory Utilized: 4.79 MB
Memory Efficiency: 4.79% of 100.00 MB
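
The percentages in this report follow from the other lines: core-walltime is the wall-clock time multiplied by the number of cores, and each efficiency is the used amount divided by the allocated amount. The sketch below reproduces that arithmetic for the numbers above; the function names are hypothetical, and the time parser handles only the HH:MM:SS form shown here.

```python
def hms_to_seconds(hms):
    """Convert 'HH:MM:SS' to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_efficiency(cpu_utilized, wall_clock, cores):
    """CPU time used as a percentage of wall-clock time times cores."""
    core_walltime = hms_to_seconds(wall_clock) * cores
    return 100.0 * hms_to_seconds(cpu_utilized) / core_walltime


def memory_efficiency(used_mb, requested_mb):
    """Memory used as a percentage of memory requested."""
    return 100.0 * used_mb / requested_mb


# Values from the seff output above: 16 cores, 58 s wall clock, 1 s CPU time.
core_walltime = hms_to_seconds("00:00:58") * 16  # 928 s, i.e. 00:15:28
print(core_walltime)
print(round(cpu_efficiency("00:00:01", "00:00:58", 16), 2))
print(round(memory_efficiency(4.79, 100.0), 2))
```

A CPU efficiency this low suggests that the job requested far more cores than it used.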

Terminating Jobs

The "scancel" or "skill" command

The scancel command is used to cancel your job in the queue, whether it is pending or running.

Syntax: scancel <jobid> or skill <jobid>

scancel 219373

skill 219373

Please note that a user can’t cancel the jobs of another user.

References

For more information about the sacct command in Slurm, refer to the official Slurm documentation Slurm’s sacct command
