Job Management
This page helps users manage their Slurm jobs, find detailed information about a job (such as memory usage and CPU allocation), and use job statistics to troubleshoot job failures.
Check Running/Queued Jobs
To check the running/queued jobs, type the squeue command.
squeue
Output
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
769059 interacti ci-6023_ crushev CG 0:18 1 discovery-c35
760164 normal SpringTe mccg R 2:59:50 1 discovery-c2
760168 normal test andrew PD 0:00 1 (Resources)
760170 backfill molecular kevin PD 0:00 1 (Priority)
760172 interacti ellipse_ crushev PD 0:00 1 (ReqNodeNotAvail)
This shows all jobs in the queue. To display only your jobs, run squeue -u <your-username>. The ST and NODELIST(REASON) fields can be used to track a job's progress and identify the reason for a job failure or pending resource allocation.
You can also use the watch squeue command for a live view of jobs entering and leaving the queue.
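For example, to list only your own jobs and refresh the view every 10 seconds (a minimal sketch; replace <your-username> with your Discovery username):
squeue -u <your-username>
watch -n 10 squeue -u <your-username>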
Job State
The ST column above gives the state of your job. Important state codes to be aware of are:
State Codes | Meaning
CD | Completed
CG | Completing
F | Failure
OOM | Out of Memory
PD | Pending
R | Running
TO | Timeout
- CD means that the job completed successfully, whereas CG denotes that the job is finishing.
- F denotes that the job terminated with a non-zero exit code or other failure condition.
- OOM indicates that the job experienced an out-of-memory error.
- PD denotes that the job is awaiting resource allocation for one of several reasons. You can use the NodeList(Reason) field to see why the job hasn't started.
- R indicates that the job is running and has the resources allocated as requested in the SBATCH directives.
- TO denotes that the job was terminated after reaching its time limit.
NodeList(Reason)
- NodeList(Reason) shows which nodes the job is currently running on. For jobs in the PD state, this field gives more information about why the job is pending.
- (Resources): The job is waiting for resources to become available. The cluster is too busy to run your job at this time; the job will begin once the requested resources become available.
- (Priority): The job is queued behind other, higher-priority jobs in the queue.
- (ReqNodeNotAvail): A node required for the job isn't currently available. The node may be in use, reserved for another job, part of an advance reservation, or reserved for maintenance.
- (QoSJobLimit): The job's QoS has reached its maximum job count. This occurs only if you submit more jobs than the QoS limit allows. To find the maximum number of jobs you can submit to a partition, refer to the QoS page.
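To see only your pending jobs together with the reason they are waiting, squeue can combine a state filter with a custom output format. A small sketch (replace <your-username> with your Discovery username; %T prints the state and %R the reason):
squeue -u <your-username> -t PD --format="%.10i %.9P %.20j %.8T %R"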
The "sacct" command
- The sacct command in Slurm reports useful information and statistics about a job. It can therefore be used to troubleshoot problems that occurred during job execution and led to a job failure.
- By default, the sacct command displays JobID, JobName, Partition, Account, AllocCPUS, State, and ExitCode.
- However, you can use the --format flag to display the fields you want.
- To find the list of fields that can be passed to the --format flag, run the following command:
sacct -e
Output
Account AdminComment AllocCPUS AllocNodes AllocTRES AssocID AveCPU AveCPUFreq AveDiskRead AveDiskWrite
AvePages AveRSS AveVMSize BlockID Cluster Comment Constraints ConsumedEnergy ConsumedEnergyRaw CPUTime
CPUTimeRAW DBIndex DerivedExitCode Elapsed ElapsedRaw Eligible End ExitCode Flags GID
Group JobID JobIDRaw JobName Layout MaxDiskRead MaxDiskReadNode MaxDiskReadTask MaxDiskWrite MaxDiskWriteNode
MaxDiskWriteTask MaxPages MaxPagesNode MaxPagesTask MaxRSS MaxRSSNode MaxRSSTask MaxVMSize MaxVMSizeNode MaxVMSizeTask
McsLabel MinCPU MinCPUNode MinCPUTask NCPUS NNodes NodeList NTasks Priority Partition
QOS QOSRAW Reason ReqCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ReqCPUS ReqMem ReqNodes
ReqTRES Reservation ReservationId Reserved ResvCPU ResvCPURAW Start State Submit Suspended
SystemCPU SystemComment Timelimit TimelimitRaw TotalCPU TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin
TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutAve TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutMin TRESUsageOutMinNode TRESUsageOutMinTask
TRESUsageOutTot UID User UserCPU WCKey WCKeyID WorkDir
- Important fields that can be used to diagnose job issues are:
Field | Description
JobID | ID of the job.
JobName | Name of the job.
AllocCPUS | Count of allocated CPUs. Equal to NCPUS.
ReqCPUS | Number of CPUs requested.
ReqMem | Minimum memory required for the job in MB. A trailing c denotes memory per CPU and a trailing n denotes memory per node.
AveRSS | Average memory use of all tasks in the job.
MaxRSS | Maximum memory use of any task in the job.
Start | Initiation time of the job, in the same format as End.
End | Termination time of the job.
Elapsed | Time taken by the job.
State | State of the job.
ExitCode | Exit code returned by the job.
For more information about the sacct command and how to pass the --format flag with the fields above to retrieve job statistics, see the separate documentation at Sacct Command.
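For example, to pull a few of the fields from the table above for a single job, a command of the following form can be used (replace <job-id> with your job's ID; the --unit=M option used elsewhere on this page reports memory values in megabytes):
sacct -j <job-id> --format=JobID,JobName,ReqCPUS,ReqMem,AveRSS,MaxRSS,Elapsed,State,ExitCode --unit=M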
Common Problems
Some common problems faced by Slurm users, and tips for diagnosing them with the help of the sacct command, are discussed below.
Out of Memory Issues
Jobs can fail when the memory a job actually needs exceeds the memory requested for it. Consider the example below.
#!/bin/bash
#SBATCH --job-name dataset-processing ## name that will show up in the queue
#SBATCH --output result.out ## filename of the output; the %j is equal to jobID; default is slurm-[jobID].out
#SBATCH --ntasks=1 ## number of tasks (analyses) to run
#SBATCH --cpus-per-task=1 ## the number of threads allocated to each task
#SBATCH --mem-per-cpu=50M # memory per CPU core
#SBATCH --partition=interactive ## the partitions to run in (comma separated)
#SBATCH --time=0-01:00:00 ## time for analysis (day-hour:min:sec)
## Load modules
module load anaconda
conda activate my_env
#Run the program
srun python script.py data/Human_Activity_Recognition_Using_Smartphones_Data.csv
In the script above, 1 node, 1 CPU, 50MB of memory per CPU, and 1 hour of wall time were requested for a single task. The output of the job will be stored in result.out, and the partition is set to interactive, which is ideal for debugging and short jobs.
After the resource request, the anaconda module is loaded using the module system on Discovery. The custom anaconda environment my_env, which has all the packages required to run the Python script script.py, is then activated. For a more detailed explanation of anaconda environments on Discovery, refer to the tutorials at Anaconda Virtual Environments.
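As a rough sketch of how such an environment could be set up beforehand (the module name anaconda and the environment name my_env come from the job script above; the package list is only inferred from the imports in script.py), run the following once on a login node:
module load anaconda
conda create -y -n my_env python pandas scikit-learn
The contents of script.py are shown next.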
#Importing packages.
import sys
import os
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
dataset = sys.argv[1]
if dataset == "data/Human_Activity_Recognition_Using_Smartphones_Data.csv":
    print("-----Processing the Human Activity Recognition Dataset-----\n")
else:
    print("Invalid dataset name entered")
    sys.exit(1)
df = pd.read_csv(dataset)
print(df.head(5))
#Data statistics
#Printing the number of rows and columns
print(df.info())
print("The number of rows\n")
print(len(df))
print("The number of columns\n")
print(len(df.columns))
print("Dataframe shape\n")
print(df.shape)
#Check for any N/A values
print("Checking for any N/A values\n")
print(df.isna().values.any())
#Check for any Null values
print("Checking for any null values\n")
print(df.isnull().values.any())
Y = pd.DataFrame(data=df['Activity'])
X = df.drop(['Activity'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X,Y,random_state=1, test_size=0.2)
print('-----------------------------')
print('DecisionTree Test was Called. Wait...')
depths= list(range(1, 31))
trainAccuracy = []
testAccuracy = []
for i in depths:
    clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=i, random_state=1)
    clf_gini.fit(X_train, y_train.values.ravel())
    y_pred_gini = clf_gini.predict(X_test)
    y_train_pred_gini = clf_gini.predict(X_train)
    #Storing the metrics
    trainAccuracy.append(accuracy_score(y_train, y_train_pred_gini))
    testAccuracy.append(accuracy_score(y_test, y_pred_gini))
print(trainAccuracy)
print(testAccuracy)
The script script.py performs preprocessing on the Activity Recognition dataset and runs a machine learning model. The path to the dataset is data/Human_Activity_Recognition_Using_Smartphones_Data.csv, and the folder structure is shown below.
project
|_ script.sh
|_ script.py
|_ data
   |_ Human_Activity_Recognition_Using_Smartphones_Data.csv
One key thing to note here is the size of the dataset: the file is 68M. The command below lists file sizes in a human-readable format (MB).
[username@discovery-l2 data]$ ls -lh
total 68M
-rw-r----- 1 <username> <username> 68M Jul 29 15:12 Human_Activity_Recognition_Using_Smartphones_Data.csv
The Python script takes a command-line argument, which is the relative path to the dataset file; if it isn't equal to data/Human_Activity_Recognition_Using_Smartphones_Data.csv, the script exits without running. The program preprocesses the dataset and runs the decision tree machine learning algorithm.
Submit the Job
To submit the job and see how it runs, run the following command.
sbatch script.sh
Submitted batch job 832679
Job Statistics
After the job exits the queue, the sacct command below reports the job statistics.
sacct -j 832679 --format=jobid,jobname,reqcpus,reqmem,averss,maxrss,elapsed,state%20,exitcode --unit=M
Output
JobID JobName ReqCPUS ReqMem AveRSS MaxRSS Elapsed State ExitCode
------------ ---------- -------- ---------- ---------- ---------- ---------- ---------- --------
832679 dataset-p+ 1 50Mc 00:00:33 OUT_OF_MEMORY 0:125
832679.batch batch 1 50Mc 4.02M 4.02M 00:00:33 OUT_OF_MEMORY 0:125
832679.exte+ extern 1 50Mc 0 0 00:00:33 OUT_OF_MEMORY 0:125
832679.0 python 1 50Mc 70.65M 70.65M 00:00:32 OUT_OF_MEMORY 0:125
Explanation
- The output above from the sacct command shows that the job ran into an out-of-memory problem, which can be inferred from the State field. The job script requested 50M of memory per CPU, but the dataset the Python script loads is 68M, which is larger than the memory requested for the job; that is why the job failed. This reasoning is confirmed by the values reported in the AveRSS and MaxRSS fields: AveRSS is the average memory (RAM) used by the process and MaxRSS is the maximum (peak) memory used by the process. Slurm's accounting mechanism captures these statistics and makes them available to users through the sacct command. The MaxRSS field reports 70.65M for the job step (832679.0), which is greater than the 50M of memory requested for the job.
- Hence, the AveRSS and MaxRSS fields are very handy for troubleshooting jobs that failed or were cancelled due to out-of-memory problems.
- The output file result.out is also generated for the job and can be used to diagnose problems when a job fails, because Slurm writes the reason for the failure to the output file as well.
- Opening the generated result.out file shows the following information:
vi result.out
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=832679.0 cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: discovery-c34: task 0: Out Of Memory
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=832679.batch cgroup. Some of your processes may have been killed by the cgroup out-of-memory handler.
- The output above means that Slurm detected the job using more memory than was requested for it.
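To focus on the memory numbers for just the failed step, sacct also accepts a job-step ID; for example, reusing job 832679 from above:
sacct -j 832679.0 --format=JobID,ReqMem,AveRSS,MaxRSS,State --unit=M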
Solution
Increasing --mem-per-cpu to 100M or 150M should be enough for the Python script to process the dataset and for the job to complete successfully.
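For example, only the memory line in script.sh needs to change; 150M is a hedged choice here, leaving headroom above the 70.65M peak reported by MaxRSS:
#SBATCH --mem-per-cpu=150M # memory per CPU core, raised from 50M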
Time Out Issues
Consider the Slurm script script.sh below.
#!/bin/bash
#SBATCH --job-name test ## name that will show up in the queue
#SBATCH --output result.out ## filename of the output; the %j is equal to jobID; default is slurm-[jobID].out
#SBATCH --ntasks=1 ## number of tasks (analyses) to run
#SBATCH --cpus-per-task=1 ## the number of cpus allocated to each task
#SBATCH --mem-per-cpu=50M # memory per CPU core
#SBATCH --partition=interactive ## the partitions to run in (comma separated)
#SBATCH --time=0-00:01:00 ## time for analysis (day-hour:min:sec)
srun echo "Starting Process"
srun sleep 180
srun hostname
srun echo "Ending Process"
Explanation of Job Script
- In the script above, 1 node, 1 CPU, 50MB of memory per CPU, and 1 minute of wall time for the tasks (job steps) were requested. Note that all the job steps that begin with the srun command execute sequentially, as one task on a single CPU. The output for the job will be stored in result.out, which is generated inside the directory where the job script script.sh is located.
- The first job step runs the Linux echo command and outputs "Starting Process". The second job step executes the Linux sleep command for 180 seconds. The third job step echoes the hostname of the compute node that executed the job. The final job step simply echoes "Ending Process". Note that these job steps execute sequentially, not in parallel.
- One key thing to note about this job script is that the wall time for the job is set to 0-00:01:00, which equals 1 minute, whereas job step 2 needs to execute the Linux sleep command for 180 seconds.
Submit the Job
Submit the job with the following command and see how it runs.
sbatch script.sh
Submitted batch job 822800
Watch Live Status of the Job
To watch the live status of the job, run the watch squeue command:
watch squeue -u <discovery-username>
Squeue Command Output
Every 2.0s: squeue -u <discovery-username> Tue Aug 24 23:04:30 2021
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
822800 interacti test <discovery-username> R 0:24 1 discovery-c34
To exit the live status of the watch squeue command, press Ctrl + C.
After the job exits the queue, run the sacct command below to check the status of the job.
sacct -j 822800 --format=jobid,jobname,elapsed,state,exitcode
Sacct Command Output
JobID JobName Elapsed State ExitCode
------------ ---------- ---------- ---------- --------
822800 test 00:01:17 TIMEOUT 0:0
822800.batch batch 00:01:19 CANCELLED 0:15
822800.exte+ extern 00:01:17 COMPLETED 0:0
822800.0 echo 00:00:01 COMPLETED 0:0
822800.1 sleep 00:01:18 CANCELLED 0:15
Explanation
- The statistics for the job are displayed in the output above. The first row (822800) represents the job as a whole, while the .batch and .extern rows are default job steps that Slurm creates for the batch script itself and for processes external to the job's steps. The fourth row shows the first srun process (822800.0), the echo command, which completed successfully and took 00:00:01, i.e. one second. However, the second srun process (822800.1), the Linux sleep command, was cancelled. This is because the sleep command was set to run for 180 seconds, which exceeds the 1-minute wall time specified for the job. Because this step could not finish in time, the job as a whole could not complete, and the first row's (822800) State column reports TIMEOUT, which is the reason Slurm cancelled the job.
Solution
- To make the job complete all of its job steps successfully, set the wall time of the job to #SBATCH --time=0-00:05:00, which equals 5 minutes and is sufficient for the job to complete successfully.
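If you are unsure what wall time to request for a similar job, comparing the Elapsed and Timelimit fields of a previous run can help; for example, reusing job 822800 from above:
sacct -j 822800 --format=JobID,JobName,Elapsed,Timelimit,State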
Additional Insights
- The generated output file result.out can also be used to troubleshoot the job failure. Opening the result.out file shows the following:
vi result.out
Starting Process
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** JOB 822800 ON discovery-c34 CANCELLED AT 2021-08-24T23:45:42 DUE TO TIME LIMIT ***
slurmstepd: error: *** STEP 822800.1 ON discovery-c34 CANCELLED AT 2021-08-24T23:45:42 DUE TO TIME LIMIT ***
- The first line is the output of the first job step (srun echo "Starting Process"). Because the second job step (822800.1), shown in the last line, ran out of time, Slurm cancelled it, and the reason for the cancellation is given as DUE TO TIME LIMIT. Hence, the whole job (822800) was cancelled by Slurm.
The "scontrol" command
- The scontrol command can also be used to get detailed information about a job; it works for pending, running, and recently completed jobs.
Syntax
scontrol show job <job-id>
Output
UserId=crushev(723778) GroupId=crushev(723778) MCS_label=N/A
Priority=191096 Nice=0 Account=nmsu QOS=normal
JobState=COMPLETED Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:00:02 TimeLimit=00:10:00 TimeMin=N/A
SubmitTime=2021-08-13T01:43:43 EligibleTime=2021-08-13T01:43:43
AccrueTime=2021-08-13T01:43:43
StartTime=2021-08-13T01:43:43 EndTime=2021-08-13T01:43:45 Deadline=N/A
PreemptEligibleTime=2021-08-13T01:43:43 PreemptTime=None
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-08-13T01:43:43
Partition=normal AllocNode:Sid=discovery-l2:177992
ReqNodeList=(null) ExcNodeList=(null)
NodeList=discovery-c3
BatchHost=discovery-c3
NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=1,mem=100M,node=1,billing=1
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryCPU=100M MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
Command=/home/crushev/sample-python/script.sh
WorkDir=/home/crushev/sample-python
StdErr=/home/crushev/sample-python/maxFib.out
StdIn=/dev/null
StdOut=/home/crushev/sample-python/maxFib.out
Power=
NtasksPerTRES:0
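Because scontrol prints many fields, it is often convenient to filter the output with grep; a small sketch:
scontrol show job <job-id> | grep -E "JobState|RunTime|TimeLimit|NodeList"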
The "seff" command(Slurm Job Efficiency Report)
- This command can be used to generate a job efficiency report for jobs that have completed and exited the queue. If you run it while the job is still in the R (Running) state, it might report incorrect information.
- Using the seff command, you can find the memory used, the percentage of allocated memory utilized, CPU information, and so on.
Syntax
seff <job-id>
Example
seff 769059
Output
Job ID: 769059
Cluster: discovery
User/Group: <user-name>/<group-name>
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 16
CPU Utilized: 00:00:01
CPU Efficiency: 0.11% of 00:15:28 core-walltime
Job Wall-clock time: 00:00:58
Memory Utilized: 4.79 MB
Memory Efficiency: 4.79% of 100.00 MB
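For this example, the percentages follow directly from the raw numbers above: the core-walltime is 16 cores × 00:00:58 of wall-clock time = 00:15:28 (928 seconds), so CPU efficiency is 1 s / 928 s ≈ 0.11%, and memory efficiency is 4.79 MB used out of the 100 MB requested ≈ 4.79%.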
References
For more information about the sacct command in Slurm, refer to the official Slurm documentation: Slurm's sacct command.