Showing Job Statistics

Showing Information on Jobs

The sacct command

To view the statistics of a completed job, use Slurm’s sacct command.

syntax: sacct -j <job id> or sacct -j <job id> --format=<params>

$ sacct -j 215578

Output

       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
215578           maxFib     normal       nmsu          1  COMPLETED      0:0
215578.batch      batch                  nmsu          1  COMPLETED      0:0
215578.0         python                  nmsu          1  COMPLETED      0:0

You can get statistics (accounting data) on completed jobs by passing either a job ID (-j) or a username (-u) to sacct. Here, the command sacct -j 215578 shows statistics about the completed job, such as the partition the job ran on, the account, and the number of CPUs allocated to each job step. It also displays the exit code and state (COMPLETED, PENDING, FAILED, and so on) of the job and of each of its job steps.

The first column lists the ID of the job and of each of its job steps. The first two rows appear by default: the first describes the job as a whole, and the second (215578.batch) describes the batch step, which runs the job script itself. The third row, 215578.0, describes the first process launched with srun. If the script contained more srun commands, the step IDs would increment: 215578.1, 215578.2, and so on.
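
The naming scheme for job step IDs can be sketched in a few lines of Python. This is only an illustration of the pattern described above; the function name classify_job_id and the labels it returns are hypothetical and are not part of Slurm.

```python
from __future__ import annotations


def classify_job_id(job_id: str) -> tuple[str, str]:
    """Split a sacct JobID value into (base job ID, kind of record).

    The labels returned here are illustrative names, not Slurm terminology.
    """
    base, _, step = job_id.partition(".")
    if not step:
        kind = "job"        # the row for the job as a whole
    elif step == "batch":
        kind = "batch"      # the batch step
    elif step.startswith("ext"):
        kind = "extern"     # printed as "exte+" when the column is truncated
    elif step.isdigit():
        kind = "srun step"  # 0, 1, 2, ... one per srun command
    else:
        kind = "other"
    return base, kind


for row in ["215578", "215578.batch", "215578.0"]:
    print(row, classify_job_id(row))
```

Grouping rows by the base job ID this way makes it easy to collect all steps that belong to one job.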

  • You can also pass other parameters to the sacct command to retrieve extra details about the job.

    $ sacct -j 215578 --format=JobID,Start,End,Elapsed,NCPUS

    Output

           JobID               Start                 End    Elapsed      NCPUS
    ------------ ------------------- ------------------- ---------- ----------
    215578       2020-09-04T09:53:11 2020-09-04T09:53:11   00:00:00          1
    215578.batch 2020-09-04T09:53:11 2020-09-04T09:53:11   00:00:00          1
    215578.0     2020-09-04T09:53:11 2020-09-04T09:53:11   00:00:00          1

    In the output above, you can see the start and end timestamps, the number of CPUs, and the elapsed time of the job and of each job step.
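
If you post-process these fields in a script, the Start, End, and Elapsed values can be converted to Python time objects. The sketch below assumes the timestamp format shown above and an Elapsed value of the form [DD-]HH:MM:SS; the helper names parse_elapsed and wall_time are hypothetical.

```python
from datetime import datetime, timedelta


def parse_elapsed(text: str) -> timedelta:
    """Convert an Elapsed value such as '00:00:00' or '1-02:03:04'."""
    days = 0
    if "-" in text:
        # An optional leading "DD-" part counts whole days.
        day_part, text = text.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def wall_time(start: str, end: str) -> timedelta:
    """Compute End minus Start from two sacct timestamps."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


# Values taken from the first row of the output above.
start, end, elapsed = "2020-09-04T09:53:11", "2020-09-04T09:53:11", "00:00:00"
print(parse_elapsed(elapsed) == wall_time(start, end))
```

Comparing the two values is a quick sanity check that the Elapsed column agrees with the timestamps.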

  • You can also retrieve information about jobs that ran during a given period by passing a start time flag (--starttime), an end time flag (--endtime), or both:

    $ sacct --starttime=2020-09-01 --format=jobid,jobname,exitcode,group,maxrss,partition,nnodes,alloccpus,state

    Output

           JobID    JobName ExitCode     Group     MaxRSS  Partition   NNodes  AllocCPUS      State
    ------------ ---------- -------- --------- ---------- ---------- -------- ---------- ----------
    213974             test      0:0   vaduaka                normal        1          3  COMPLETED
    213974.batch      batch      0:0                    0                   1          3  COMPLETED
    213974.exte+     extern      0:0                    0                   1          3  COMPLETED
    213974.0         python      0:0                    0                   1          1  COMPLETED
    213974.1         python      0:0                    0                   1          1  COMPLETED
    213974.2         python      0:0                    0                   1          1  COMPLETED
    215576           maxFib      0:0   vaduaka                normal        1          1  COMPLETED
    215576.batch      batch      0:0                    0                   1          1  COMPLETED
    215576.exte+     extern      0:0                  88K                   1          1  COMPLETED
    215576.0         python      0:0                    0                   1          1  COMPLETED
    215577           maxFib      0:0   vaduaka                normal        1          1  COMPLETED
    215577.batch      batch      0:0                    0                   1          1  COMPLETED
    215577.exte+     extern      0:0                  84K                   1          1  COMPLETED
    215577.0         python      0:0                    0                   1          1  COMPLETED
    215578           maxFib      0:0   vaduaka                normal        1          1  COMPLETED
    215578.batch      batch      0:0                    0                   1          1  COMPLETED
    215578.exte+     extern      0:0                    0                   1          1  COMPLETED
    215578.0         python      0:0                    0                   1          1  COMPLETED
    215665           maxFib      0:0   vaduaka                normal        1          1  COMPLETED
    215665.batch      batch      0:0                    0                   1          1  COMPLETED
    215665.exte+     extern      0:0                  92K                   1          1  COMPLETED
    215665.0         python      0:0                    0                   1          1  COMPLETED

    In the output above, you can see every job that ran since the given start time, together with its job steps. For each row, the output shows the job name, exit code, user group, and maximum resident set size (MaxRSS, the peak amount of RAM used by any task in that step), followed by the partition, the number of nodes used, the number of allocated CPUs, and the state.
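
To compare memory use across jobs, the MaxRSS column can be converted to bytes and reduced to one peak value per job. The following is a minimal sketch, assuming that the K, M, G, and T suffixes are powers of 1024 and that a value without a suffix is a plain byte count; both are assumptions, and the function names are hypothetical.

```python
# Assumption: suffixes are binary multiples (1K = 1024 bytes).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def maxrss_to_bytes(value: str) -> int:
    """Convert a MaxRSS string such as '88K' to bytes; empty means 0."""
    value = value.strip()
    if not value:
        return 0
    suffix = value[-1].upper()
    if suffix in UNITS:
        return int(float(value[:-1]) * UNITS[suffix])
    return int(float(value))  # assumption: no suffix means bytes


def peak_rss_per_job(rows):
    """Return {base job ID: largest MaxRSS in bytes over all its steps}."""
    peaks = {}
    for job_id, maxrss in rows:
        base = job_id.split(".", 1)[0]
        peaks[base] = max(peaks.get(base, 0), maxrss_to_bytes(maxrss))
    return peaks


# (JobID, MaxRSS) pairs copied from the output above.
rows = [
    ("215576", ""), ("215576.batch", "0"),
    ("215576.exte+", "88K"), ("215576.0", "0"),
    ("215577", ""), ("215577.batch", "0"),
    ("215577.exte+", "84K"), ("215577.0", "0"),
]
print(peak_rss_per_job(rows))
```

The row for the job as a whole has an empty MaxRSS, which is why the sketch treats an empty string as zero instead of raising an error.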

    For more details about sacct, see its manual page:

    $ man sacct

    To view the list of fields you can pass to --format to retrieve specific job details, use the sacct -e command.

    $ sacct -e

    Output

        Account             AdminComment        AllocCPUS           AllocGRES
        AllocNodes          AllocTRES           AssocID             AveCPU
        AveCPUFreq          AveDiskRead         AveDiskWrite        AvePages
        AveRSS              AveVMSize           BlockID             Cluster
        Comment             Constraints         ConsumedEnergy      ConsumedEnergyRaw
        CPUTime             CPUTimeRAW          DBIndex             DerivedExitCode
        Elapsed             ElapsedRaw          Eligible            End
        ExitCode            Flags               GID                 Group
        JobID               JobIDRaw            JobName             Layout
        MaxDiskRead         MaxDiskReadNode     MaxDiskReadTask     MaxDiskWrite
        MaxDiskWriteNode    MaxDiskWriteTask    MaxPages            MaxPagesNode
        MaxPagesTask        MaxRSS              MaxRSSNode          MaxRSSTask
        MaxVMSize           MaxVMSizeNode       MaxVMSizeTask       McsLabel
        MinCPU              MinCPUNode          MinCPUTask          NCPUS
        NNodes              NodeList            NTasks              Priority
        Partition           QOS                 QOSRAW              Reason
        ReqCPUFreq          ReqCPUFreqMin       ReqCPUFreqMax       ReqCPUFreqGov
        ReqCPUS             ReqGRES             ReqMem              ReqNodes
        ReqTRES             Reservation         ReservationId       Reserved
        ResvCPU             ResvCPURAW          Start               State
        Submit              Suspended           SystemCPU           SystemComment
        Timelimit           TimelimitRaw        TotalCPU            TRESUsageInAve
        TRESUsageInMax      TRESUsageInMaxNode  TRESUsageInMaxTask  TRESUsageInMin
        TRESUsageInMinNode  TRESUsageInMinTask  TRESUsageInTot      TRESUsageOutAve
        TRESUsageOutMax     TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutMin
        TRESUsageOutMinNode TRESUsageOutMinTask TRESUsageOutTot     UID
        User                UserCPU             WCKey               WCKeyID
        WorkDir
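
Any of the field names listed above can be combined in a --format list. As a rough sketch of how a script might do that, the Python below builds a sacct command for a chosen set of fields and parses the result. It relies on sacct's --parsable2 option (fields separated by "|") and --noheader option; the function names and the sample text are illustrative assumptions, and query_job only works on a machine where sacct is installed.

```python
import subprocess


def build_sacct_command(job_id, fields):
    """Build the argument list for sacct with pipe-delimited output."""
    return [
        "sacct", "-j", str(job_id),
        "--format=" + ",".join(fields),
        "--parsable2",  # "|" between fields, none at the end of the line
        "--noheader",   # skip the header row
    ]


def parse_parsable2(text, fields):
    """Turn pipe-delimited sacct output into a list of dictionaries."""
    records = []
    for line in text.splitlines():
        if line.strip():
            records.append(dict(zip(fields, line.split("|"))))
    return records


def query_job(job_id, fields):
    """Run sacct and parse its output (requires a Slurm installation)."""
    result = subprocess.run(
        build_sacct_command(job_id, fields),
        capture_output=True, text=True, check=True,
    )
    return parse_parsable2(result.stdout, fields)


# Sample text shaped like --parsable2 output, using values shown earlier.
fields = ["JobID", "State", "ExitCode"]
sample = "215578|COMPLETED|0:0\n215578.batch|COMPLETED|0:0\n215578.0|COMPLETED|0:0\n"
for record in parse_parsable2(sample, fields):
    print(record)
```

Pipe-delimited output avoids the column truncation seen in the default layout (for example, exte+), which makes it the safer choice for scripts.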