Partitions in Discovery
What are Partitions?
A partition is a work queue with its own set of rules and policies and its own group of compute nodes on which jobs run. The partitions available in Discovery include normal, interactive, and backfill, among others. Run the command below to list the available partitions.
Syntax: sinfo
$ sinfo
Output:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal* up 7-01:00:00 30 mix discovery-c[2-14,16-25,28,31],discovery-g[2,5,8-10]
normal* up 7-01:00:00 13 alloc discovery-c[15,26-27,29-30,32-33,37-38],discovery-g[1,6,11,16]
normal* up 7-01:00:00 4 idle discovery-g[3-4,12-13]
interactive up 1-01:00:00 1 mix discovery-c34
interactive up 1-01:00:00 3 idle discovery-c35,discovery-g[14-15]
backfill up 14-02:00:0 30 mix discovery-c[2-14,16-25,28,31],discovery-g[2,5,8-10]
backfill up 14-02:00:0 15 alloc discovery-c[15,26-27,29-30,32-33,36-38],discovery-g[1,6-7,11,16]
backfill up 14-02:00:0 4 idle discovery-g[3-4,12-13]
The output lists all the partitions available in Discovery as of February 2024.
The state alloc denotes that the nodes are allocated to jobs.
The state mix implies that some CPUs on the nodes are allocated while others remain idle.
The state idle means that no jobs are currently running on the nodes.
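To see at a glance where free nodes are, the state column can be summed per partition. The sketch below is a minimal example, not an official tool: the function name idle_nodes_by_partition is hypothetical, and it only counts rows whose state is exactly idle.

```shell
# Sum the idle node count per partition from "PARTITION NODES STATE" lines.
# Reads standard input, so it can be tried on pasted text as well.
idle_nodes_by_partition() {
  awk '$3 == "idle" { gsub(/\*/, "", $1); idle[$1] += $2 }
       END { for (p in idle) print p, idle[p] }' | sort
}

# On the cluster: -h drops the header; -o selects partition (%P),
# node count (%D), and compact state (%t).
if command -v sinfo >/dev/null 2>&1; then
  sinfo -h -o '%P %D %t' | idle_nodes_by_partition
fi
```

The asterisk that marks the default partition is stripped so that normal* is reported as normal.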
Some partitions in Discovery are condo partitions, which are restricted to certain researchers and lab groups.
normal
This is the default queue. Key details about the normal partition are listed below.
Parameter | Values |
---|---|
Maximum Walltime | 7-01:00:00 (7 days and 1 hour) |
Nodes | discovery-c[1-33, 37-38], discovery-g[1-6, 8-13, 16] |
Total Nodes | 33 |
Maximum Jobs (Running) | 10 |
Maximum Submitted Jobs | 20 |
Maximum Jobs (Running) is the highest number of jobs that can run at the same time in a partition. Maximum Submitted Jobs is the highest number of jobs you can have submitted to a partition at once. In the normal partition, you can submit 20 jobs, but only 10 will run at a time; the remaining 10 wait in the queue.
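As a rough illustration of how the two limits interact, the sketch below splits a batch of submissions into running, pending, and rejected counts. The function name job_split is hypothetical, the counts are upper bounds that assume enough free nodes, and treating submissions beyond the submit limit as rejected is an assumption about how the cap is enforced.

```shell
# Split a number of submissions into running / pending / rejected counts
# under a running-job cap and a submitted-job cap.
job_split() {
  local wanted=$1 max_running=$2 max_submit=$3
  local accepted=$(( wanted < max_submit ? wanted : max_submit ))
  local running=$(( accepted < max_running ? accepted : max_running ))
  echo "running=$running pending=$(( accepted - running )) rejected=$(( wanted - accepted ))"
}

# Limits of the normal partition from the table: 10 running, 20 submitted.
job_split 20 10 20   # prints: running=10 pending=10 rejected=0
job_split 25 10 20   # prints: running=10 pending=10 rejected=5
```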
interactive
This partition is ideal for running interactive jobs.
To learn more about running interactive jobs, refer to the Interactive Jobs in Discovery page.
Parameter | Values |
---|---|
Maximum Walltime | 1-01:00:00 (1 day and 1 hour) |
Nodes | discovery-c[34-35], discovery-g[14-15] |
Total Nodes | 4 |
Maximum Jobs (Running) | 3 |
Maximum Submitted Jobs | 3 |
Maximum CPUs per Job | 16 |
Maximum Memory per Job | 64G |
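The per-job limits above can be checked before a session is requested. The following is a minimal sketch, assuming a single-task session started with srun and a bash shell; the helper name interactive_cmd is hypothetical, and the Interactive Jobs in Discovery page remains the authoritative reference for the recommended command.

```shell
# Print an srun command for the interactive partition, refusing requests
# above the per-job limits in the table (16 CPUs, 64G of memory).
interactive_cmd() {
  local cpus=$1 mem_gb=$2 walltime=${3:-01:00:00}
  if [ "$cpus" -gt 16 ] || [ "$mem_gb" -gt 64 ]; then
    echo "request exceeds interactive limits (16 CPUs, 64G)" >&2
    return 1
  fi
  echo "srun --partition=interactive --cpus-per-task=$cpus --mem=${mem_gb}G --time=$walltime --pty bash"
}

# Example: 4 CPUs and 8G of memory for 2 hours.
interactive_cmd 4 8 02:00:00
```

The function only prints the command, so it can be reviewed before it is run on a login node.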
backfill
This partition scavenges nodes from all the other partitions. It has the lowest priority of all the partitions, so jobs submitted to the backfill partition may be stopped and requeued multiple times, depending on the demand from higher-priority jobs.
For more information about backfill and checkpoints, refer to the Backfill and Checkpoints in Discovery page.
Parameter | Values |
---|---|
Maximum Walltime | 14-02:00:00 (14 days and 2 hours) |
Nodes | discovery-c[1-33, 36-38], discovery-g[1-13, 16] |
Total Nodes | 54 |
Maximum Jobs (Running) | 10 |
Maximum Submitted Jobs | 20 |
The HPC team is exploring ways to get the most out of the backfill queue, and your suggestions are always welcome.
Condo Partitions
Some partitions in Discovery are condo partitions, which are restricted to a particular team or research group. New partitions are added over time; the table below lists the current condo partitions.
Partition | Owned By |
---|---|
 | Dr. Brungard’s Lab |
 | Dr. Gross’s Lab |
 | Dr. Boucheron’s Lab |
Scontrol Show
To view more information about a partition, such as AllowGroups, AllowAccounts, MaxNodes, and QoS, use the scontrol show command.
Syntax: scontrol show partition <partition-name>
scontrol show partition normal
Output:
PartitionName=normal
AllowGroups=discovery-users_normal,pkgmgr AllowAccounts=ALL AllowQos=ALL
AllocNodes=ALL Default=YES QoS=p-normal
DefaultTime=00:01:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=7-01:00:00 MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=discovery-c[1-15,26-31],discovery-g[1,12-13,16]
PriorityJobFactor=1 PriorityTier=25 RootOnly=NO ReqResv=NO OverSubscribe=FORCE:1
OverTimeLimit=NONE PreemptMode=SUSPEND
State=UP TotalCPUs=1216 TotalNodes=25 SelectTypeParameters=NONE
JobDefaults=DefCpuPerGPU=4
DefMemPerCPU=512 MaxMemPerNode=UNLIMITED
TRES=cpu=1116,mem=5385600M,node=25,billing=1116,gres/gpu=16
The output above shows detailed information about the normal partition. It also shows the partition's QoS, QoS=p-normal, which is discussed in detail in the next section.
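When only one value is needed, such as the QoS name or the time limit, it can be picked out of the Key=Value output. The sketch below is one possible approach; the function name partition_field is hypothetical, and it assumes values contain no spaces, as in the output shown above.

```shell
# Print the value of one Key=Value field from `scontrol show partition`
# output supplied on standard input.
partition_field() {
  tr -s ' ' '\n' |
    awk -F= -v key="$1" '$1 == key { sub(/^[^=]*=/, ""); print; exit }'
}

# On the cluster, pipe the real output through the helper.
if command -v scontrol >/dev/null 2>&1; then
  scontrol show partition normal | partition_field QoS
  scontrol show partition normal | partition_field MaxTime
fi
```

Splitting on spaces first puts every field on its own line, so the key is matched exactly rather than as a substring of another field.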
Partition QoS
Every partition has a Quality of Service (QoS) that defines parameters such as MaxJobs and MaxSubmitJobs for that partition. These limits affect the jobs that users submit to the partition. The QoS for the normal partition is p-normal, as shown in the scontrol show partition normal output above. Run the command below to find more details about the QoS.
Syntax: sacctmgr show qos where name=<qos-name> format=<header1,header2,...>
sacctmgr show qos where name=p-normal format=name,maxJobs,maxSubmit
Output:
      Name MaxJobs MaxSubmit
---------- ------- ---------
  p-normal      10        20
The output shows some of the parameters defined in the p-normal QoS for the normal partition. MaxJobs is set to 10, which means that you can have at most 10 jobs in the running state in the normal partition. MaxSubmit is set to 20, which means that you can submit up to 20 jobs to the normal partition. However, only 10 will be running at a time, and the other 10 wait in the queue.
Similarly, a different QoS is defined for every partition on the HPC cluster.
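For scripting, the same query is easier to handle in pipe-delimited form. The following sketch assumes the sacctmgr options -n (no header) and -P (pipe-delimited fields without a trailing delimiter); the function name describe_qos is hypothetical.

```shell
# Turn one "name|maxjobs|maxsubmit" line into a readable sentence.
describe_qos() {
  awk -F'|' '{ printf "%s: up to %s running, %s submitted\n", $1, $2, $3 }'
}

# On the cluster: query the QoS of the normal partition.
if command -v sacctmgr >/dev/null 2>&1; then
  sacctmgr -n -P show qos where name=p-normal format=name,maxjobs,maxsubmit | describe_qos
fi
```

Replacing p-normal with the QoS name reported by scontrol show partition for another partition gives that partition's limits.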
Can You Submit Jobs to Condo Partitions?
You can submit jobs to a condo partition only if you belong to the research group that owns it and you are authorized to use it. Without that permission, you cannot submit jobs to the condo partition. The alternative is to use the backfill partition, which makes the nodes from all the partitions available for use.
Whenever you use the backfill partition, add appropriate checkpoints to your code, because jobs submitted to backfill may be stopped and requeued multiple times. For information about backfill and code checkpoints, refer to the Backfill and Checkpoints in Discovery page.
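The idea behind a checkpoint is that a requeued job continues from its last completed step instead of starting over. The batch script below is a minimal sketch of that pattern, not the method documented on the Backfill and Checkpoints page: the job name, the checkpoint file name, the step count, and the function run_steps are all hypothetical, and whether --requeue is needed on Discovery is an assumption.

```shell
#!/bin/bash
#SBATCH --partition=backfill
#SBATCH --requeue              # assumption: let Slurm requeue the job after it is stopped
#SBATCH --job-name=ckpt-demo   # hypothetical job name

# Run steps 1..total, resuming after the last step recorded in the
# checkpoint file if that file already exists.
run_steps() {
  local ckpt=$1 total=$2 step=1
  if [ -f "$ckpt" ]; then
    step=$(( $(cat "$ckpt") + 1 ))
  fi
  while [ "$step" -le "$total" ]; do
    echo "running step $step"   # placeholder for the real work
    echo "$step" > "$ckpt"      # record progress only after the step finishes
    step=$(( step + 1 ))
  done
}

# Hypothetical checkpoint file and step count.
run_steps "${CKPT_FILE:-checkpoint.txt}" 5
```

If the job is stopped after step 3 and later restarted, the script reads 3 from the checkpoint file and resumes at step 4.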
How to Switch Between Partitions for Job Submission?
By default, if you don’t specify a partition name in the batch script, the job is submitted to the normal partition. For more information about switching between partitions when submitting jobs, refer to the Introduction to Slurm page in Discovery.
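As a brief illustration, a partition is usually selected with the --partition directive in the batch script. The script below is a sketch with a hypothetical job name and function name; it prints the partition Slurm assigned by reading the SLURM_JOB_PARTITION environment variable.

```shell
#!/bin/bash
#SBATCH --partition=backfill        # omit this line to use the default partition
#SBATCH --time=00:10:00
#SBATCH --job-name=partition-demo   # hypothetical job name

# Report the partition the job was placed in.
report_partition() {
  echo "partition: ${SLURM_JOB_PARTITION:-none (not running under Slurm)}"
}

report_partition
```

The same option can be given on the command line, for example sbatch --partition=backfill job.sh (job.sh being a placeholder script name), and a command-line option takes precedence over the directive in the script.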
References
For more information about partition QoS, refer to the following link https://slurm.schedmd.com/qos.html