Features
Nodes in Discovery have feature tags assigned to them. Each node is tagged with features based on its manufacturer, hyperthreading, processor name, processor generation, GPU capability, GPU name, GPU name with GPU memory amount, and hybrid memory. Users can select nodes for their jobs based on these feature tags by passing the `--constraint` flag to `sbatch` or `srun`.
Feature Tags
The table below lists the feature tags assigned to each node in Discovery.
| Nodes | Available Features |
| --- | --- |
| discovery-c[1-6] | intel, ht, haswell, E5-2640V3 |
| discovery-c[7-15] | intel, ht, broadwell, E5-2650V4 |
| discovery-c[16-25] | intel, ht, skylake, xeon-gold-5117 |
| discovery-c[26-35] | intel, ht, cascade-lake, xeon-gold-6226r |
| discovery-c36 | intel, ht, cascade-lake, xeon-gold-5218t |
| discovery-c37 | intel, ht, cascade-lake, xeon-gold-5218, optane, optane-mem |
| discovery-c38 | intel, ht, cascade-lake, xeon-gold-5218 |
| discovery-g1 | intel, ht, haswell, E5-2640V3, gpu, k40m, k40m-11g |
| discovery-g[2-6] | intel, ht, skylake, xeon-gold-5117, gpu, p100, p100-16g |
| discovery-g7 | intel, ht, skylake, xeon-gold-5120, gpu, v100, v100-16g |
| discovery-g[8-11] | intel, ht, cascade-lake, xeon-gold-5218, gpu, v100, v100-32g |
| discovery-g[12-13] | amd, ht, rome, epyc-7282, gpu, a100, a100-40g |
| discovery-g[14-15] | amd, ht, rome, epyc-7282, gpu, mig, a100_1g.5gb |
| discovery-g16 | intel, ht, skylake, xeon-gold-5118, gpu, t4, t4-16g |
Tag Information
The table below lists the features associated with each tag category.
| Category | Features |
| --- | --- |
| Manufacturer | intel, amd |
| Hyperthreading | ht |
| Processor Generation | haswell, broadwell, skylake, cascade-lake, rome |
| Processor Name | E5-2640V3, E5-2650V4, xeon-gold-5117, xeon-gold-6226r, xeon-gold-5218t, xeon-gold-5218, xeon-gold-5120, xeon-gold-5118, epyc-7282 |
| GPU Capability | gpu, mig |
| GPU Name | k40m, p100, v100, a100, t4 |
| GPU Name with Memory | k40m-11g, p100-16g, v100-16g, v100-32g, a100-40g, a100_1g.5gb, t4-16g |
| Hybrid Memory | optane, optane-mem |
scontrol Command
You can also use the following `scontrol` command to find the feature tags of a node.
Syntax

```shell
scontrol show node discovery-c1 | egrep "NodeName|AvailableFeatures"
```

Output

```
NodeName=discovery-c1 Arch=x86_64 CpuBind=threads CoresPerSocket=8
AvailableFeatures=intel,ht,haswell,E5-2640V3
```
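To survey the feature tags across all nodes at once, `sinfo` can print each node next to its available features. This is a sketch: `%n` (node hostname) and `%f` (available features) are standard `sinfo` format specifiers, but check `man sinfo` on your system.

```shell
# Print every node's hostname alongside its available feature tags,
# one line per node, with duplicates collapsed.
sinfo --Node --noheader --format="%n %f" | sort -u
```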
Examples
The examples below run simple jobs using `srun` and explain how to use the node feature tags to select the desired nodes for your jobs.
Consider the example below.
```shell
[username@discovery-l2 ~]$ srun -n 1 -p normal hostname
```

Output

```
srun: job 786821 queued and waiting for resources
srun: job 786821 has been allocated resources
discovery-c9
```
Explanation

- The above `srun` command prints the hostname (discovery-c9) of the compute node where the job ran on the normal partition.
- However, with the above command you can't choose a compute node with specific features (processor generation, processor name, and so on) for the job to run on.
- With the Slurm feature tags associated with each node, you can select a node with certain features for the job using the `-C` or `--constraint` flag.
- If you want the job to run on a compute node with the haswell processor generation on the normal partition, run the `srun` command below.
```shell
srun -n 1 -p normal --constraint=haswell hostname
```

Output

```
srun: job 881257 queued and waiting for resources
srun: job 881257 has been allocated resources
discovery-c4
```
Explanation

The above `srun` command prints the hostname of the compute node, discovery-c4, which has haswell as its processor generation. Similarly, you can add various features using the `--constraint` flag to request a specific node for your job. Multiple features can also be specified, as discussed below.
Example 1 (AND Operator)
If you want to select a node with two or more features, say the haswell processor generation and the E5-2640V3 processor, use the AND operator. The ampersand (&) symbol specifies the AND operator.

Syntax: `--constraint="feature1&feature2&...&featureN"`

```shell
srun -n 1 -p normal --constraint="haswell&E5-2640V3" hostname
```
Output

```
srun: job 786930 queued and waiting for resources
srun: job 786930 has been allocated resources
discovery-c2
```
Explanation

The above `srun` command prints the hostname of the compute node, discovery-c2, which has the haswell processor generation and the E5-2640V3 processor.
Note: when specifying multiple features for the `--constraint` flag, enclose the expression in quotes so that characters such as & and | are not interpreted by the shell.
Example 2 (OR Operator)
If you want to select a node that has either one of two features, say the haswell or skylake processor generation, use the OR operator. The vertical bar (|) specifies the OR operator.

Syntax: `--constraint="feature1|feature2|...|featureN"`

```shell
srun -n 1 -p normal --constraint="haswell|skylake" hostname
```
Output

```
srun: job 787045 queued and waiting for resources
srun: job 787045 has been allocated resources
discovery-c3
```
Explanation

The above `srun` command prints the hostname of the compute node, discovery-c3, which has one of the features (the haswell processor generation) specified with the OR operator in the `--constraint` flag.
Example 3 (Node Count)
You can also specify the number of nodes needed with a given feature by appending an asterisk (*) and a count after the feature name. Consider the example below.

```shell
srun -p backfill --nodes=4 --ntasks-per-node=1 --constraint="intel*4" hostname
```

The above example requests a job to run on 4 nodes on the backfill partition, with a single task running on each node. The `--constraint="intel*4"` flag requests at least 4 nodes whose processor manufacturer is intel.
Output

```
srun: job 788208 queued and waiting for resources
srun: job 788208 has been allocated resources
discovery-c34
discovery-c35
discovery-c29
discovery-g1
```
Explanation

Each task prints the hostname of its node. If you inspect the manufacturer feature of the nodes (discovery-c34, discovery-c35, discovery-c29, discovery-g1), all of them have intel processors because the constraint is `--constraint="intel*4"`. However, if you specify `--nodes=4 --constraint="intel*2"`, then at least two nodes will have intel processors, and the other two nodes can have any features, depending on resource availability.
Using SBATCH
You can also specify node features using the `--constraint` flag in an sbatch script. Below is an example Slurm batch script that uses the `--constraint` flag to request a node with GPU capability and the v100-32g GPU-with-memory tag. Consider the script.sh file below.
```shell
#!/bin/bash
#SBATCH --output result.out
#SBATCH --ntasks=1
#SBATCH --gpus-per-task=1
#SBATCH --mem-per-cpu=500M
#SBATCH --partition=backfill
#SBATCH --constraint="gpu&v100-32g"
#SBATCH --time=0-01:00:00

## Insert code, and run your programs here (use 'srun').
srun nvidia-smi
```
Submit the job using the command `sbatch script.sh`. Once the job completes and exits the queue, open the generated output file result.out.
Output

```
Thu Sep  2 18:16:32 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 470.42.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:2F:00.0 Off |                    0 |
| N/A   39C    P0    24W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
Explanation

The above output from `nvidia-smi` (a command-line utility) shows that the job ran on a GPU node with a Tesla V100 GPU and 32 GB of GPU memory.
Other Options
Some other options are:
Matching OR
If only one of a set of possible options should be used for all allocated nodes, then use the OR operator and enclose the options within square brackets. For example, --constraint="[rack1|rack2|rack3|rack4]" might be used to specify that all nodes must be allocated on a single rack of the cluster, but any of those four racks can be used.
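On Discovery, the same construct could keep an entire allocation within a single processor generation. This is a hypothetical sketch; the partition and node count are illustrative:

```shell
# All 3 nodes must share one processor generation: all haswell,
# all broadwell, or all skylake -- never a mix across generations.
srun -p normal --nodes=3 --ntasks-per-node=1 \
    --constraint="[haswell|broadwell|skylake]" hostname
```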
Multiple Counts
Specific counts of multiple resources may be specified by using the AND operator and enclosing the options within square brackets. For example, --constraint="[rack1*2&rack2*4]" might be used to specify that two nodes must be allocated from nodes with the feature rack1 and four nodes must be allocated from nodes with the feature rack2. NOTE: This construct doesn't support multiple Intel KNL NUMA or MCDRAM modes. For example, while --constraint="[(knl&quad)*2&(knl&hemi)*4]" isn't supported, --constraint="[haswell*2&(knl&hemi)*4]" is supported. Specification of multiple KNL modes requires the use of a heterogeneous job.
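Mapped onto Discovery's own feature tags, a multiple-count request might look like the following sketch (the partition and counts are illustrative):

```shell
# Request 2 nodes with the haswell feature plus 1 node with the
# skylake feature in a single 3-node allocation.
srun -p backfill --nodes=3 --ntasks-per-node=1 \
    --constraint="[haswell*2&skylake*1]" hostname
```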
Brackets
Brackets can be used to indicate that you are looking for a set of nodes with the different requirements contained within the brackets. For example, --constraint="[(rack1|rack2)*1&(rack3)*2]" will get you one node with either the "rack1" or "rack2" features and two nodes with the "rack3" feature. The same request without the brackets will try to find a single node that meets those requirements.
Parentheses
Parentheses can be used to group node features together. For example, --constraint="[(knl&snc4&flat)*4&haswell*1]" might be used to specify that four nodes with the features knl, snc4, and flat plus one node with the feature "haswell" are required. All options within parentheses should be grouped with AND (for example, "&") operands.
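The same grouping works from a batch script. The sketch below uses Discovery's own feature tags; the partition, time limit, and output file are illustrative:

```shell
#!/bin/bash
#SBATCH --output=result.out
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=1
#SBATCH --partition=backfill
#SBATCH --time=0-00:10:00
## One node that has both the gpu and v100 features, plus two intel nodes.
#SBATCH --constraint="[(gpu&v100)*1&intel*2]"

srun hostname
```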