Features

Nodes in Discovery have feature tags assigned to them. Each node is tagged with features describing its manufacturer, hyperthreading, processor name, processor generation, GPU capability, GPU name, GPU name with GPU memory amount, and hybrid memory. Users can select nodes for their jobs based on these feature tags by passing the --constraint flag to sbatch or srun.

Feature Tags

Below is the list of feature tags assigned to the nodes in Discovery.

Nodes                   Available Features
discovery-c[1-6]        intel, ht, haswell, E5-2640V3
discovery-c[7-15]       intel, ht, broadwell, E5-2650V4
discovery-c[16-25]      intel, ht, skylake, xeon-gold-5117
discovery-c[26-35]      intel, ht, cascade-lake, xeon-gold-6226r
discovery-c36           intel, ht, cascade-lake, xeon-gold-5218t
discovery-c37           intel, ht, cascade-lake, xeon-gold-5218, optane, optane-mem
discovery-c38           intel, ht, cascade-lake, xeon-gold-5218
discovery-g1            intel, ht, haswell, E5-2640V3, gpu, k40m, k40m-11g
discovery-g[2-6]        intel, ht, skylake, xeon-gold-5117, gpu, p100, p100-16g
discovery-g7            intel, ht, skylake, xeon-gold-5120, gpu, v100, v100-16g
discovery-g[8-11]       intel, ht, cascade-lake, xeon-gold-5218, gpu, v100, v100-32g
discovery-g[12-13]      amd, ht, rome, epyc-7282, gpu, a100, a100-40g
discovery-g[14-15]      amd, ht, rome, epyc-7282, gpu, mig, a100_1g.5gb
discovery-g16           intel, ht, skylake, xeon-gold-5118, gpu, t4, t4-16g

Tag Information

Below is the list of features associated with each tag category.

Category                Features
Manufacturer            intel, amd
Hyperthreading          ht
Processor Generation    haswell, broadwell, skylake, cascade-lake, rome
Processor Name          E5-2640V3, E5-2650V4, xeon-gold-5117, xeon-gold-5118, xeon-gold-5120, xeon-gold-5218, xeon-gold-5218t, xeon-gold-6226r, epyc-7282
GPU Capability          gpu, mig
GPU Name                k40m, p100, v100, a100, t4
GPU Name with Memory    k40m-11g, p100-16g, v100-16g, v100-32g, a100-40g, a100_1g.5gb, t4-16g
Hybrid Memory           optane, optane-mem
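
To find which nodes carry a particular tag, one option is to combine sinfo's node-oriented output with grep. The command below is a minimal sketch that filters for the haswell tag; the exact output columns and widths depend on your Slurm version.

sinfo -N -o "%N %f" | grep haswell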

Scontrol Command

You can also use the following scontrol command to find the feature tags of a node.

Syntax

scontrol show node discovery-c1 | egrep "NodeName|AvailableFeatures"

Output

NodeName=discovery-c1 Arch=x86_64 CpuBind=threads CoresPerSocket=8
   AvailableFeatures=intel,ht,haswell,E5-2640V3
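
The same scontrol query can also be broadened to every node at once; a minimal sketch (the output can be long on a large cluster):

scontrol show nodes | egrep "NodeName|AvailableFeatures"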

Examples

The examples below run simple jobs using srun and explain how to use the node feature tags to select the desired nodes for your jobs.

Consider the below example.

[username@discovery-l2 ~]$ srun -n 1 -p normal hostname

Output

srun: job 786821 queued and waiting for resources
srun: job 786821 has been allocated resources
discovery-c9

Explanation

  • The above srun command prints the hostname (discovery-c9) of the compute node on the normal partition where the job ran.

  • However, with the above command, you can't choose a compute node with specific features (such as processor generation or processor name) for the job to run on.

  • With the Slurm feature tags associated with each node, you can select a node with certain features for your job using the -C or --constraint flag.

  • If you want the job to run on a compute node with the haswell processor generation on the normal partition, run the below srun command.

srun -n 1 -p normal --constraint=haswell hostname

Output

srun: job 881257 queued and waiting for resources
srun: job 881257 has been allocated resources
discovery-c4

Explanation

The above srun command prints the hostname of the compute node, discovery-c4, which has haswell as its processor generation. Similarly, you can pass other features to the --constraint flag to request a specific kind of node for your job. Multiple features can also be specified, as discussed below.

Example 1 (AND Operator)

If you want to select a node with two or more features, say the haswell processor generation and the E5-2640V3 processor, the AND operator can be used. The ampersand (&) symbol specifies the AND operator.

Syntax: --constraint="feature1&feature2&..n"

srun -n 1 -p normal --constraint="haswell&E5-2640V3" hostname

Output

srun: job 786930 queued and waiting for resources
srun: job 786930 has been allocated resources
discovery-c2

Explanation

The above srun command prints the hostname of the compute node, discovery-c2, which has the haswell processor generation and the E5-2640V3 processor.

If you specify multiple features for the --constraint flag, they must be enclosed in double quotes.

Example 2 (OR Operator)

If you want to select a node that has either one of two or more features, say the haswell or skylake processor generation, the OR operator can be used. The vertical bar (|) specifies the OR operator.

Syntax: --constraint="feature1|feature2|..n"

srun -n 1 -p normal --constraint="haswell|skylake" hostname

Output

srun: job 787045 queued and waiting for resources
srun: job 787045 has been allocated resources
discovery-c3

Explanation

The above srun command prints the hostname of the compute node, discovery-c3, which has one of the features (the haswell processor generation) specified with the OR operator in the --constraint flag.

Example 3 (Node Count)

You can also specify the number of nodes that must have a given feature by appending an asterisk (*) and a count to the feature name. Consider the below example.

srun -p backfill --nodes=4 --ntasks-per-node=1 --constraint="intel*4" hostname

The above example requests a job that runs on 4 nodes on the backfill partition with a single task running on each node. The --constraint="intel*4" flag requests at least 4 nodes that have the intel manufacturer feature.

Output

srun: job 788208 queued and waiting for resources
srun: job 788208 has been allocated resources
discovery-c34
discovery-c35
discovery-c29
discovery-g1

Explanation

Each task prints the hostname of its node. If you inspect the manufacturer feature of the nodes (discovery-c34, discovery-c35, discovery-c29, discovery-g1), all of them have the intel feature because the constraint is specified as --constraint="intel*4". However, if you specify --nodes=4 --constraint="intel*2", then at least two nodes will have the intel feature and the other two nodes can have any features, depending on resource availability.
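
For reference, the same node-count request can also be written as sbatch directives. The sketch below assumes the same partition and counts as the srun example above; the output file name is just a placeholder.

#!/bin/bash

#SBATCH --output hostnames.out
#SBATCH --partition=backfill
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --constraint="intel*4"

## Each task prints the hostname of its node.
srun hostname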

Using SBATCH

You can also specify node features using the --constraint flag in an SBATCH script. Below is an example of a Slurm SBATCH script that uses the --constraint flag to request a node with GPU capability and the v100-32g GPU name with memory feature. Consider the below script.sh file.

#!/bin/bash

#SBATCH --output result.out
#SBATCH --ntasks=1
#SBATCH --gpus-per-task=1
#SBATCH --mem-per-cpu=500M
#SBATCH --partition=backfill
#SBATCH --constraint="gpu&v100-32g"
#SBATCH --time=0-01:00:00


## Insert code, and run your programs here (use 'srun').
srun nvidia-smi

Submit the job using the command sbatch script.sh. Once the job completes and exits the queue, open the generated output file result.out.

Output

Thu Sep  2 18:16:32 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 470.42.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  On   | 00000000:2F:00.0 Off |                    0 |
| N/A   39C    P0    24W / 250W |      0MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Explanation

The above output from nvidia-smi (a command-line utility) shows that the job ran on a GPU node with a Tesla V100 GPU and 32 GB of GPU memory.
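
If you also want to confirm which compute node the job ran on, squeue can show the assigned node while the job is queued or running, and sacct can report it after the job has finished. The commands below are a minimal sketch; the format fields shown are common ones, but their exact behavior can vary slightly across Slurm versions.

squeue -u $USER -o "%.10i %.12j %.10T %N"
sacct -u $USER -X --format=JobID,JobName,NodeList,State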

Other Options

Some other options are:

Matching OR: If only one of a set of possible options should be used for all allocated nodes, then use the OR operator and enclose the options within square brackets. For example, --constraint="[rack1|rack2|rack3|rack4]" might be used to specify that all nodes must be allocated on a single rack of the cluster, but any of those four racks can be used.

Multiple Counts: Specific counts of multiple resources may be specified by using the AND operator and enclosing the options within square brackets. For example, --constraint="[rack1*2&rack2*4]" might be used to specify that two nodes must be allocated from nodes with the feature of rack1 and four nodes must be allocated from nodes with the feature rack2. NOTE: This construct doesn't support multiple Intel KNL NUMA or MCDRAM modes. For example, while --constraint="[(knl&quad)*2&(knl&hemi)*4]" isn't supported, --constraint="[haswell*2&(knl&hemi)*4]" is supported. Specification of multiple KNL modes requires the use of a heterogeneous job.

Brackets: Brackets can be used to indicate that you are looking for a set of nodes with the different requirements contained within the brackets. For example, --constraint="[(rack1|rack2)*1&(rack3)*2]" will get you one node with either the "rack1" or "rack2" features and two nodes with the "rack3" feature. The same request without the brackets will try to find a single node that meets those requirements.

Parenthesis: Parenthesis can be used to group like node features together. For example, --constraint="[(knl&snc4&flat)*4&haswell*1]" might be used to specify that four nodes with the features knl, snc4 and flat plus one node with the feature "haswell" are required. All options within parenthesis should be grouped with AND (For example, "&") operands.

— Slurm SBATCH Commands, https://slurm.schedmd.com/sbatch.html
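
Adapting the matching OR form described above to Discovery's own feature tags, a request like the one below asks for two nodes that all share a single processor generation, either all haswell or all skylake. This is a sketch; the partition and node counts are just placeholders.

srun -p backfill --nodes=2 --ntasks-per-node=1 --constraint="[haswell|skylake]" hostname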