GPU Jobs

This page explains how to submit GPU jobs to the Discovery Cluster using Slurm.

Partitions with GPUs

Below is the list of partitions that has GPUs.

Partition

GPU Nodes

Maximum Walltime

backfill

discovery-g[1-13, 16]

14-02:00:00 (14 days and 2 hours)

cfdlab

discovery-g[2-6]

7-01:00:00 (7 days and 1 hour)

iiplab

discovery-g7

7-01:00:00 ( 7 days and 1 hour)

interactive

discovery-g[14-15]

1-01:00:00 (1 day and 1 hour)

normal

discovery-g[1-6, 8-13, 16]

7-01:00:00 (7 days and 1 hour)

Partitions such as cfdlab, iiplab are condo partitions and restricted to certain team/research group. For more information about the rules/work policies associated with the partitions, refer to the page → Partitions in Discovery.

For more information about the NVIDIA GPU nodes in Discovery and its specifications, refer to the following page → GPU Nodes in Discovery

CUDA Module

CUDA is a parallel computing platform and programming model developed by NVIDIA for computational tasks on GPUs. With CUDA, programmers can speed up the computations significantly by making use of the GPUs.
CUDA is available through the module system in Discovery. To search for CUDA modules, run the following command:

$ module spider cuda
----------------------------------------------------------------------------
  cuda:
----------------------------------------------------------------------------
     Versions:
        cuda/11.8.0-2023a-gcc_12.2.0-tnzr2gn
        cuda/12.1.0-2023a-gcc_12.2.0-eqyilhs

----------------------------------------------------------------------------
  For detailed information about a specific "cuda" package (including how to load the modules) use the module's full name.
  Note that names that have a trailing (E) are extensions provided by other modules.
  For example:

     $ module spider cuda/12.1.0-2023a-gcc_12.2.0-eqyilhs
----------------------------------------------------------------------------

The above output shows that at the time of preparing this documentation, two CUDA modules are available. To load the CUDA module cuda/12.1.0-2023a-gcc_12.2.0-eqyilhs to your environment, you need to find the dependencies of that module, run the following command.

$ module spider cuda/12.1.0-2023a-gcc_12.2.0-eqyilhs

----------------------------------------------------------------------------
  cuda: cuda/12.1.0-2023a-gcc_12.2.0-eqyilhs
----------------------------------------------------------------------------

    You will need to load all module(s) on any one of the lines below before the "cuda/12.1.0-2023a-gcc_12.2.0-eqyilhs" module is available to load.

      spack/2023a  gcc/12.2.0-2023a-gcc_8.5.0-e643dqu

    Help:
      CUDA is a parallel computing platform and programming model invented by
      NVIDIA. It enables dramatic increases in computing performance by
      harnessing the power of the graphics processing unit (GPU). Note: This
      package does not currently install the drivers necessary to run CUDA.
      These will need to be installed manually. See:
      https://docs.nvidia.com/cuda/ for details.

To load the module in your environment, load its dependencies first, as shown in the following command:

module load spack/2023a  gcc/12.2.0-2023a-gcc_8.5.0-e643dqu
module load cuda/12.1.0-2023a-gcc_12.2.0-eqyilhs

To run the script on discovery-g1 make sure that the CUDA version is 11.

If you are using containers (Apptainer) or conda environment, you need to bring your CUDA instead of using CUDA module.

It will load the default CUDA compiler nvcc in the path. Several versions of CUDA are available and can be loaded as well. To learn more about loading specific version of a module, searching modules, so on in Discovery, refer to the page Module Environments and Commands.

Requesting GPUs

Jobs won’t be allocated GPUs unless it’s specified with the following options during the resource request using sbatch or srun.

Options

Explanation

--gres

Generic resources required per node

--gpus

GPUs required per job

--gpus-per-node

GPUs required per node. Equal to the --gres option for GPUs.

--gpus-per-socket

GPUs required per socket. Requires the job to specify a task socket.

--gpus-per-task

GPUs required per task. Requires the job to specify a task count. This is the recommended option for the GPU jobs.

GPU Node Features

Nodes in Discovery have feature tags assigned to them. GPU nodes have been tagged with extra feature tags based on the GPU capability, GPU name, GPU name with GPU memory amount in addition to the Manufacturer, HyperThreading, Processor name and Processor generation tags. You can select the GPU nodes with certain features using the --constraint flag. For more information on how to select a GPU node with certain features or to find the list of all available features associated with a GPU node, refer to the page Node Features.

Feature Tags

Below are the list of features (GPU related) tagged with the GPU Nodes in Discovery.

Nodes

Available Features

discovery-g1

intel, ht, haswell, E5-2640V3, gpu, k40m, k40m-11g

discovery-g[2–6]

intel, ht, skylake, xeon-gold-5117, gpu, p100, p100-16g

discovery-g7

intel, ht, skylake, xeon-gold-5120, gpu, v100, v100-16g

discovery-g[8–11]

intel, ht, cascade-lake, xeon-gold-5218, gpu, v100, v100-32g

discovery-g[12–13]

amd,ht,rome,epyc-7282,gpu,a100,a100-40g

discovery-g[14–15]

amd, ht, rome, epyc-7282, gpu, mig, a100_1g.5gb

discovery-g16

intel, ht, skylake, xeon-gold-5118, gpu, t4, t4-16g

The last feature of each node represents the GPU memory size of that node. For example, discovery-g12 has the feature v100-32g. This means that the GPU memory size of the node is 32 GB.

When requesting a GPU, you will get it’s entire memory once allocated.

Using SBATCH

Example 1

A simple example that uses GPU and prints the GPU information is shown below.

Create two files called script.sh and stats.cu. Then copy and paste the below codes.

script.sh
stats.cu

Run the following command to create the file.

vi script.sh

Copy and Paste the below code into the file.

#!/bin/bash

##Resource Request

#SBATCH --job-name CudaJob
#SBATCH --output result.out   ## filename of the output; the %j is equivalent to jobID; default is slurm-[jobID].out
#SBATCH --partition=backfill  ## the partitions to run in (comma seperated)
#SBATCH --ntasks=1  ## number of tasks (analyses) to run
#SBATCH --gpus-per-task=1 # number of gpus per task
#SBATCH --constraint=v100 # select node with v100 GPU
#SBATCH --mem-per-gpu=100M # Memory allocated for the job
#SBATCH --time=0-00:10:00  ## time for analysis (day-hour:min:sec)

##Load the CUDA module
module load spack/2023a  gcc/12.2.0-2023a-gcc_8.5.0-e643dqu
module load cuda/12.1.0-2023a-gcc_12.2.0-eqyilhs

##Compile the cuda script using the nvcc compiler
nvcc -o stats stats.cu

## Run the script
srun stats

In the above Slurm script script.sh, 1 GPU was requested for 1 single task on the backfill partition. Also, 10 minutes of Walltime and 100MB of memory per GPU were requested. The --ntasks is set to 1 which means that the number of processes or tasks to carry out. Hence, the srun command will be executed only once. Also, it’s mandatory to specify the Walltime. If not specified, the default time limit of 1 minute will be allocated and the job gets killed after the time gets elapsed. After the resources become available, the job will start to run.

Because the Slurm script involves a CUDA program to run, the CUDA module needs to be loaded. Hence, include the module load cuda command which loads the default version of CUDA module in Discovery. After loading the module, the CUDA script stats.cu will be compiled using the nvcc compiler and generates the executable called stats. Next, the srun command executes the CUDA script and produces the output.

Run the following command to create the file.

vi stats.cu

Copy and Paste the below code.

#include <stdio.h>
#include <cuda_runtime.h>

void printDeviceInfo(cudaDeviceProp prop) {

   printf("Name                         - %s\n",  prop.name);
   printf("Total global memory          - %lu MB \n", prop.totalGlobalMem/(1024*1024));
   printf("Total constant memory        - %lu KB \n", prop.totalConstMem/1024);

   printf("Shared memory per block      - %lu KB \n", prop.sharedMemPerBlock/1024);
   printf("Total registers per block    - %d\n", prop.regsPerBlock);
   printf("Maximum threads per block    - %d\n", prop.maxThreadsPerBlock);

   printf("Clock rate                   - %d\n",  prop.clockRate);
   printf("Number of multi-processors   - %d\n",  prop.multiProcessorCount);

  }

int main( ) {

    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    printf("Available CUDA devices - %d\n", deviceCount);
    for (int i=0;i<deviceCount;i++){

        // Device informatioon
        printf("\nCUDA Device #%d\n", i);
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printDeviceInfo(prop);

    }
}

Now, you need to make the job script as an executable which can be done by running the chmod +x command.
```
chmod +x script.sh
```
Submit the Job Script. It can be done by using the following sbatch command.
```
sbatch script.sh
Submitted batch job 789079
```
After the job gets completed and exits from the queue, a new output file would be generated called result.out in your working directory.
```
$ls
script.sh stats stats.cu result.out
```
To view the output, type the following command from your working directory.
```
$vi result.out
```

Output

Available CUDA devices - 1

CUDA Device #0
Name                         - Tesla V100-PCIE-32GB
Total global memory          - 32510 MB
Total constant memory        - 64 KB
Shared memory per block      - 48 KB
Total registers per block    - 65536
Maximum threads per block    - 1024
Clock rate                   - 1380000
Number of multi-processors   - 80

Explanation

The program prints the CUDA device information like name, global memory, shared memory, so on via API functions. Also, the job ran on the node with v100 GPU since it’s specified in the --constraint flag in the SBATCH resource request. To check the job statistics or find out the allocated resources (GPU) for the job, run the following sacct command.

sacct -j 789079 --format=jobid,jobname,alloctres%50,elapsed,state,exitcode

Output

JobID    JobName                                          AllocTRES             Elapsed  State   ExitCode
------------ ---------- -------------------------------------------------- ---------- ---------- --------
789079          CudaJob         billing=4,cpu=4,gres/gpu=1,mem=100M,node=1   00:00:06  COMPLETED      0:0
789079.batch      batch                   cpu=4,gres/gpu=1,mem=100M,node=1   00:00:06  COMPLETED      0:0
789079.exte+     extern         billing=4,cpu=4,gres/gpu=1,mem=100M,node=1   00:00:06  COMPLETED      0:0
789079.0          stats                      cpu=4,gres/gpu=1,mem=0,node=1   00:00:01  COMPLETED      0:0

The job has been successfully completed which can be inferred from the above output. The AllocTRES field denotes the resources allocated to the job/step after the job started running. The gres/gpu=1,mem=100M,node=1 information clearly states that 1 GPU has been allocated with 100M of Memory on a single GPU node for the completed job.

Example 2

Below example shows how to perform addition of two arrays using CUDA.

Create two files called script.sh and add.cu inside the folder. Then copy and paste the below codes.

script.sh
add.cu

Run the following command to create the file.

vi script.sh

Copy and Paste the below code into the file.

#!/bin/bash

##Resource Request

#SBATCH --job-name CudaJob
#SBATCH --output result.out   ## filename of the output; the %j is equal to jobID; default is slurm-[jobID].out
#SBATCH --ntasks=2  ## number of tasks (analyses) to run
#SBATCH --gpus-per-task=1 # number of gpus per task
#SBATCH --mem-per-gpu=100M # Memory allocated per gpu
#SBATCH --constraint=mig # Select Node with Multi-Instance GPU
#SBATCH --partition=backfill  ## the partitions to run in (comma seperated)
#SBATCH --time=0-00:10:00  ## time for analysis (day-hour:min:sec)

##Load the modules
module load spack/2023a  gcc/12.2.0-2023a-gcc_8.5.0-e643dqu
module load cuda/12.1.0-2023a-gcc_12.2.0-eqyilhs

##Compile the cuda script
nvcc -o add add.cu

## Run the script
srun -n 1 add 1 &
srun -n 1 add 2 &
wait

In the above job script script.sh, the --ntasks is set to 2 and 1 GPU was requested for each task. The partition is set to be backfill. Also, 10 minutes of Walltime, 100M of memory per GPU were requested.

Next, the CUDA module is loaded which is required for the CUDA script add.cu to compile and run. After loading the module, the CUDA program is complied using the nvcc compiler and produces the executable called add if there are no compilation errors.

In this example, the number of job steps is 2 and each srun command has the ampersand(&) symbol at the end. This symbol is mandatory if the commands need to be executed at the same time. It’s also required to use the wait command at the end when running the commands simultaneously. It ensures that a given task doesn’t cancel itself because of the completion of other tasks.

Also, the total number of tasks in the job steps(srun -n) must be equal to the --ntasks value in the resource request. Here, the -n flag and its value (1) in a serial execution context is used to specify the number of times a given program should be executed.

The srun command should also contain the CUDA executable name which is add in this case followed by a command-line argument if the program requires. In the first job step, the add executable is followed by 1 which is the command-line argument for this task. in the second task, it has the command-line argument as 2.

Run the following command to create the file.

vi add.cu

Copy and Paste the below code.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

__global__ void addKernel(int* c, const int* a, const int* b, int size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {
        c[i] = a[i] + b[i];
    }
}

// Helper function for using CUDA to add vectors in parallel.
void addWithCuda(int* c, const int* a, const int* b, int size) {
    int* dev_a = NULL;
    int* dev_b = NULL;
    int* dev_c = NULL;

    // Allocate GPU buffers for three vectors (two input, one output)
    cudaMalloc((void**)&dev_c, size * sizeof(int));
    cudaMalloc((void**)&dev_a, size * sizeof(int));
    cudaMalloc((void**)&dev_b, size * sizeof(int));

    // Copy input vectors from host memory to GPU buffers.
    cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);

    // Launch a kernel on the GPU with one thread for each element.
    // 2 is number of computational blocks and (size + 1) / 2 is a number of threads in a block
    addKernel<<<2, (size + 1) / 2>>>(dev_c, dev_a, dev_b, size);

    // cudaDeviceSynchronize waits for the kernel to finish, and returns
    // any errors encountered during the launch.
    cudaDeviceSynchronize();

    // Copy output vector from GPU buffer to host memory.
    cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dev_c);
    cudaFree(dev_a);
    cudaFree(dev_b);
}

int main(int argc, char** argv) {
    const int arraySize = 5;
    const int a[arraySize] = {  1,  2,  3,  4,  5 };
    const int b[arraySize] = { 10, 20, 30, 40, 50 };
    const int c[arraySize] = {11, 12, 13, 14, 15};
    const int d[arraySize] = {100, 200, 300, 400, 500};

    int result[arraySize] = { 0 };
    int input = atoi(argv[1]);

    if (input == 1) {
    	addWithCuda(result, a, b, arraySize);
    }
    else {
    	addWithCuda(result, c, d, arraySize);
    }
    //Printing the output
    printf("Addition of two arrays = {%d, %d, %d, %d, %d}\n", result[0], result[1], result[2], result[3], result[4]);

    cudaDeviceReset();

    return 0;
}

The above CUDA program add.cu takes a command-line argument and depending on it’s value, it begins adding two arrays and produces the result. If the command-line argument value is 1, it takes up the arrays a and b, performs the addition and displays the result. If the command-line argument is a value other than 1, then it adds the arrays c and d and produces the output.

Now, you need to make the job script as an executable which can be done by running the chmod +x command.
```
chmod +x script.sh
```
Submit the Job Script. It can be done by using the following sbatch command.
```
sbatch script.sh
Submitted batch job 789162
```
After the job gets completed and exits from the queue, a new output file would be generated called result.out in your working directory along with the executable add.
```
$ls
script.sh add add.cu result.out
```
To view the output, type the following command from your working directory.
```
$vi result.out
```

Output

Addition of two arrays = {11, 22, 33, 44, 55}
Addition of two arrays = {111, 212, 313, 414, 515}

Explanation

The job computes the addition of two arrays using a CUDA script and prints the output. To find the statistics of the job, run the following sacct command.

sacct -j 789162 --format=jobid,jobname,alloctres%50,elapsed,state,exitcode

JobID         JobName                                          AllocTRES         Elapsed    State     ExitCode
------------ ---------- -------------------------------------------------- ---------- ---------- --------
789162       AdditionC+         billing=8,cpu=8,gres/gpu=2,mem=200M,node=1   00:00:10  COMPLETED      0:0
789162.batch      batch                   cpu=8,gres/gpu=2,mem=200M,node=1   00:00:10  COMPLETED      0:0
789162.exte+     extern         billing=8,cpu=8,gres/gpu=2,mem=200M,node=1   00:00:10  COMPLETED      0:0
789162.0            add                      cpu=8,gres/gpu=1,mem=0,node=1   00:00:01  COMPLETED      0:0
789162.1            add                      cpu=8,gres/gpu=1,mem=0,node=1   00:00:00  COMPLETED      0:0

The job has been successfully completed which can be inferred from the State field in the above output. The gres/gpu=2,mem=200M,node=1 information from the AllocTRES field clearly states that 2 GPUs have been allocated with 200M of Memory on a single GPU node for the completed job.

Using srun

srun can be used to run jobs interactively on the GPU nodes. Consider the below example.

[<username>@discovery-l2]$ srun -n 1 -p backfill --mem-per-gpu=200M --ntasks=1 --gpus-per-task=1 --constraint=mig -t 01:00:00 --pty /bin/bash

Output

srun: job 789244 queued and waiting for resources
srun: job 789244 has been allocated resources
[username@discovery-g14]$

Explanation

The above srun command denotes that you want to run a login shell(/bin/bash) on the GPU node. Hence, the hostname changed from [username@discovery-l2 ~]$ to [username@discovery-g14 ~]$ once the requested resources become available. Also, discovery-g14 has been allocated for the job because a node with feature mig(Multi-Instance GPUs) was requested using the --constraint flag.

Other NVIDIA Compilers

The NVIDIA HPC SDK (High Performance Computing Software Development Kit) is a suite of compilers, libraries, and tools designed to help developers build high-performance applications, particularly for scientific computing, engineering, and AI workloads. It’s optimized for NVIDIA GPUs and supports multi-core CPUs as well. It supports different compilers including C, C++, and Fortran compilers that support GPU acceleration through CUDA, OpenACC, and OpenMP.

For the more complete NVIDIA compiler set on Discovery, you need to look for nvhpc module.

 $ module spider nvhpc

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  nvhpc:
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Versions:
        nvhpc/22.3-2022a-gcc_12.1.0-ghobkle
        nvhpc/22.3-2022a-gcc_12.1.0-6duwutl
        nvhpc/22.9-2023a-gcc_8.5.0-in6rhug
        nvhpc/24.3-2024a-gcc_8.5.0-cepeoml

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  For detailed information about a specific "nvhpc" package (including how to load the modules) use the module's full name.
  Note that names that have a trailing (E) are extensions provided by other modules.
  For example:

     $ module spider nvhpc/24.3-2024a-gcc_8.5.0-cepeoml
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

For example, to load nvhpc/24.3-2024a-gcc_8.5.0-cepeoml in your environment, you need to find what are the dependencies of that module:

$ module spider nvhpc/24.3-2024a-gcc_8.5.0-cepeoml
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  nvhpc: nvhpc/24.3-2024a-gcc_8.5.0-cepeoml
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    You will need to load all module(s) on any one of the lines below before the "nvhpc/24.3-2024a-gcc_8.5.0-cepeoml" module is available to load.
      spack/2024a

The output shows that you must load spack/2024a first and then you can load nvhpc/24.3-2024a-gcc_8.5.0-cepeoml module.

References

For more details about GPU management using Slurm, refer to GPU Management in Slurm.