GPU Jobs

This page explains how to submit GPU jobs to the Discovery Cluster using Slurm.

Partitions with GPUs

Below is the list of partitions that have GPU nodes.

Partition       GPU Nodes                   Maximum Walltime
backfill        discovery-g[1-13,16]        14-02:00:00 (14 days and 2 hours)
cfdlab          discovery-g[2-6]            7-01:00:00 (7 days and 1 hour)
iiplab          discovery-g7                7-01:00:00 (7 days and 1 hour)
interactive     discovery-g[14-15]          1-01:00:00 (1 day and 1 hour)
normal          discovery-g[1-6,8-13,16]    7-01:00:00 (7 days and 1 hour)

Partitions such as cfdlab and iiplab are condo partitions and are restricted to specific teams/research groups. For more information about the rules and usage policies associated with the partitions, refer to the page → Partitions in Discovery.

For more information about the NVIDIA GPU nodes in Discovery and their specifications, refer to the page → GPU Nodes in Discovery.

CUDA Module

  • CUDA is a parallel computing platform and programming model developed by NVIDIA for computational tasks on GPUs. With CUDA, programmers can speed up the computations significantly by making use of the GPUs.

  • CUDA is available through the module system in Discovery. To load the latest version of CUDA into your environment, run the following commands.

module load spack/2022a  gcc/12.1.0-2022a-gcc_8.5.0-ivitefn
module load cuda/11.7.0-2022a-gcc_12.1.0-bbyheai
To run the script on discovery-g1, make sure that the CUDA version is 11.
If you are using containers (Apptainer) or a conda environment, you need to bring your own CUDA installation instead of using the CUDA module.
  • This puts the default CUDA compiler, nvcc, on your path. Several other versions of CUDA are available and can be loaded as well. To learn more about loading a specific version of a module, searching for modules, and so on in Discovery, refer to the page Module Environments and Commands.
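
  • For example, after loading the module you can confirm which compiler was placed on your path and list every CUDA version installed on the cluster (a quick check using standard commands; the versions shown will vary):

which nvcc          # path of the CUDA compiler that was just loaded
nvcc --version      # compiler release, e.g. CUDA 11.7
module avail cuda   # list all CUDA modules available on Discovery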

Requesting GPUs

Jobs will not be allocated any GPUs unless they are explicitly requested with one of the following options in the resource request to sbatch or srun.

Options             Explanation
--gres              Generic resources required per node
--gpus              GPUs required per job
--gpus-per-node     GPUs required per node. Equivalent to the --gres option for GPUs.
--gpus-per-socket   GPUs required per socket. Requires the job to also specify a sockets-per-node count.
--gpus-per-task     GPUs required per task. Requires the job to specify a task count. This is the recommended option for GPU jobs.
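
As an illustration, the following resource-request lines are two roughly equivalent ways of asking for two GPUs on a single node (a sketch only; adjust the counts, partition, and memory for your own job):

## Two GPUs on the node, requested as a generic resource
#SBATCH --gres=gpu:2

## Two tasks with one GPU each (the recommended form)
#SBATCH --ntasks=2
#SBATCH --gpus-per-task=1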

GPU Node Features

Nodes in Discovery have feature tags assigned to them. In addition to the manufacturer, hyper-threading, processor name, and processor generation tags, GPU nodes are tagged with extra features describing the GPU capability, the GPU name, and the GPU name combined with the GPU memory size. You can select GPU nodes with certain features using the --constraint flag. For more information on how to select a GPU node with certain features, or to find the list of all available features associated with a GPU node, refer to the page → Node Features.
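
To see these feature tags yourself, you can query Slurm directly. A quick sketch using standard Slurm commands:

sinfo -N -o "%N %f"                                         # list every node with its feature tags
scontrol show node discovery-g12 | grep AvailableFeatures   # feature tags of a single node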

Feature Tags

Below is the list of GPU-related features tagged on the GPU nodes in Discovery.

Nodes                 Available Features
discovery-g1          intel, ht, haswell, E5-2640V3, gpu, k40m, k40m-11g
discovery-g[2-6]      intel, ht, skylake, xeon-gold-5117, gpu, p100, p100-16g
discovery-g7          intel, ht, skylake, xeon-gold-5120, gpu, v100, v100-16g
discovery-g[8-11]     intel, ht, cascade-lake, xeon-gold-5218, gpu, v100, v100-32g
discovery-g[12-13]    amd, ht, rome, epyc-7282, gpu, a100, a100-40g
discovery-g[14-15]    amd, ht, rome, epyc-7282, gpu, mig, a100_1g.5gb
discovery-g16         intel, ht, skylake, xeon-gold-5118, gpu, t4, t4-16g

The last feature tag of each node indicates the GPU memory size on that node. For example, discovery-g12 has the feature a100-40g, which means each GPU on that node has 40 GB of GPU memory.

When a GPU is allocated to your job, you get its entire GPU memory.
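
For example, the following sketches combine a feature tag from the table above with a GPU request (the | operator in a constraint expression means "either one", which is standard Slurm syntax):

## One GPU on a node with an A100 that has 40 GB of GPU memory
#SBATCH --gpus-per-task=1
#SBATCH --constraint=a100-40g

## One GPU on a node with either a V100 or an A100
#SBATCH --gpus-per-task=1
#SBATCH --constraint="v100|a100"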

Using SBATCH

Example 1

A simple example that requests a GPU and prints the GPU information is shown below.

  1. Log in to Discovery. Create a new folder in your home directory and switch to that folder.

  2. Create two files called script.sh and stats.cu. Then copy and paste the code below into them.

    • script.sh

    • stats.cu

    Run the following command to create the file.

    vi script.sh

    Copy and Paste the below code into the file.

    #!/bin/bash
    
    ##Resource Request
    
    #SBATCH --job-name CudaJob
    #SBATCH --output result.out   ## filename of the output; %j in a filename expands to the job ID; default is slurm-[jobID].out
    #SBATCH --partition=backfill  ## the partition(s) to run in (comma separated)
    #SBATCH --ntasks=1  ## number of tasks (analyses) to run
    #SBATCH --gpus-per-task=1 # number of gpus per task
    #SBATCH --constraint=v100 # select node with v100 GPU
    #SBATCH --mem-per-gpu=100M # Memory allocated per GPU
    #SBATCH --time=0-00:10:00  ## time for analysis (day-hour:min:sec)
    
    ##Load the CUDA module
    module load cuda
    
    ##Compile the cuda script using the nvcc compiler
    nvcc -o stats stats.cu
    
    ## Run the script
    srun stats

    In the above Slurm script script.sh, 1 GPU is requested for a single task on the backfill partition, along with 10 minutes of Walltime and 100 MB of memory per GPU. The --ntasks option is set to 1, which is the number of processes or tasks to carry out; hence, the srun command will be executed only once. Always specify the Walltime: if it is not specified, the default time limit of 1 minute is applied and the job is killed once that time has elapsed. After the requested resources become available, the job starts to run.

    Because the Slurm script runs a CUDA program, the CUDA module needs to be loaded; hence the module load cuda command, which loads the default version of the CUDA module in Discovery. After the module is loaded, the CUDA source file stats.cu is compiled with the nvcc compiler, producing an executable called stats. Finally, the srun command runs the executable and produces the output.

    Run the following command to create the file.

    vi stats.cu

    Copy and Paste the below code.

    #include <stdio.h>
    #include <cuda_runtime.h>
    
    void printDeviceInfo(cudaDeviceProp prop) {
    
       printf("Name                         - %s\n",  prop.name);
       printf("Total global memory          - %lu MB \n", prop.totalGlobalMem/(1024*1024));
       printf("Total constant memory        - %lu KB \n", prop.totalConstMem/1024);
    
       printf("Shared memory per block      - %lu KB \n", prop.sharedMemPerBlock/1024);
       printf("Total registers per block    - %d\n", prop.regsPerBlock);
       printf("Maximum threads per block    - %d\n", prop.maxThreadsPerBlock);
    
       printf("Clock rate                   - %d\n",  prop.clockRate);
       printf("Number of multi-processors   - %d\n",  prop.multiProcessorCount);
    
      }
    
    int main( ) {
    
        int deviceCount;
        cudaGetDeviceCount(&deviceCount);
        printf("Available CUDA devices - %d\n", deviceCount);
        for (int i=0;i<deviceCount;i++){
    
            // Device information
            printf("\nCUDA Device #%d\n", i);
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printDeviceInfo(prop);
    
        }
    }
  3. Now, make the job script executable, which can be done by running the chmod +x command.

    chmod +x script.sh
  4. Submit the job script using the following sbatch command.

    sbatch script.sh
    Submitted batch job 789079
  5. After the job completes and exits from the queue, a new output file called result.out will be generated in your working directory.

    $ls
    script.sh stats stats.cu result.out
  6. To view the output, type the following command from your working directory.

    $vi result.out
  7. Output

    Available CUDA devices - 1
    
    CUDA Device #0
    Name                         - Tesla V100-PCIE-32GB
    Total global memory          - 32510 MB
    Total constant memory        - 64 KB
    Shared memory per block      - 48 KB
    Total registers per block    - 65536
    Maximum threads per block    - 1024
    Clock rate                   - 1380000
    Number of multi-processors   - 80
  8. Explanation

    The program prints CUDA device information such as the name, global memory, and shared memory via CUDA runtime API functions. The job ran on a node with a V100 GPU because that was specified with the --constraint flag in the SBATCH resource request. To check the job statistics or find out the resources (GPU) allocated to the job, run the following sacct command.

    sacct -j 789079 --format=jobid,jobname,alloctres%50,elapsed,state,exitcode

    Output

    JobID    JobName                                          AllocTRES             Elapsed  State   ExitCode
    ------------ ---------- -------------------------------------------------- ---------- ---------- --------
    789079          CudaJob         billing=4,cpu=4,gres/gpu=1,mem=100M,node=1   00:00:06  COMPLETED      0:0
    789079.batch      batch                   cpu=4,gres/gpu=1,mem=100M,node=1   00:00:06  COMPLETED      0:0
    789079.exte+     extern         billing=4,cpu=4,gres/gpu=1,mem=100M,node=1   00:00:06  COMPLETED      0:0
    789079.0          stats                      cpu=4,gres/gpu=1,mem=0,node=1   00:00:01  COMPLETED      0:0

    The job completed successfully, as can be inferred from the above output. The AllocTRES field denotes the resources allocated to the job/step after the job started running. The gres/gpu=1,mem=100M,node=1 entry states that 1 GPU and 100M of memory were allocated on a single GPU node for the completed job. A short sketch of confirming the allocated GPU from inside the job itself follows this example.
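
In addition to sacct, you can confirm which GPU was assigned from inside the job itself. A minimal sketch, assuming the same script.sh as above (nvidia-smi and the CUDA_VISIBLE_DEVICES variable are standard on NVIDIA GPU nodes):

## Add these lines to script.sh after the module load command
echo "CUDA_VISIBLE_DEVICES = $CUDA_VISIBLE_DEVICES"   ## GPU(s) that Slurm made visible to the job
nvidia-smi -L                                         ## e.g. "GPU 0: Tesla V100-PCIE-32GB"

The extra output is written to the same result.out file as the rest of the job.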

Example 2

The example below shows how to add two arrays using CUDA.

  1. Log in to Discovery. Create a new folder in your home directory and switch to that folder.

  2. Create two files called script.sh and add.cu inside the folder. Then copy and paste the code below into them.

    • script.sh

    • add.cu

    Run the following command to create the file.

    vi script.sh

    Copy and Paste the below code into the file.

    #!/bin/bash
    
    ##Resource Request
    
    #SBATCH --job-name AdditionCuda
    #SBATCH --output result.out   ## filename of the output; %j in a filename expands to the job ID; default is slurm-[jobID].out
    #SBATCH --ntasks=2  ## number of tasks (analyses) to run
    #SBATCH --gpus-per-task=1 # number of gpus per task
    #SBATCH --mem-per-gpu=100M # Memory allocated per gpu
    #SBATCH --constraint=mig # Select Node with Multi-Instance GPU
    #SBATCH --partition=backfill  ## the partition(s) to run in (comma separated)
    #SBATCH --time=0-00:10:00  ## time for analysis (day-hour:min:sec)
    
    ##Load the modules
    module load cuda
    
    ##Compile the cuda script
    nvcc -o add add.cu
    
    ## Run the script
    srun -n 1 add 1 &
    srun -n 1 add 2 &
    wait

    In the above job script script.sh, --ntasks is set to 2 and 1 GPU is requested for each task. The partition is set to backfill. Also, 10 minutes of Walltime and 100M of memory per GPU were requested.

    Next, the CUDA module is loaded, which is required to compile and run the CUDA script add.cu. After the module is loaded, the CUDA program is compiled using the nvcc compiler and, if there are no compilation errors, produces an executable called add.

    In this example, there are 2 job steps and each srun command ends with an ampersand (&). The ampersand is mandatory when the commands need to execute at the same time: it places each job step in the background. The wait command at the end is also required when running commands simultaneously; it makes the batch script wait for all background job steps to finish instead of exiting as soon as the last command has been launched. (A generalized sketch of this pattern is shown after this example.)

    Also, the total number of tasks across the job steps (srun -n) must equal the --ntasks value in the resource request. Here, the -n flag with the value 1 specifies that each job step runs a single task.

    The srun command should also contain the name of the CUDA executable, add in this case, followed by a command-line argument if the program requires one. In the first job step, the add executable is followed by 1, which is the command-line argument for that task; in the second job step, the command-line argument is 2.

    Run the following command to create the file.

    vi add.cu

    Copy and Paste the below code.

    #include "cuda_runtime.h"
    #include "device_launch_parameters.h"
    
    #include <stdio.h>
    #include <stdlib.h>   // for atoi
    
    __global__ void addKernel(int* c, const int* a, const int* b, int size) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < size) {
            c[i] = a[i] + b[i];
        }
    }
    
    // Helper function for using CUDA to add vectors in parallel.
    void addWithCuda(int* c, const int* a, const int* b, int size) {
        int* dev_a = NULL;
        int* dev_b = NULL;
        int* dev_c = NULL;
    
        // Allocate GPU buffers for three vectors (two input, one output)
        cudaMalloc((void**)&dev_c, size * sizeof(int));
        cudaMalloc((void**)&dev_a, size * sizeof(int));
        cudaMalloc((void**)&dev_b, size * sizeof(int));
    
        // Copy input vectors from host memory to GPU buffers.
        cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice);
    
        // Launch a kernel on the GPU with one thread for each element.
        // 2 is the number of thread blocks and (size + 1) / 2 is the number of threads per block
        addKernel<<<2, (size + 1) / 2>>>(dev_c, dev_a, dev_b, size);
    
        // cudaDeviceSynchronize waits for the kernel to finish, and returns
        // any errors encountered during the launch.
        cudaDeviceSynchronize();
    
        // Copy output vector from GPU buffer to host memory.
        cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost);
    
        cudaFree(dev_c);
        cudaFree(dev_a);
        cudaFree(dev_b);
    }
    
    int main(int argc, char** argv) {
        const int arraySize = 5;
        const int a[arraySize] = {  1,  2,  3,  4,  5 };
        const int b[arraySize] = { 10, 20, 30, 40, 50 };
        const int c[arraySize] = {11, 12, 13, 14, 15};
        const int d[arraySize] = {100, 200, 300, 400, 500};
    
        int result[arraySize] = { 0 };
        // Default to the first pair of arrays if no argument was given
        int input = (argc > 1) ? atoi(argv[1]) : 1;
    
        if (input == 1) {
        	addWithCuda(result, a, b, arraySize);
        }
        else {
        	addWithCuda(result, c, d, arraySize);
        }
        //Printing the output
        printf("Addition of two arrays = {%d, %d, %d, %d, %d}\n", result[0], result[1], result[2], result[3], result[4]);
    
        cudaDeviceReset();
    
        return 0;
    }

    The above CUDA program add.cu takes a command-line argument and, depending on its value, adds two of the arrays and prints the result. If the command-line argument is 1, it takes arrays a and b, performs the addition, and displays the result. If the command-line argument is any other value, it adds arrays c and d and produces the output.

  3. Now, make the job script executable, which can be done by running the chmod +x command.

    chmod +x script.sh
  4. Submit the job script using the following sbatch command.

    sbatch script.sh
    Submitted batch job 789162
  5. After the job completes and exits from the queue, a new output file called result.out will be generated in your working directory, along with the executable add.

    $ls
    script.sh add add.cu result.out
  6. To view the output, type the following command from your working directory.

    $vi result.out
  7. Output

    Addition of two arrays = {11, 22, 33, 44, 55}
    Addition of two arrays = {111, 212, 313, 414, 515}
  8. Explanation

    The job computes the addition of two arrays using a CUDA script and prints the output. To find the statistics of the job, run the following sacct command.

    sacct -j 789162 --format=jobid,jobname,alloctres%50,elapsed,state,exitcode
    JobID         JobName                                          AllocTRES         Elapsed    State     ExitCode
    ------------ ---------- -------------------------------------------------- ---------- ---------- --------
    789162       AdditionC+         billing=8,cpu=8,gres/gpu=2,mem=200M,node=1   00:00:10  COMPLETED      0:0
    789162.batch      batch                   cpu=8,gres/gpu=2,mem=200M,node=1   00:00:10  COMPLETED      0:0
    789162.exte+     extern         billing=8,cpu=8,gres/gpu=2,mem=200M,node=1   00:00:10  COMPLETED      0:0
    789162.0            add                      cpu=8,gres/gpu=1,mem=0,node=1   00:00:01  COMPLETED      0:0
    789162.1            add                      cpu=8,gres/gpu=1,mem=0,node=1   00:00:00  COMPLETED      0:0

    The job completed successfully, as can be inferred from the State field in the above output. The gres/gpu=2,mem=200M,node=1 entry in the AllocTRES field states that 2 GPUs and 200M of memory were allocated on a single GPU node for the completed job.
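
The pattern used above, launching job steps in the background and waiting for all of them, generalizes to any number of tasks. Below is a minimal sketch of a job script that runs four concurrent single-GPU steps (the job name and loop bound are illustrative; add is the executable built in this example):

#!/bin/bash

#SBATCH --job-name MultiGpuSteps
#SBATCH --output result.out
#SBATCH --partition=backfill
#SBATCH --ntasks=4
#SBATCH --gpus-per-task=1
#SBATCH --mem-per-gpu=100M
#SBATCH --time=0-00:10:00

module load cuda

## Launch one background job step per task, each with its own GPU
for i in 1 2 3 4; do
    srun -n 1 ./add "$i" &
done

## Wait for every background step to finish before the batch script exits
wait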

Using srun

srun can be used to run jobs interactively on the GPU nodes. Consider the example below.

[username@discovery-l2]$ srun -p backfill --mem-per-gpu=200M --ntasks=1 --gpus-per-task=1 --constraint=mig -t 01:00:00 --pty /bin/bash

Output

srun: job 789244 queued and waiting for resources
srun: job 789244 has been allocated resources
[username@discovery-g14]$

Explanation

The above srun command requests a login shell (/bin/bash) on a GPU node. Hence, the prompt changed from [username@discovery-l2 ~]$ to [username@discovery-g14 ~]$ once the requested resources became available. Also, discovery-g14 was allocated for the job because a node with the feature mig (Multi-Instance GPU) was requested using the --constraint flag.
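
Inside the interactive shell you can work with the allocated GPU directly, and the session ends as soon as you leave the shell. A short sketch (standard commands; the device name reported by nvidia-smi depends on the node):

[username@discovery-g14]$ nvidia-smi -L   # confirm the GPU (or MIG slice) visible to this session
[username@discovery-g14]$ exit            # end the interactive session and release the resources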

References

For more details about GPU management using Slurm, refer to → GPU Management in Slurm.