GPU Jobs
This page explains how to submit GPU jobs to the Discovery Cluster using Slurm.
Partitions with GPUs
Below is the list of partitions that have GPUs.
| Partition | GPU Nodes | Maximum Walltime |
| --- | --- | --- |
| backfill | discovery-g[1-13,16] | 14-02:00:00 (14 days and 2 hours) |
| cfdlab | discovery-g[2-6] | 7-01:00:00 (7 days and 1 hour) |
| iiplab | discovery-g7 | 7-01:00:00 (7 days and 1 hour) |
| interactive | discovery-g[14-15] | 1-01:00:00 (1 day and 1 hour) |
| normal | discovery-g[1-6,8-13,16] | 7-01:00:00 (7 days and 1 hour) |
Partitions such as cfdlab and iiplab are condo partitions and are restricted to certain teams/research groups. For more information about the rules and usage policies associated with the partitions, refer to the page Partitions in Discovery.

For more information about the NVIDIA GPU nodes in Discovery and their specifications, refer to the page GPU Nodes in Discovery.
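If you want to double-check a partition's walltime limit and node list from the command line, you can query Slurm directly. The sketch below uses standard sinfo format options; backfill is just an example partition name.

# Show the partition name, its time limit, and the nodes it contains
sinfo -p backfill -o "%P %l %N"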
CUDA Module
- CUDA is a parallel computing platform and programming model developed by NVIDIA for computational tasks on GPUs. With CUDA, programmers can speed up computations significantly by making use of GPUs.
- CUDA is available through the module system in Discovery. To load the latest version of CUDA into your environment, run the following commands.
module load spack/2022a gcc/12.1.0-2022a-gcc_8.5.0-ivitefn
module load cuda/11.7.0-2022a-gcc_12.1.0-bbyheai
To run the script on discovery-g1, make sure that the CUDA version is 11.

Note: If you are using containers (Apptainer) or a conda environment, you need to bring your own CUDA installation instead of using the CUDA module.
- This loads the default CUDA compiler, nvcc, into your path. Several versions of CUDA are available and can be loaded as well. To learn more about loading a specific version of a module, searching for modules, and so on in Discovery, refer to the page Module Environments and Commands.
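To confirm that the module loaded correctly, you can check which nvcc is now on your path and list your loaded modules. This is a quick sanity check; the exact version string printed depends on the CUDA module you loaded.

# Verify the CUDA compiler picked up from the module
which nvcc
nvcc --version

# List the modules currently loaded in your environment
module list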
Requesting GPUs
Jobs will not be allocated any GPUs unless they are explicitly requested with one of the following options in the resource request to sbatch or srun.
| Option | Explanation |
| --- | --- |
| --gres=gpu:<count> | Generic resources required per node. |
| --gpus=<count> | GPUs required per job. |
| --gpus-per-node=<count> | GPUs required per node. Equal to the --gres=gpu:<count> option. |
| --gpus-per-socket=<count> | GPUs required per socket. Requires the job to specify a socket count. |
| --gpus-per-task=<count> | GPUs required per task. Requires the job to specify a task count. This is the recommended option for GPU jobs. |
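As a minimal sketch of the recommended approach, the batch directives below request one GPU for one task; the partition, walltime, and memory values are placeholders to adapt to your own job.

#SBATCH --ntasks=1            # one task
#SBATCH --gpus-per-task=1     # one GPU for that task
#SBATCH --partition=backfill  # example partition
#SBATCH --mem-per-gpu=100M    # example memory per GPU
#SBATCH --time=0-00:10:00     # example walltime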
GPU Node Features
Nodes in Discovery have feature tags assigned to them. GPU nodes are tagged with extra features based on the GPU capability, the GPU name, and the GPU name with its memory amount, in addition to the manufacturer, HyperThreading, processor name, and processor generation tags. You can select GPU nodes with certain features using the --constraint flag. For more information on how to select a GPU node with certain features, or to find the list of all available features associated with a GPU node, refer to the page Node Features.
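If you want to look up the features of a node yourself, the standard Slurm queries below will show them; discovery-g7 is used only as an example node name.

# List every node together with its available features
sinfo -N -o "%N %f"

# Show the full record, including features, for a single node
scontrol show node discovery-g7 | grep -i features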
Feature Tags
Below is the list of GPU-related features tagged on the GPU nodes in Discovery.
| Nodes | Available Features |
| --- | --- |
| discovery-g1 | intel, ht, haswell, E5-2640V3, gpu, k40m, k40m-11g |
| discovery-g[2-6] | intel, ht, skylake, xeon-gold-5117, gpu, p100, p100-16g |
| discovery-g7 | intel, ht, skylake, xeon-gold-5120, gpu, v100, v100-16g |
| discovery-g[8-11] | intel, ht, cascade-lake, xeon-gold-5218, gpu, v100, v100-32g |
| discovery-g[12-13] | amd, ht, rome, epyc-7282, gpu, a100, a100-40g |
| discovery-g[14-15] | amd, ht, rome, epyc-7282, gpu, mig, a100_1g.5gb |
| discovery-g16 | intel, ht, skylake, xeon-gold-5118, gpu, t4, t4-16g |
The last feature of each node indicates the GPU model and its memory size. For example, discovery-g12 has the feature a100-40g, which means that each GPU on that node has 40 GB of GPU memory.

Note: When requesting a GPU, you get its entire memory once the GPU is allocated.
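For instance, to land a job on a V100 with 32 GB of GPU memory, you could constrain it to the corresponding memory-size feature from the table above; this is a sketch to combine with the rest of your resource request.

#SBATCH --gpus-per-task=1
#SBATCH --constraint=v100-32g   # restricts the job to discovery-g[8-11]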
Using SBATCH
Example 1
A simple example that requests a GPU and prints its information is shown below.
- Log in to Discovery. Create a new folder in your home directory and switch to it.
- Create two files called script.sh and stats.cu. Then copy and paste the code below into them.

Run the following command to create the first file:

vi script.sh

Copy and paste the code below into the file:
#!/bin/bash

##Resource Request

#SBATCH --job-name CudaJob
#SBATCH --output result.out ## filename of the output; the %j is equivalent to jobID; default is slurm-[jobID].out

#SBATCH --partition=backfill ## the partitions to run in (comma separated)
#SBATCH --ntasks=1 ## number of tasks (analyses) to run
#SBATCH --gpus-per-task=1 # number of gpus per task
#SBATCH --constraint=v100 # select node with v100 GPU
#SBATCH --mem-per-gpu=100M # memory allocated for the job
#SBATCH --time=0-00:10:00 ## time for analysis (day-hour:min:sec)

##Load the CUDA module
module load cuda

##Compile the cuda script using the nvcc compiler
nvcc -o stats stats.cu

## Run the script
srun stats
In the above Slurm script script.sh, 1 GPU was requested for a single task on the backfill partition, along with 10 minutes of walltime and 100 MB of memory per GPU. The --ntasks option is set to 1, which means only one task (process) is carried out; hence, the srun command will be executed only once. It is also mandatory to specify the walltime: if it is not specified, the default time limit of 1 minute is applied and the job is killed once that time elapses. After the requested resources become available, the job starts to run.

Because the Slurm script runs a CUDA program, the CUDA module needs to be loaded. Hence, the module load cuda command, which loads the default version of the CUDA module in Discovery, is included. After loading the module, the CUDA script stats.cu is compiled with the nvcc compiler, producing the executable stats. Next, the srun command executes the compiled program and produces the output.

Run the following command to create the second file:
vi stats.cu
Copy and paste the code below into the file:
#include <stdio.h>
#include <cuda_runtime.h>

void printDeviceInfo(cudaDeviceProp prop)
{
    printf("Name - %s\n", prop.name);
    printf("Total global memory - %lu MB \n", prop.totalGlobalMem/(1024*1024));
    printf("Total constant memory - %lu KB \n", prop.totalConstMem/1024);
    printf("Shared memory per block - %lu KB \n", prop.sharedMemPerBlock/1024);
    printf("Total registers per block - %d\n", prop.regsPerBlock);
    printf("Maximum threads per block - %d\n", prop.maxThreadsPerBlock);
    printf("Clock rate - %d\n", prop.clockRate);
    printf("Number of multi-processors - %d\n", prop.multiProcessorCount);
}

int main()
{
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    printf("Available CUDA devices - %d\n", deviceCount);

    for (int i = 0; i < deviceCount; i++) {
        // Device information
        printf("\nCUDA Device #%d\n", i);
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printDeviceInfo(prop);
    }
}
- Now make the job script executable by running the chmod +x command.

chmod +x script.sh
- Submit the job script using the following sbatch command.

sbatch script.sh
Submitted batch job 789079
- After the job completes and exits from the queue, a new output file called result.out is generated in your working directory.

$ ls
script.sh  stats  stats.cu  result.out
- To view the output, type the following command from your working directory.

$ vi result.out
- Output
Available CUDA devices - 1

CUDA Device #0
Name - Tesla V100-PCIE-32GB
Total global memory - 32510 MB
Total constant memory - 64 KB
Shared memory per block - 48 KB
Total registers per block - 65536
Maximum threads per block - 1024
Clock rate - 1380000
Number of multi-processors - 80
- Explanation

The program prints CUDA device information, such as the name, global memory, and shared memory, via CUDA API functions. Also, the job ran on a node with a V100 GPU, since v100 was specified through the --constraint flag in the SBATCH resource request. To check the job statistics or to find out the resources (GPU) allocated to the job, run the following sacct command.

sacct -j 789079 --format=jobid,jobname,alloctres%50,elapsed,state,exitcode
Output
JobID         JobName     AllocTRES                                            Elapsed     State       ExitCode
------------  ----------  ---------------------------------------------------  ----------  ----------  --------
789079        CudaJob     billing=4,cpu=4,gres/gpu=1,mem=100M,node=1           00:00:06    COMPLETED   0:0
789079.batch  batch       cpu=4,gres/gpu=1,mem=100M,node=1                     00:00:06    COMPLETED   0:0
789079.exte+  extern      billing=4,cpu=4,gres/gpu=1,mem=100M,node=1           00:00:06    COMPLETED   0:0
789079.0      stats       cpu=4,gres/gpu=1,mem=0,node=1                        00:00:01    COMPLETED   0:0
The job completed successfully, which can be inferred from the above output. The AllocTRES field denotes the resources allocated to the job/step after the job started running. The gres/gpu=1,mem=100M,node=1 entry states that 1 GPU with 100M of memory was allocated on a single GPU node for the completed job.
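While the job is still pending or running (that is, before sacct reports it as COMPLETED), you can also watch it in the queue with squeue; 789079 is the job ID from this example.

# Show all of your pending and running jobs
squeue -u $USER

# Or look at this job only
squeue -j 789079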
Example 2
The example below shows how to add two arrays using CUDA.
- Log in to Discovery. Create a new folder in your home directory and switch to it.
- Create two files called script.sh and add.cu inside the folder. Then copy and paste the code below into them.

Run the following command to create the first file:

vi script.sh

Copy and paste the code below into the file:
#!/bin/bash

##Resource Request

#SBATCH --job-name CudaJob
#SBATCH --output result.out ## filename of the output; the %j is equal to jobID; default is slurm-[jobID].out

#SBATCH --ntasks=2 ## number of tasks (analyses) to run
#SBATCH --gpus-per-task=1 # number of gpus per task
#SBATCH --mem-per-gpu=100M # memory allocated per gpu
#SBATCH --constraint=mig # select node with Multi-Instance GPU
#SBATCH --partition=backfill ## the partitions to run in (comma separated)
#SBATCH --time=0-00:10:00 ## time for analysis (day-hour:min:sec)

##Load the modules
module load cuda

##Compile the cuda script
nvcc -o add add.cu

## Run the script
srun -n 1 add 1 &
srun -n 1 add 2 &
wait
In the above job script script.sh, --ntasks is set to 2 and 1 GPU was requested for each task. The partition is set to backfill, and 10 minutes of walltime and 100M of memory per GPU were requested.

Next, the CUDA module is loaded, which is required to compile and run the CUDA script add.cu. After loading the module, the CUDA program is compiled with the nvcc compiler, producing the executable add if there are no compilation errors.

In this example, the number of job steps is 2 and each srun command ends with an ampersand (&). This symbol is mandatory if the commands need to be executed at the same time. The wait command at the end is also required when running commands simultaneously; it ensures that the batch script does not exit, and thereby cancel the remaining job steps, before all of them have finished.

Also, the total number of tasks across the job steps (srun -n) must equal the --ntasks value in the resource request. Here, the -n flag with value 1 launches each job step with a single task.

Each srun command also contains the name of the CUDA executable, add in this case, followed by a command-line argument if the program requires one. In the first job step, the add executable is followed by the argument 1; in the second job step, the argument is 2.
Run the following command to create the second file:

vi add.cu

Copy and paste the code below into the file:
#include "cuda_runtime.h" #include "device_launch_parameters.h" #include <stdio.h> __global__ void addKernel(int* c, const int* a, const int* b, int size) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < size) { c[i] = a[i] + b[i]; } } // Helper function for using CUDA to add vectors in parallel. void addWithCuda(int* c, const int* a, const int* b, int size) { int* dev_a = NULL; int* dev_b = NULL; int* dev_c = NULL; // Allocate GPU buffers for three vectors (two input, one output) cudaMalloc((void**)&dev_c, size * sizeof(int)); cudaMalloc((void**)&dev_a, size * sizeof(int)); cudaMalloc((void**)&dev_b, size * sizeof(int)); // Copy input vectors from host memory to GPU buffers. cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice); cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice); // Launch a kernel on the GPU with one thread for each element. // 2 is number of computational blocks and (size + 1) / 2 is a number of threads in a block addKernel<<<2, (size + 1) / 2>>>(dev_c, dev_a, dev_b, size); // cudaDeviceSynchronize waits for the kernel to finish, and returns // any errors encountered during the launch. cudaDeviceSynchronize(); // Copy output vector from GPU buffer to host memory. cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost); cudaFree(dev_c); cudaFree(dev_a); cudaFree(dev_b); } int main(int argc, char** argv) { const int arraySize = 5; const int a[arraySize] = { 1, 2, 3, 4, 5 }; const int b[arraySize] = { 10, 20, 30, 40, 50 }; const int c[arraySize] = {11, 12, 13, 14, 15}; const int d[arraySize] = {100, 200, 300, 400, 500}; int result[arraySize] = { 0 }; int input = atoi(argv[1]); if (input == 1) { addWithCuda(result, a, b, arraySize); } else { addWithCuda(result, c, d, arraySize); } //Printing the output printf("Addition of two arrays = {%d, %d, %d, %d, %d}\n", result[0], result[1], result[2], result[3], result[4]); cudaDeviceReset(); return 0; }
The above CUDA program add.cu takes a command-line argument and, depending on its value, adds two arrays and prints the result. If the command-line argument is 1, it adds arrays a and b and displays the result. If the argument is any value other than 1, it adds arrays c and d and produces the output.
- Now make the job script executable by running the chmod +x command.

chmod +x script.sh
- Submit the job script using the following sbatch command.

sbatch script.sh
Submitted batch job 789162
- After the job completes and exits from the queue, a new output file called result.out is generated in your working directory, along with the executable add.

$ ls
script.sh  add  add.cu  result.out
- To view the output, type the following command from your working directory.

$ vi result.out
- Output
Addition of two arrays = {11, 22, 33, 44, 55}
Addition of two arrays = {111, 212, 313, 414, 515}
- Explanation

The job computes the sum of two arrays using a CUDA program and prints the output. To find the statistics of the job, run the following sacct command.

sacct -j 789162 --format=jobid,jobname,alloctres%50,elapsed,state,exitcode
JobID         JobName     AllocTRES                                            Elapsed     State       ExitCode
------------  ----------  ---------------------------------------------------  ----------  ----------  --------
789162        AdditionC+  billing=8,cpu=8,gres/gpu=2,mem=200M,node=1           00:00:10    COMPLETED   0:0
789162.batch  batch       cpu=8,gres/gpu=2,mem=200M,node=1                     00:00:10    COMPLETED   0:0
789162.exte+  extern      billing=8,cpu=8,gres/gpu=2,mem=200M,node=1           00:00:10    COMPLETED   0:0
789162.0      add         cpu=8,gres/gpu=1,mem=0,node=1                        00:00:01    COMPLETED   0:0
789162.1      add         cpu=8,gres/gpu=1,mem=0,node=1                        00:00:00    COMPLETED   0:0
The job completed successfully, which can be inferred from the State field in the above output. The gres/gpu=2,mem=200M,node=1 entry in the AllocTRES field states that 2 GPUs and 200M of memory were allocated on a single GPU node for the completed job.
Using srun
srun can be used to run jobs interactively on the GPU nodes. Consider the example below.
[<username>@discovery-l2]$ srun -n 1 -p backfill --mem-per-gpu=200M --ntasks=1 --gpus-per-task=1 --constraint=mig -t 01:00:00 --pty /bin/bash
Output
srun: job 789244 queued and waiting for resources
srun: job 789244 has been allocated resources
[username@discovery-g14]$
Explanation
The above srun command requests a login shell (/bin/bash) on a GPU node. Hence, the hostname changes from [username@discovery-l2 ~]$ to [username@discovery-g14 ~]$ once the requested resources become available. Also, discovery-g14 was allocated for the job because a node with the mig (Multi-Instance GPU) feature was requested using the --constraint flag.
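Once the interactive shell opens on the GPU node, you can confirm which GPU (or MIG slice) was allocated before starting your work, and release the allocation when you are done. A typical sequence:

# On the allocated GPU node
nvidia-smi        # shows the visible GPU / MIG device
module load cuda  # load CUDA, then compile and run interactively
exit              # end the shell and release the allocation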
References
For more details about GPU management using Slurm, refer to GPU Management in Slurm.