
GPUs

This page contains all the information you need to successfully submit GPU jobs on UBELIX. When submitting to a GPU partition, specifying the GPU type is required.

Applications only run on GPUs if they are built specifically with GPU support, e.g. CUDA. Please ensure that your application supports GPUs before submitting to the GPU partitions.

GPU Types

UBELIX currently features various types of GPUs. You have to choose an architecture and use one of the following --gres options to select it.

Type                         SLURM gres option
Nvidia GeForce GTX 1080 Ti   --gres=gpu:gtx1080ti:<number_of_gpus>
Nvidia GeForce RTX 3090      --gres=gpu:rtx3090:<number_of_gpus>
Nvidia GeForce RTX 4090      --gres=gpu:rtx4090:<number_of_gpus>
Nvidia Tesla P100            --gres=gpu:teslap100:<number_of_gpus>
Nvidia A100                  --gres=gpu:a100:<number_of_gpus>
Nvidia H100                  --gres=gpu:h100:<number_of_gpus>

Alternatively, you may use the --gpus, --gpus-per-node and --gpus-per-task options. Note that the GPU type still needs to be specified as shown above.
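
For example, with these alternative options the GPU type is prefixed to the count (a minimal sketch; the type and count are placeholders to adapt):

#SBATCH --gpus=rtx3090:2
## or, for a single-node job, equivalently:
#SBATCH --gpus-per-node=rtx3090:2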

For details on the memory available on the different types of GPU, please see our GPU Hardware page.

Job Submission

GPU jobs must be submitted to the gpu or gpu-invest partitions.

#SBATCH --partition=gpu #or gpu-invest
#SBATCH --gres=gpu:<type>:<number_of_gpus>
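
Putting this together, a minimal GPU job script could look as follows (a sketch only; the job name, time limit, GPU type and application command are placeholders you need to adapt):

#!/bin/bash
#SBATCH --job-name=gpu-example
#SBATCH --partition=gpu
#SBATCH --gres=gpu:rtx3090:1
#SBATCH --time=01:00:00

## Load the CUDA toolkit provided as an environment module
module load CUDA/12.2.0

## Replace with your own GPU-enabled application
./my_gpu_application

Submit the script with sbatch as usual, e.g. sbatch job.sh.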

Requesting CPU and memory resources with GPUs

To ensure fair GPU allocation, the CPU and memory resources that can be requested per GPU are restricted.

In the past, we observed that GPU resources were often left unused because some jobs requested disproportionately large amounts of CPU or memory per GPU. To address this issue, the following limits apply per GPU:

Type                 CPUs per GPU   Memory per GPU
Nvidia GTX 1080 Ti   3              30 GB
Nvidia RTX 3090      4              60 GB
Nvidia RTX 4090      16             90 GB
Nvidia P100          3              30 GB
Nvidia A100          20             80 GB
Nvidia H100          16             90 GB

If you submit a GPU job that requests more CPU or memory per GPU than listed above, your job will be rejected. If your job requires more CPU and memory resources, you may choose to allocate additional GPUs, even if these additional GPUs remain unused by your application.
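
For example, a job that needs 8 CPUs and 100GB of memory together with RTX 3090 GPUs stays within the limits by requesting two GPUs (4 CPUs and 60 GB allowed per GPU); the values below are illustrative only:

#SBATCH --partition=gpu
#SBATCH --gres=gpu:rtx3090:2
#SBATCH --cpus-per-task=8
#SBATCH --mem=100G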

QoS job_gpu_preemptable

For investors we provide the gpu-invest partition with a specific QoS per investor that guarantees instant access to the purchased resources. Nevertheless, to use all resources efficiently, the QoS job_gpu_preemptable exists in the gpu partition. Jobs submitted with this QoS have access to all GPU resources, but may be interrupted if the resources are required for investor jobs. Short jobs and jobs that make use of checkpointing benefit most from these additional resources.

Example: Requesting any four RTX 3090 GPUs from the resource pool in the gpu-invest partition:

#SBATCH --partition=gpu-invest
#SBATCH --qos=job_gpu_preemptable
#SBATCH --gres=gpu:rtx3090:4

## By default, jobs that are preempted are requeued automatically.
## If this is undesirable, use the following option so that a preempted job
## is canceled instead of being requeued:
#SBATCH --no-requeue

CUDA

We provide compilers and libraries to build CUDA-based applications. These are accessible using environment modules. Use module spider CUDA to see which versions are available:

module spider CUDA
------------------------------------------------------------------------------------------------------------------------------------
  CUDA:
------------------------------------------------------------------------------------------------------------------------------------
    Description:
      CUDA (formerly Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA
      and implemented by the graphics processing units (GPUs) that they produce. CUDA gives developers access to the virtual
      instruction set and memory of the parallel computational elements in CUDA GPUs.

     Versions:
        CUDA/11.8.0
        CUDA/12.1.1
        CUDA/12.2.0

Run module load <module> to load a specific version of CUDA:

module load CUDA/12.2.0
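
With the module loaded, the nvcc compiler is available. As a minimal sketch (the source file name is a placeholder), a CUDA source file can then be compiled like this:

nvcc -O2 -o my_gpu_application my_gpu_application.cu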

cuDNN

If you need cuDNN, you only need to load the cuDNN module. The appropriate CUDA version is then loaded automatically as a dependency.
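
For example (replace the placeholder with an actual version listed by module spider):

module spider cuDNN          # list the installed cuDNN versions
module load cuDNN/<version>  # also loads the matching CUDA module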

GPU Usage Monitoring

To check the usage of one or more GPUs, the nvidia-smi tool can be used. The tool needs to run on the node the job is running on. Once the job has started, a new job step can be created using srun that calls nvidia-smi to display the resource utilization. Here we attach to a job with the jobID 123456. Replace this jobID with the one reported in the sbatch output when you submitted your job.

$ sbatch job.sh
Submitted batch job 123456
$ squeue --me
# verify that the job has started
$ srun --overlap --jobid 123456 nvidia-smi
Fri Nov 11 11:11:11 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:04:00.0 Off |                  N/A |
| 23%   25C    P8     8W / 250W |      1MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:08:00.0 Off |                  N/A |
| 23%   24C    P8     8W / 250W |      1MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

This displays the GPU core utilization and memory usage of all GPU cards belonging to that job.

Note that this is a one-off snapshot of the usage, and the nvidia-smi command runs within your allocation. The resources required for this job step are minimal and should not noticeably influence your job's performance.
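
If you prefer to observe the usage over time instead of a single snapshot, nvidia-smi can, for instance, be told to refresh periodically (here every 5 seconds; stop it with Ctrl-C):

$ srun --overlap --jobid 123456 nvidia-smi -l 5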

Further Information

CUDA: https://developer.nvidia.com/cuda-zone
CUDA C/C++ Basics: http://www.nvidia.com/docs/IO/116711/sc11-cuda-c-basics.pdf