
SLURM partition and QOS

UBELIX provides different CPU and GPU architectures. These are generally structured in partitions and further tunable by “Quality of Service” (QoS).

Partitions

There are currently three partitions:

Partition         Job type                        CPU / GPU                      Node / GPU memory   Local scratch
epyc2 (default)   single and multi-core           AMD Epyc2, 2x64 cores          1TB                 1TB
bdw               full nodes only (x*20 cores)    Intel Broadwell, 2x10 cores    156GB               1TB
gpu               GPU (8 GPUs per node,           Nvidia GTX 1080 Ti             11GB                800GB
                  varying CPUs)                   Nvidia RTX 3090                24GB                1.92TB
                                                  Nvidia RTX 4090                24GB                1.92TB
                                                  Nvidia Tesla P100              12GB                800GB
                                                  Nvidia A100                    80GB                1.92TB
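As a sketch, a minimal job script for the default epyc2 partition could look like the following; the resource values and the program name are placeholders and need to be adapted to your job:

#!/bin/bash
#SBATCH --partition=epyc2        # default partition; bdw or gpu can be requested instead
#SBATCH --cpus-per-task=4        # illustrative value
#SBATCH --mem-per-cpu=2G         # illustrative value
#SBATCH --time=01:00:00          # illustrative value

./my_program                     # placeholder for your application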

The current usage can be viewed on the UBELIX status page.

QoS

Within these partitions, QoS are used to distinguish different job limits. In each partition there is a default QoS. Each QoS has specific limits that can be viewed directly on the cluster:

sacctmgr show qos format=name%20,maxwall,maxsubmitpu,maxtrespu%80
Field         Description
Name          Name of the QoS.
MaxWall       Maximum wall clock time each job is able to use. The format is <min>, <hr>:<min>:<sec>, or <days>-<hr>:<min>:<sec>.
MaxSubmitPU   Maximum number of jobs in a pending or running state at any time per user.
MaxTRESPU     Maximum number of TRES (e.g. cpu, mem, nodes, GPUs) each user is able to use. You can see the list of available resources by running sacctmgr show tres.
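If you are only interested in a single QoS, the same query can be restricted by name; the QoS name below is only an example:

sacctmgr show qos where name=job_cpu format=name%20,maxwall,maxsubmitpu,maxtrespu%80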

Note that you might not have access to all of the QoS shown in the output. To see which QoS your account has valid associations for, you can use:

sacctmgr show assoc where user=$USER format=account,partition,qos%80
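A job can then explicitly request one of the QoS you are associated with at submission time; the QoS name used here is only a placeholder:

sbatch --partition=epyc2 --qos=job_cpu job.sh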

Investor QoS

Investors get pseudo-exclusive access to certain resources. Membership in these investor QoS is managed by the investor or their deputy. Membership changes need to be communicated to the HPC team.

As an example, the members of a GPU investor group submit jobs with:

module load Workspace         # use the Workspace account
sbatch --partition=gpu-invest job.sh

Preemptable

The resources dedicated to investors can also be used by non-investing users. A certain number of CPUs/GPUs is reserved in the investor partitions, but when these are idle, jobs submitted with the QoS job_gpu_preemptable can run on them. Beware that preemptable jobs may be terminated by investor jobs at any time, so use the QoS job_gpu_preemptable only if your job supports checkpointing or restarts.
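A sketch of a preemptable GPU job could look like the following; the partition choice, resource values, and the checkpoint/restart logic are assumptions and depend on your application:

#!/bin/bash
#SBATCH --partition=gpu-invest   # assumed partition for preemptable GPU jobs
#SBATCH --qos=job_gpu_preemptable
#SBATCH --gres=gpu:1             # request one GPU
#SBATCH --time=12:00:00          # illustrative value

# Resume from the latest checkpoint if one exists (application-specific logic).
if [ -f checkpoint.dat ]; then
    ./my_program --resume checkpoint.dat   # hypothetical program and resume flag
else
    ./my_program
fi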