SLURM partition and QOS
UBELIX provides different CPU and GPU architectures. These are organized into partitions and can be further configured via “Quality of Service” (QoS) settings.
Partitions
There are currently 3 partitions:
Partition | job type | CPU / GPU | node / GPU memory | local Scratch |
---|---|---|---|---|
epyc2 (default) | single and multi-core | AMD Epyc2 2x64 cores | 1TB | 1TB |
bdw | full nodes only (x*20cores) | Intel Broadwell 2x10 cores | 156GB | 1TB |
gpu | GPU (8 GPUs per node, varying CPUs) | Nvidia GTX 1080 Ti / RTX 3090 / RTX 4090 / Tesla P100 / A100 | 11GB / 24GB / 24GB / 12GB / 80GB | 800GB / 1.92TB / 1.92TB / 800GB / 1.92TB |
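A partition is selected at submission time with the `--partition` option (or an `#SBATCH` directive in the job script). The following is a minimal sketch of a job script header; the resource values are illustrative only and not recommendations:

```bash
#!/bin/bash
# Minimal sketch: single-core job on the default epyc2 partition.
# Resource values below are illustrative, not recommendations.
#SBATCH --partition=epyc2
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G
#SBATCH --time=01:00:00
# For the gpu partition, GPUs are additionally requested with --gres=gpu:<n>.

srun ./my_program   # hypothetical executable
```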
The current usage can be viewed on the UBELIX status page.
QoS
Within these partitions, QoS settings are used to apply different job limits. Each partition has a default QoS. The specific limits of each QoS can be viewed directly on the cluster:
```
sacctmgr show qos format=name%20,maxwall,maxsubmitpu,maxtrespu%80
```
Field | Description |
---|---|
Name | Name of the QoS. |
MaxWall | Maximum wall clock time each job is able to use. MaxWall format is <min> or <hr>:<min>:<sec> or <days>-<hr>:<min>:<sec>. |
MaxSubmitPU | Maximum number of jobs in a pending or running state at any time per user. |
MaxTRESPU | Maximum number of TRES (e.g. cpu, mem, nodes, GPUs) each user is able to use. You can see the list of available resources by running sacctmgr show tres. |
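A QoS and a wall time within its MaxWall can be requested explicitly in the job script. A minimal sketch, assuming a hypothetical QoS name job_cpu (replace it with a QoS your account is actually associated with):

```bash
#!/bin/bash
# Minimal sketch: request a specific QoS and a wall time below its MaxWall.
# "job_cpu" is a placeholder QoS name, not necessarily one that exists on UBELIX.
#SBATCH --qos=job_cpu
#SBATCH --time=0-04:00:00   # <days>-<hr>:<min>:<sec>, must not exceed MaxWall

srun ./my_program   # hypothetical executable
```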
Note that you might not have access to all of the QoS shown in the output. To see which QoS your account has valid associations for, you can use:
```
sacctmgr show assoc where user=$USER format=account,partition,qos%80
```
Investor QoS
Investors get pseudo-exclusive access to certain resources. Membership of these investor QoS is managed by the investor or their deputy. Membership changes need to be communicated to the HPC team.
As an example, members of a GPU investor group submit jobs with:
```
module load Workspace   # use the Workspace account
sbatch --partition=gpu-invest job.sh
```
Preemptable
The resources dedicated to investors can be used by non-investing users too. A certain number of CPUs/GPUs are “reserved” in the investor partitions, but while they are idle, jobs submitted with the QoS job_gpu_preemptable can run on these resources. Beware that preemptable jobs may be terminated by investor jobs at any time! Therefore, use the QoS job_gpu_preemptable only if your job supports checkpointing or restarts.
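A minimal sketch of a preemptable GPU job, assuming the gpu-invest partition from the example above and a program that can write and resume from its own checkpoints (the GPU request and the checkpoint/resume logic are illustrative):

```bash
#!/bin/bash
# Minimal sketch of a preemptable GPU job; the GPU request and the
# checkpoint/resume handling are illustrative only.
#SBATCH --partition=gpu-invest
#SBATCH --qos=job_gpu_preemptable
#SBATCH --gres=gpu:1
#SBATCH --requeue   # allow the job to be requeued after preemption

# Hypothetical: the application periodically writes checkpoints and resumes
# from the most recent one when the requeued job starts again.
srun ./my_program --resume-from checkpoint/   # hypothetical executable and flag
```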