
SLURM partitions and QoS

UBELIX provides different CPU and GPU architectures. These are organized into partitions and further differentiated by "Quality of Service" (QoS) settings.

Partitions

We are currently operating the following partitions:

| Partition | Job type | CPU / GPU | Node / GPU memory | Local scratch |
|---|---|---|---|---|
| epyc2 (default) | single and multi-core | AMD Epyc2 2x64 cores | 1TB | 1TB |
| | | AMD Epyc4 2x96 cores | 1.5TB | |
| bdw | full nodes only (n x 20 cores) | Intel Broadwell 2x10 cores | 156GB | 1TB |
| gpu | GPU (8 GPUs per node, varying CPUs) | Nvidia GTX 1080 Ti | 11GB | 800GB |
| | | Nvidia RTX 3090 | 24GB | 1.92TB |
| | | Nvidia RTX 4090 | 24GB | 1.92TB |
| | | Nvidia Tesla P100 | 12GB | 800GB |
| | | Nvidia A100 | 80GB | 1.92TB |
| | | Nvidia H100 | 96GB | 1.92TB |
| gpu-invest | GPU | see gpu partition | | |
| icpu-investor | single and multi-core | see epyc2 partition | | |

The current usage of the partitions is shown on the UBELIX status page.
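
A job is directed to a partition with the --partition option of sbatch. The following is a minimal sketch for a single-core job on the default epyc2 partition; the program name and resource values are illustrative placeholders, not recommendations:

```bash
#!/bin/bash
# Minimal sketch: single-core job on the epyc2 partition
#SBATCH --job-name=example-cpu      # arbitrary job name
#SBATCH --partition=epyc2           # partition from the table above (the default)
#SBATCH --ntasks=1                  # a single task/core
#SBATCH --mem-per-cpu=4G            # example memory request
#SBATCH --time=01:00:00             # example wall-clock limit

srun ./my_program                   # placeholder for your actual workload
```

The script is submitted with sbatch as usual.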

QoS

Within these partitions, QoS are used to apply different job limits. Each QoS has a specific purpose, e.g. allowing quick debug jobs to be scheduled faster than regular jobs.

The following QoS are defined on UBELIX:

| QoS | Time limit | Max jobs | Partitions | Description |
|---|---|---|---|---|
| job_cpu | 96 hours | 20000 | epyc2, bdw | This is the default CPU QoS. It is used for all general computing. |
| job_cpu_long | 16 days | 50 | epyc2, bdw | This CPU QoS is used for very long jobs. Note: checkpointing is recommended! |
| job_cpu_debug | 20 min | 1 | epyc2, bdw | This CPU QoS is used for quick debug jobs (max 10 cores). |
| job_gpu | 24 hours | | gpu | This is the default GPU QoS. It is used for general GPU computing. |
| job_gpu_debug | 20 min | 1 | gpu | This GPU QoS is used for quick debug jobs on GPUs (max 1 GPU). |
| job_gpu_preemptable | 24 hours | | gpu-invest | This GPU QoS is used to request idle investor GPU resources. See the note below for details! |
| job_gpu_investor | | | gpu-invest | These GPU QoS are used by investors to request their GPU resources. |
| job_interactive | 8 hours | 1 | all | This QoS is used for interactive CPU/GPU jobs (i.e. OnDemand). Jobs are assigned a higher priority to start quickly. |
| job_icpu-investor | | | icpu-investor | These CPU QoS are used by investors to request their CPU resources. |
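
To run a job under a non-default QoS, add the --qos option to the job script. The sketch below assumes a short debug run that stays within the job_cpu_debug limits listed above (max 10 cores, 20 minutes); the program name and values are placeholders:

```bash
#!/bin/bash
# Minimal sketch: quick debug job under the job_cpu_debug QoS
#SBATCH --partition=epyc2           # CPU partition
#SBATCH --qos=job_cpu_debug         # debug QoS: 1 job, max 10 cores, 20 min
#SBATCH --ntasks=4                  # stays below the 10-core limit
#SBATCH --time=00:15:00             # stays below the 20-minute limit

srun ./my_debug_run                 # placeholder for the test workload
```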

Some QoS have additional resource limits associated with them, e.g. the number of GPUs that can be requested per user. These limits can be viewed with the sqos command:

```
sqos -h
Usage: ./sqos [partition_name | qos_name]

If a partition name is given, it retrieves all QoS associated with that partition as per slurm.conf.
If a QoS name is given, it displays the details for that specific QoS.
Without arguments, the script shows all QoS for the current user.

Examples:
  sqos                   # Show all QoS for the current user
  sqos partition_name    # Show QoS for the specified partition
  sqos qos_name          # Show details for the specified QoS
```

Investor QoS

Investors get pseudo-exclusive access to their invested resources. Membership in these investor QoS is managed by the investor or their deputy. Membership changes need to be communicated to the HPC team. For details on investing in UBELIX, see Costs and Investments.
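
As a sketch, a member of an investor group combines the investor partition with the QoS of their investment; the QoS name job_gpu_mygroup below is purely hypothetical, use the name assigned to your investment:

```bash
#!/bin/bash
# Sketch: job on invested GPU resources
#SBATCH --partition=gpu-invest      # investor GPU partition
#SBATCH --qos=job_gpu_mygroup       # hypothetical investor QoS name; use your own
#SBATCH --gres=gpu:1                # one GPU (generic Slurm GRES request)
#SBATCH --time=12:00:00             # example wall-clock limit

srun ./my_gpu_program               # placeholder for your workload
```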

Preemptable QoS

The resources dedicated to investors can also be used by non-investing users. Idle investor resources can be used by jobs with the QoS job_gpu_preemptable. However, these preemptable jobs may be terminated by investor jobs at any time! If a job is terminated to free resources for an investor, the preemptable job is rescheduled in the queue. This makes the QoS job_gpu_preemptable especially suitable for jobs that support automatic checkpointing or restarts.
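
A minimal sketch of such a preemptable job, using standard Slurm options (--requeue asks Slurm to put the job back into the queue after preemption; checkpointing and resuming are the responsibility of the application, and the restart logic shown is a placeholder):

```bash
#!/bin/bash
# Sketch: preemptable GPU job on idle investor resources
#SBATCH --partition=gpu-invest      # investor GPU partition
#SBATCH --qos=job_gpu_preemptable   # may be preempted by investor jobs at any time
#SBATCH --gres=gpu:1                # one GPU (generic Slurm GRES request)
#SBATCH --time=24:00:00             # QoS time limit from the table above
#SBATCH --requeue                   # reschedule the job after preemption

# The application should write periodic checkpoints and resume from the
# latest one after a restart; the flag below is a placeholder.
srun ./my_training --resume-from latest.ckpt
```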