
Submitting Jobs

This page covers submitting Slurm batch jobs on UBELIX. If you are not already familiar with Slurm, you should read the Slurm quickstart guide which covers the basics. You can also refer to the Slurm documentation or manual pages, in particular the page about sbatch.

Resource Allocation

Every job submission starts with a resource allocation (nodes, cores, memory). An allocation is requested for a specific amount of time and can be created using the salloc or sbatch command. Whereas salloc and sbatch only create resource allocations, srun launches parallel tasks within such a resource allocation.

Performance considerations

It is crucial to specify a reasonably accurate runtime for your job. Requesting too little will result in the job being killed when it hits the time limit, while requesting too much has a negative impact on job start time and throughput: jobs with a shorter runtime have a greater chance of being backfilled and may therefore start earlier.

It is crucial to request the correct amount of memory for your job. Requesting too little memory will result in the job being killed, while requesting too much wastes resources that could otherwise be allocated to other jobs.

It is crucial to request the correct number of cores and tasks for your job, as this directly affects job performance. To do this efficiently you need to understand the characteristics of your application (see the sketch after this list):

  • Is it a serial application that can only use one core?
    Use --ntasks=1 / --cpus-per-task=1.
  • Is it able to run on a single machine using multiple cores?
    Use --ntasks=1 / --cpus-per-task=X.
  • Or does it support execution on multiple machines with MPI?
    Use --ntasks=X / --cpus-per-task=1.
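
As a minimal sketch, the three cases above translate into the following #SBATCH headers; pick the block that matches your application (the counts 8 and 16 are placeholders):

# Serial application: one task on one core
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

# Multi-threaded application on a single node: one task with several cores
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# MPI application, possibly spanning several nodes: several single-core tasks
#SBATCH --ntasks=16
#SBATCH --cpus-per-task=1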

Job arrays

Submit a series of similar jobs as an array job instead of submitting them one by one. This is crucial for backfilling performance and hence job throughput. See Array jobs.

Common Slurm options

Here is an overview of some of the most commonly used Slurm options.

Basic job specification

Option Description
--time Set a limit on the total run time of the job allocation. Format: dd-hh:mm:ss
--account Charge resources used by this job to specified project
--partition Request a specific partition for the resource allocation
--qos Specify “Quality of Service”. This can be used to change job limits, e.g. for long jobs or short jobs with large resources. See Partition/QoS page
--job-name Specify a job name. Example: --job-name="Simple Matlab"
--output Redirect standard output. All directories specified in the path must exist before the job starts! By default stderr and stdout are merged into a file slurm-%j.out, where %j is the job allocation number. Example: --output=myCal_%j.out
--error Redirect standard error. All directories specified in the path must exist before the job starts! By default stderr and stdout are merged into a file slurm-%j.out, where %j is the job allocation number. Example: --error=myCal_%j.err
--mail-user Mail address to contact the job owner. Must be a valid unibe email address, if used! Example: --mail-user=foo.bar@unibe.ch
--mail-type When to notify the job owner: none, all, begin, end, fail, requeue, array_tasks. Example: --mail-type=end,fail
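
Combining several of these options, a job script header might start like the following minimal sketch (the account name and output file name are placeholders):

#SBATCH --job-name="Simple Matlab"
#SBATCH --time=1-00:00:00             # 1 day
#SBATCH --account=my_project          # placeholder project name
#SBATCH --output=myCal_%j.out         # %j is replaced by the job ID
#SBATCH --mail-user=foo.bar@unibe.ch
#SBATCH --mail-type=end,fail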

Request CPU cores

Option Description
--cpus-per-task Set the number of cores per task

Request memory

Option Description
--mem Set the memory per node. Note: Try to use --mem-per-cpu or --mem-per-gpu instead.
--mem-per-cpu Set the memory per allocated CPU core. Example: --mem-per-cpu=2G
--mem-per-gpu Set the memory per allocated GPU. Example: --mem-per-gpu=2G
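
Since per-CPU memory is multiplied by the number of allocated cores, the total job memory follows from both options; a small sketch:

#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=2G    # 8 cores x 2 GB = 16 GB for the job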

Request GPUs

Option Description
--gpus Set the total number of GPUs to be allocated for the job
--gpus-per-node Set the number of GPUs per node
--gpus-per-task Set the number of GPUs per task

For details on how to request GPU resources on UBELIX, please see the GPUs page.

Specify tasks distribution (MPI)

Option Description
--nodes Number of nodes to be allocated to the job
--ntasks Set the maximum number of tasks (MPI ranks)
--ntasks-per-node Set the number of tasks per node
--ntasks-per-socket Set the number of tasks per socket
--ntasks-per-core Set the maximum number of tasks per core
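
As a sketch, the following header distributes 8 MPI ranks evenly over two nodes; the counts are placeholders and must match your application and the nodes' core counts:

#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4    # 2 nodes x 4 tasks = 8 MPI ranks in total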

sbatch

The sbatch command is used to submit a job script for later execution. It is the most common way to submit a job to the cluster because the script can be reused. Slurm options are usually embedded in the job script as #SBATCH directives. Slurm options specified on the command line override the corresponding options embedded in the job script.

Syntax

sbatch [options] script [args...]

Job Script

Sneak Peek: A simple Python example

Create a submission script python_job.sh allocating 8 CPUs and 8 GB of memory for 1 hour:

#!/bin/bash
#SBATCH --job-name="Simple Python example"
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=1G
#SBATCH --cpus-per-task=8

# Your code below this line
module load Anaconda3
eval "$(conda shell.bash hook)"
python3 script.py

Submit the job script:

sbatch python_job.sh
Submitted batch job 30215045

See below for more details and background information.

A batch script consists of the following parts:

  • the interpreter to use for the execution of the script: bash
  • directives that define the job options: resources, run time, …
  • setting up the environment: prepare input, environment variables, …
  • run the application

The job script acts as a wrapper for your actual job. The first line is generally #!/bin/bash and specifies that the script should be interpreted as a bash script.

The lines starting with #SBATCH are directives for the workload manager. These have the general syntax

#SBATCH --option_name=argument

The available options are shown above and are the same as the ones you use on the command line: sbatch --time=01:00:00 on the command line and #SBATCH --time=01:00:00 in a batch script are equivalent. The command line value takes precedence if the same option is present both on the command line and as a directive in the script.
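
For example, resubmitting the Python script from above with a longer time limit on the command line overrides the #SBATCH --time=01:00:00 directive inside python_job.sh:

sbatch --time=02:00:00 python_job.sh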

salloc

The salloc command is used to allocate resources (e.g. nodes), possibly with a set of constraints (e.g. number of processors per node), for later utilization. It is typically used to allocate resources and spawn a shell, in which the srun command is then used to launch parallel tasks.

Syntax

salloc [options] [<command> [args...]]

Example

bash$ salloc -N 2 -t 10
salloc: Granted job allocation 247
bash$ module load foss
bash$ srun ./mpi_hello_world
Hello, World.  I am 1 of 2 running on knlnode03.ubelix.unibe.ch
Hello, World.  I am 0 of 2 running on knlnode02.ubelix.unibe.ch
bash$ exit
salloc: Relinquishing job allocation 247

srun

The srun command creates job steps. One or multiple srun invocations are usually used from within an existing resource allocation. Thereby, a job step can utilize all resources allocated to the job, or utilize only a subset of the resource allocation. Multiple job steps can run sequentially in the order defined in the batch script or run in parallel, but can together never utilize more resources than provided by the allocation.

Warning

Do not submit a job script using srun directly. Always create an allocation with salloc or embed it in a script submitted with sbatch.

Syntax

srun [options] executable [args...]

Use srun in your job script to launch executables that are:

  • MPI applications
  • multiple job tasks (serial or parallel) running simultaneously within an allocation

Example: Run an MPI task:

#!/bin/bash
#SBATCH --job-name="Open MPI example"
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=20
#SBATCH --mem-per-cpu=2G
#SBATCH --time=06:00:00

# Your code below this line
module load foss
srun ./mpi_app.exe

Run two jobs simultaneously:

#!/bin/bash
#SBATCH --job-name="Simultaneous example"
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=4

# Your code below this line
# run 2 threaded applications side-by-side
srun --ntasks=1 --cpus-per-task=4 ./app1 inp1.dat &
srun --ntasks=1 --cpus-per-task=4 ./app2 inp2.dat &
wait
# wait: wait for both background commands to finish. This is important when running commands in the background (using &); otherwise the job script, and with it the job, ends immediately.

Please run a series of similar tasks as a job array. See Array Jobs.
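
A minimal array job sketch, assuming input files named input_1.dat through input_10.dat (see Array Jobs for details):

#!/bin/bash
#SBATCH --job-name="Array example"
#SBATCH --time=00:30:00
#SBATCH --array=1-10               # 10 array tasks with indices 1..10

# Your code below this line
# Each array task processes the input file matching its index
./my_app input_${SLURM_ARRAY_TASK_ID}.dat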

Requesting a Partition / QoS

By default, jobs are submitted to the epyc2 partition with the default QoS job_cpu. The --partition option can be used to request different hardware, e.g. the gpu partition, and the --qos option can be used to run in a specific queue, e.g. job_gpu_debug:

#SBATCH --partition=gpu
#SBATCH --qos=job_gpu_debug

See Partitions / QoS for a list of available partitions and QoS and their specifications.

Job Examples

Sequential Job

Running a serial job with email notification on completion or failure (1 task is the default):

#!/bin/bash
#SBATCH --mail-user=foo.bar@unibe.ch
#SBATCH --mail-type=end,fail
#SBATCH --job-name="Serial Job"
#SBATCH --time=00:10:00

# Your code below this line
echo "I'm on host: $HOSTNAME"

Parallel Jobs

Shared Memory Jobs (e.g. OpenMP)

SMP parallelization is based on dynamically created threads (fork and join) that share memory on a single node. The key request is --cpus-per-task: to run N threads in parallel, request N CPUs on the node (--cpus-per-task=N).

#!/bin/bash
#SBATCH --mail-user=foo.bar@unibe.ch
#SBATCH --mail-type=end,fail
#SBATCH --job-name="SMP Job"
#SBATCH --mem-per-cpu=2G
#SBATCH --cpus-per-task=16
#SBATCH --time=01:00:00

# Your code below this line
srun ./my_binary
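
If the binary is an OpenMP program, it usually reads the thread count from the OMP_NUM_THREADS environment variable; a minimal sketch, to be placed before the srun line in the example above:

# Run as many OpenMP threads as CPUs allocated per task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}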

MPI Jobs (e.g. Open MPI)

MPI parallelization is based on processes (local or distributed) that communicate by passing messages. Since they do not rely on shared memory, these processes can be distributed over several compute nodes. Use the --ntasks option to request a certain number of tasks (processes) that can be distributed over multiple nodes:

#!/bin/bash
#SBATCH --mail-user=foo.bar@unibe.ch
#SBATCH --mail-type=end
#SBATCH --job-name="MPI Job"
#SBATCH --mem-per-cpu=2G
#SBATCH --ntasks=8
#SBATCH --time=04:00:00

# Your code below this line
# First set the environment for using Open MPI
module load foss
srun ./my_binary

On the 'bdw' partition you must use all CPUs provided by a node (20 CPUs). For example, to run an Open MPI job on 80 CPUs:

#!/bin/bash
#SBATCH --mail-user=foo.bar@baz.unibe.ch
#SBATCH --mail-type=end,fail
#SBATCH --job-name="MPI Job"
#SBATCH --mem-per-cpu=2G
#SBATCH --nodes=4     ## or --ntasks=80
#SBATCH --ntasks-per-node=20
#SBATCH --time=12:00:00

# Your code below this line
module load foss
srun ./my_binary

GPU Jobs

For information on how to run GPU jobs on UBELIX, please see the GPUs page.

Automatic requeuing

The UBELIX Slurm configuration has automatic requeuing of jobs upon node failure enabled. This means that if a node fails, your job is automatically resubmitted to the queue with the same job ID, which may truncate the previous output. Here are some important parameters you can use to alter the default behavior.

  • you can disable automatic requeuing using the --no-requeue option
  • you can avoid your output file being truncated in case of requeuing by using the --open-mode=append option

If you want to perform specific operations in your batch script when a job has been requeued, you can check the value of the SLURM_RESTART_COUNT variable. Its value is 0 the first time the job runs; after a restart, it equals the number of times the job has been restarted.
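
A minimal sketch of such a check inside a batch script (the recovery step is application specific and only indicated by a comment):

#!/bin/bash
#SBATCH --open-mode=append    # keep previous output instead of truncating it on requeue

# Your code below this line
if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]; then
    echo "This job has been requeued ${SLURM_RESTART_COUNT} time(s)"
    # restore application state here, e.g. from a checkpoint file
fi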

Common error messages

Below are some common error messages you may get when your job submission fails.

Job violates accounting/QOS policy

The complete error message is:

sbatch: error: AssocMaxSubmitJobLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

The most common causes are:

  • your project has already used all of its allocated compute resources.
  • your job script is missing the --account parameter.
  • your project has exceeded the limit for the number of simultaneous jobs, either running or queuing. Note that Slurm counts each job within an array job as a separate job.

Environment Variables

Slurm sets various environment variables available in the context of the job script. Some are set based on the requested resources for the job.

Environment Variable Set By Option Description
SLURM_JOB_NAME --job-name Name of the job
SLURM_ARRAY_JOB_ID --array ID of the job array's master job
SLURM_ARRAY_TASK_ID --array ID of the current array task
SLURM_ARRAY_TASK_MAX --array Job array’s maximum ID (index) number
SLURM_ARRAY_TASK_MIN --array Job array’s minimum ID (index) number
SLURM_ARRAY_TASK_STEP --array Job array’s index step size
SLURM_NTASKS --ntasks Same as -n, --ntasks
SLURM_NTASKS_PER_NODE --ntasks-per-node Number of tasks requested per node. Only set if the --ntasks-per-node option is specified
SLURM_CPUS_PER_TASK --cpus-per-task Number of CPUs requested per task. Only set if the --cpus-per-task option is specified
TMPDIR References the disk space for the job on the local scratch

For the full list, see man sbatch
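
As an illustration, a minimal sketch that uses some of these variables inside a job script (the input file and binary names are placeholders):

#!/bin/bash
#SBATCH --job-name="Env var example"
#SBATCH --cpus-per-task=4
#SBATCH --time=00:10:00

# Your code below this line
echo "Job ${SLURM_JOB_NAME} runs with ${SLURM_CPUS_PER_TASK} CPUs per task"
# Stage the input data on the job's local scratch space
cp ~/input.dat ${TMPDIR}/
./my_binary ${TMPDIR}/input.dat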