Array Jobs with Slurm
Description
You want to submit multiple jobs that are identical or differ only in some arguments. Instead of submitting N jobs independently, you can submit a single array job comprising N tasks.
Submitting an Array
To submit an array job, specify the number of tasks as a range of task ids using the --array option:
#SBATCH --array=n[,k[,...]][-m[:s]]
The task id range specified in the option argument may be a single number, a simple range of the form n-m, a range with a step size s, a comma separated list of values, or a combination thereof. The task ids will be exported to the job tasks via the environment variable SLURM_ARRAY_TASK_ID. Other variables available in the context of the job describing the task range are: SLURM_ARRAY_TASK_MAX, SLURM_ARRAY_TASK_MIN, SLURM_ARRAY_TASK_STEP.
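Inside a job script, the task id can be used to derive per-task file names. The following is an illustrative snippet, not a complete job script; outside of Slurm the variable is unset, so it falls back to 1 purely for demonstration (the input file naming scheme is a made-up example):

```shell
# Inside a running task, SLURM_ARRAY_TASK_ID holds this task's id.
# The :-1 default only matters when testing the snippet outside of Slurm.
task_id=${SLURM_ARRAY_TASK_ID:-1}
input="input_data_${task_id}.txt"
echo "task ${task_id} would read ${input}"
```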
Specifying --array=10 will not submit an array job with 10 tasks, but an array job with a single task whose task id is 10. To run an array job with multiple tasks, you must specify a range or a comma-separated list of task ids.
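For comparison, some valid --array specifications and the task ids they produce:

```shell
#SBATCH --array=10          # ONE task, with task id 10
#SBATCH --array=1-10        # ten tasks: 1,2,...,10
#SBATCH --array=0-20:4      # six tasks: 0,4,8,12,16,20 (step size 4)
#SBATCH --array=1,3,42      # three tasks: 1, 3 and 42
```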
Limit the Number of Concurrently Running Tasks
You may want to limit the number of concurrently running tasks if the tasks are very resource demanding and too many of them running concurrently would lower the overall performance of the cluster. To limit the number of tasks that are allowed to run concurrently, use a “%” separator:
#SBATCH --array=n[,k[,...]][-m[:s]]%<max_tasks>
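For example, to submit 1000 tasks of which at most 50 may run at the same time:

```shell
#SBATCH --array=1-1000%50   # 1000 tasks, at most 50 running at any one time
```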
Canceling Individual Tasks
You can cancel individual tasks of an array job by passing their task ids to the scancel command. First, list the tasks with squeue:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
79265_[49-99:2%20] test Simple H foo PD 0:00 1 (QOSMaxCpuPerUserLimit)
79265_41 test Simple H foo R 0:10 1 fnode03
79265_43 test Simple H foo R 0:10 1 fnode03
79265_45 test Simple H foo R 0:10 1 fnode03
79265_47 test Simple H foo R 0:10 1 fnode03
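Given the queue above, tasks can be cancelled individually, as a range, or all at once (job id 79265 is taken from the example output):

```shell
$ scancel 79265_41            # cancel a single task
$ scancel 79265_[45-47]       # cancel a range of tasks
$ scancel 79265               # cancel the entire array job
```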
Use the --array option of the squeue command to display one task per line:
$ squeue --array
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
79265_65 test Simple H foo PD 0:00 1 (QOSMaxCpuPerUserLimit)
79265_67 test Simple H foo PD 0:00 1 (QOSMaxCpuPerUserLimit)
79265_69 test Simple H foo PD 0:00 1 (QOSMaxCpuPerUserLimit)
79265_97 test Simple H foo PD 0:00 1 (QOSMaxCpuPerUserLimit)
79265_57 test Simple H foo R 0:47 1 fnode03
79265_59 test Simple H foo R 0:47 1 fnode03
79265_61 test Simple H foo R 0:47 1 fnode03
79265_63 test Simple H foo R 0:47 1 fnode03
Examples
Use case 1: 1000 computations, same resource requirements, different input/output arguments
Instead of submitting 1000 individual jobs, submit a single array job with 1000 tasks:
#!/bin/bash
#SBATCH --mail-type=NONE
#SBATCH --partition=all
#SBATCH --time=00:30:00 # Each task takes max 30 minutes
#SBATCH --mem-per-cpu=2G # Each task uses max 2G of memory
#SBATCH --array=1-1000 # Submit 1000 tasks with task ID 1,2,...,1000.
#SBATCH --output=/dev/null
#SBATCH --error=/dev/null
# The name of the input files must reflect the task ID!
./foo input_data_${SLURM_ARRAY_TASK_ID}.txt > output_${SLURM_ARRAY_TASK_ID}.txt
Task with ID 20 will run the program foo with the following arguments:
./foo input_data_20.txt > output_20.txt
Use case 2: Read arguments from file
Submit an array job with 1000 tasks. Each task executes the program foo with different arguments:
#!/bin/bash
#SBATCH --mail-type=NONE
#SBATCH --partition=all
#SBATCH --time=00:30:00 # Each task takes max 30 minutes
#SBATCH --mem-per-cpu=2G # Each task uses max 2G of memory
#SBATCH --array=1-1000%20 # Submit 1000 tasks with task ID 1,2,...,1000. Run max 20 tasks concurrently
#SBATCH --output=/dev/null
#SBATCH --error=/dev/null
param_store=$HOME/projects/example/args.txt # args.txt contains 1000 lines with 2 arguments per line.
# Line <i> contains arguments for run <i>
data_dir=$HOME/projects/example/input_data # Input files are named input_run_0001.txt,...input_run_1000.txt
result_dir=$HOME/projects/example/results
param_a=$(awk -v var=$SLURM_ARRAY_TASK_ID 'NR==var {print $1}' $param_store) # Get first argument
param_b=$(awk -v var=$SLURM_ARRAY_TASK_ID 'NR==var {print $2}' $param_store) # Get second argument
n=$(printf "%04d" $SLURM_ARRAY_TASK_ID) # Zero pad the task ID to match the numbering of the input files
./foo -c $param_a -p $param_b -i ${data_dir}/input_run_${n}.txt -o ${result_dir}/result_run_${n}.txt
exit
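The line-extraction logic above can be tried outside of Slurm. This sketch builds a small throwaway parameter file (path and values are made up for illustration) and pulls out the arguments for task 2:

```shell
# Build a toy parameter file: one line of arguments per task.
printf '0.1 10\n0.2 20\n0.3 30\n' > /tmp/args_demo.txt

SLURM_ARRAY_TASK_ID=2   # pretend we are task 2 of the array

# Same extraction as in the job script: awk prints the fields of line <task id>.
param_a=$(awk -v var=$SLURM_ARRAY_TASK_ID 'NR==var {print $1}' /tmp/args_demo.txt)
param_b=$(awk -v var=$SLURM_ARRAY_TASK_ID 'NR==var {print $2}' /tmp/args_demo.txt)
n=$(printf "%04d" $SLURM_ARRAY_TASK_ID)  # zero-padded id, e.g. 0002

echo "a=${param_a} b=${param_b} n=${n}"
```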