Scratch

Scratch - temporary file space

Description

Scratch file spaces are meant for temporary data storage. Interim computational data should be stored there. We distinguish between local and network scratch spaces.

Network Scratch

Network scratch spaces are located on the parallel file system and are accessible from all nodes. In contrast to HOME or WORKSPACEs, scratch is meant for temporary data, especially data with larger quota requirements. Jobs that create a lot of temporary data, which may or may not need post-processing, should run in this space. An example is an application that creates a huge amount of temporary output which needs to be analysed, but only part of which needs to be stored long term. Size and file-count quotas are less restrictive on scratch than on HOME or permanent WORKSPACE directories: every user can use up to 30 TB and 10M files. There is no snapshot or backup feature. Furthermore, an automatic deletion policy is planned that will delete files older than 30 days.

Scratch file space can be accessed using the Workspace module and the $SCRATCH environment variable.

module load Workspace
cd $SCRATCH

For personal Scratch, see below.
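
Because files older than 30 days may be removed once the planned deletion policy is active, it can help to check which files would be affected and how many files a directory holds. The following is a minimal sketch, not an official tool: the function names `list_old_files` and `count_files` are hypothetical, GNU `find` is assumed, and it is an assumption that the policy will be based on modification time.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the Workspace module.

# List regular files under a directory that were last modified more than N days ago.
# Assumption: the planned deletion policy looks at modification time.
# Note: find rounds file ages down to whole days, so "+30" matches
# files that are at least 31 days old.
list_old_files() {
  local dir="$1" days="${2:-30}"
  find "$dir" -type f -mtime +"$days" -print
}

# Count regular files under a directory, to compare against the 10M file quota.
count_files() {
  local dir="$1"
  find "$dir" -type f | wc -l
}

# Example: inspect $SCRATCH, but only if the variable is set and the directory exists.
if [ -n "${SCRATCH:-}" ] && [ -d "${SCRATCH}" ]; then
  list_old_files "${SCRATCH}" 30
  count_files "${SCRATCH}"
fi
```

On large directory trees these commands walk every file on the parallel file system, so they are best run occasionally rather than inside every job.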

Workspace Scratch

Each Workspace has a $SCRATCH space with the same access permissions as the permanent Workspace directory (using primary and secondary groups). This scratch space can be accessed using the $SCRATCH variable (after loading the Workspace module), which points to /storage/scratch/<researchGroupID>/<WorkspaceID>. Please use $SCRATCH rather than the explicit path.

Personal Scratch

Users without a Workspace can also use “personal” Scratch. This space needs to be created before first use:

module load Workspace_Home
mkdir $SCRATCH
cd $SCRATCH

Please note that this space is not private by default. If you want to restrict access, you can change the permissions using:

chmod 700 $SCRATCH
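
Creating the directory and restricting its permissions can also be combined into one step that is safe to repeat. The following is a minimal sketch, assuming GNU coreutils (`stat -c`); the helper name `ensure_private_dir` is hypothetical, and the demonstration uses a throwaway directory rather than a real scratch space.

```shell
#!/bin/bash
# Hypothetical helper: create a directory if needed and restrict it to its owner.
ensure_private_dir() {
  local dir="$1"
  # -m 700 sets the mode of the final directory if it is newly created
  mkdir -p -m 700 "$dir" || return 1
  # chmod covers the case where the directory already existed
  chmod 700 "$dir" || return 1
  # print the resulting octal permissions so they can be verified
  stat -c '%a' "$dir"
}

# Demonstration on a throwaway directory
demo_dir="$(mktemp -d)/scratch_demo"
ensure_private_dir "${demo_dir}"   # prints 700
rm -rf "$(dirname "${demo_dir}")"
```

For personal Scratch the call would be `ensure_private_dir "$SCRATCH"`. It should not be applied to a Workspace scratch directory, which is meant to be shared with the group.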

Local Scratch

Local storage ($TMPDIR) should be used instead of network storage in the following cases:

  • temporary files are produced that are not needed after the computation
  • files need to be read or written multiple times within a job

$TMPDIR is node-local storage that exists only during the job's lifetime and is cleaned automatically afterwards. The actual directory is /scratch/local/<jobID>, but it is highly recommended to use $TMPDIR. If necessary, data can be copied there at the beginning of the job, processed (multiple times), and the necessary results copied back at the end.

$TMPDIR instead of /tmp

$TMPDIR is much larger than /tmp and is cleaned automatically. Data in /tmp, by contrast, persists, especially after job errors, and clogs the node's memory.
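
Scripts that create their own temporary files can be pointed at $TMPDIR explicitly instead of relying on /tmp. The following is a minimal sketch, assuming GNU `mktemp`; the helper name `make_job_tmp` and the `myjob` prefix are hypothetical, and the fallback to /tmp exists only so that the snippet also runs outside a job.

```shell
#!/bin/bash
# Hypothetical helper: create a uniquely named temporary directory under $TMPDIR.
make_job_tmp() {
  # Assumption: fall back to /tmp only when TMPDIR is unset (e.g. outside a job)
  local base="${TMPDIR:-/tmp}"
  mktemp -d "${base}/myjob.XXXXXX"
}

tmp_work="$(make_job_tmp)" || exit 1
echo "temporary directory: ${tmp_work}"
# ... write intermediate files to "${tmp_work}" here ...
rm -rf "${tmp_work}"   # remove the directory again when done
```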

Example: temporary files

In the following example, example.exe needs a place to store temporary/intermediate files that are not needed after the computation. The location is passed using the --builddir option, here set to the local scratch ($TMPDIR).

#!/bin/bash
#SBATCH --job-name tmpdir
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1

srun example.exe --builddir=$TMPDIR input.dat

If you want the advantage of the low-latency local file system but need to keep files, you can still use $TMPDIR and copy the files to network storage (e.g. $WORKSPACE or $HOME) at the end of your job. This is only efficient if a) more files are manipulated locally (in $TMPDIR) than are copied to network storage, or b) files are manipulated multiple times before being copied to network storage.

Example: including data movement

In the following example script, all files from the submit directory are copied to the head compute node. At the end of the job, all files in the compute node's local directory are copied back. The compute node's local $TMPDIR is used, which points to /scratch/local/<jobid>, a job-specific directory on the node's internal disk.

#!/bin/bash
#SBATCH --job-name tmpdir
#SBATCH --nodes 1
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 1
###SBATCH --output slurm.out # when specifying output file name, add rm slurm.out in cleanup function

# I. Define directory names [DO NOT CHANGE]
# =========================================
# get the name of the temporary working directory, physically on the compute node
# (abort if TMPDIR is unset or empty, so the rm -rf commands below cannot act on "/")
workdir="${TMPDIR:?TMPDIR is not set}"
# get submit directory
# (every file/folder below this directory is copied to the compute node)
submitdir="${SLURM_SUBMIT_DIR}"

# 1. Transfer to node [DO NOT CHANGE]
# ===================================
# create/empty the temporary directory on the compute node
if [ ! -d "${workdir}" ]; then
  mkdir -p "${workdir}"
else
  rm -rf "${workdir}"/*
fi

# change current directory to the location of the sbatch command
# ("submitdir" is somewhere in the home directory on the head node)
cd "${submitdir}"
# copy all files/folders in "submitdir" to "workdir"
# ("workdir" == temporary directory on the compute node)
cp -prf * "${workdir}"
# change directory to the temporary directory on the compute node
cd "${workdir}"

# 3. Function to transfer back to the head node [DO NOT CHANGE]
# =============================================================
# define clean-up function
function clean_up {
  # - remove the log file on the compute node, to prevent overwriting the actual output with an empty file
  rm -f "slurm-${SLURM_JOB_ID}.out"
  # - TODO delete temporary files from the compute-node, before copying. Prevent copying unnecessary files.
  # rm -r ...
  # - change directory to the location of the sbatch command (on the head node)
  cd "${submitdir}"
  # - copy everything from the temporary directory on the compute-node
  cp -prf "${workdir}"/* .
  # - erase the temporary directory from the compute-node
  rm -rf "${workdir}"/*
  rm -rf "${workdir}"
  # - exit the script
  exit
}

# call the "clean_up" function when this script exits; it is run even if SLURM cancels the job
trap 'clean_up' EXIT

# 2. Execute [MODIFY COMPLETELY TO YOUR NEEDS]
# ============================================
# TODO add your computation here
# simple example, hello world
srun echo "hello world from $HOSTNAME"

Further aspects to consider:

  • copy only necessary files
    • have only necessary files in the submit directory
    • remove all unnecessary files before copying the data back, e.g. remove large input files
  • in case of a parallel job, verify that all processes run on a single node (--nodes=1) OR copy the data to all related nodes (e.g. srun -n1 cp ...).
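
One way to avoid copying unchanged input files back is to copy only the files that were modified after the computation started. The following sketch is an alternative to deleting files by hand and is not part of the example script above. The function name `copy_back_newer` and the marker file are hypothetical, and GNU `find` and `cp --parents` are assumed.

```shell
#!/bin/bash
# Hypothetical helper: copy files under src that are newer than a marker file
# to dest, preserving the relative directory structure.
copy_back_newer() {
  local src="$1" dest="$2" marker="$3"
  mkdir -p "$dest" || return 1
  # resolve to absolute paths, because the copy runs from inside src
  dest="$(cd "$dest" && pwd)" || return 1
  marker="$(cd "$(dirname "$marker")" && pwd)/$(basename "$marker")" || return 1
  (
    cd "$src" || exit 1
    # -newer compares modification times; --parents recreates sub-directories in dest
    find . -type f -newer "$marker" -exec cp -p --parents {} "$dest" \;
  )
}

# Possible use inside a job script (variable names as in the example script):
#   marker="${workdir}/.job_start"
#   touch "${marker}"          # right after the input files were copied in
#   ... computation ...
#   copy_back_newer "${workdir}" "${submitdir}" "${marker}"
```

Since `cp -p` keeps the original modification times of the staged input files, they stay older than the marker and are skipped, unless the computation modifies them.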