Parallel BZIP2
Data frequently needs to be packed and compressed for archiving or transfer. Several tools are available for this, such as tar, gzip, and bzip2. pbzip2 is a parallel implementation of bzip2 that uses multiple CPU cores for compression. For general information see bzip2 and pbzip2. The tool is available on all nodes without loading a module.
Usage
Parallel packing and compressing can be performed on a compute node by piping the output of tar into pbzip2 with a specified number of threads. As an example, a file or directory /data/to/pack can be packed and compressed into a file /packed/file.tar.bz2 using the following job script:
#!/bin/bash
#SBATCH --job-name="pbzip2"
#SBATCH --time=01:00:00
#SBATCH --mem-per-cpu=2G
## For parallel jobs with 8 cores
#SBATCH --cpus-per-task=8 ## select the number of cores required

source="/data/to/pack" ## specify the file or directory to compress
target="/packed/file.tar.bz2" ## specify the output directory and filename

# Pack $source into a tar stream (-S stores sparse files efficiently)
# and compress it with pbzip2, using one thread per allocated core
srun tar -cSf - "$source" | pbzip2 -p"$SLURM_CPUS_PER_TASK" > "$target"

# Generate a SHA-256 checksum to check the integrity of the archive later
sha256sum "$target" > "${target}.sha256sum"
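
The checksum written by the job script can be used later to check the archive before unpacking it. The following is a minimal sketch of that step, not part of the original instructions. The function name verify_and_unpack and the destination directory are hypothetical. It assumes that the .sha256sum file sits next to the archive and records the archive path as it was given when the checksum was created. If pbzip2 is not found, it falls back to bzip2, which can read pbzip2 output (without parallelism).

```shell
#!/bin/bash
# Hypothetical helper (not part of the original job script):
# verify the SHA-256 checksum of an archive, then decompress and unpack it.
# Usage: verify_and_unpack ARCHIVE DEST_DIR [THREADS]
verify_and_unpack() {
    local archive="$1"
    local dest="$2"
    # Default to the Slurm allocation if present, otherwise a single thread
    local threads="${3:-${SLURM_CPUS_PER_TASK:-1}}"

    # sha256sum -c recomputes the hash of the file named in the checksum
    # file and returns a non-zero status if it does not match
    sha256sum -c "${archive}.sha256sum" || return 1

    mkdir -p "$dest" || return 1

    if command -v pbzip2 >/dev/null 2>&1; then
        # -d decompress, -c write to stdout, -p<N> number of processors
        pbzip2 -dc -p"$threads" "$archive" | tar -xf - -C "$dest"
    else
        # Assumed fallback: serial decompression with plain bzip2
        bzip2 -dc "$archive" | tar -xf - -C "$dest"
    fi
}
```

For example, verify_and_unpack /packed/file.tar.bz2 /restore/dir would unpack the archive from the job script into the hypothetical directory /restore/dir, provided the checksum matches. Checking the checksum first avoids spending compute time on an archive that was damaged during transfer.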