Using node-local SSDs#
The clusters Alex, TinyFat, TinyGPU, and Woody have node-local SSDs that can be used as scratch space or as cache to increase I/O bandwidth, e.g. for training.
The node-local SSDs are accessible via the environment variable `$TMPDIR`.

Staging data in means copying files to `$TMPDIR` from a slower filesystem, e.g. `$WORK`, at the beginning of a job and using the data from there.

Staging data out means copying data from `$TMPDIR` to `$WORK` or any other filesystem. This is only necessary if you created or updated data on `$TMPDIR` during the job and want to keep it.
Staging data in and out#
The following example script stages data in to `$TMPDIR` at the beginning of the job and optionally stages data out at the end of the job:
#!/bin/bash -l
#SBATCH --gres=gpu:<GPU>:<NGPUS>
#SBATCH --time=<TIME>
# ...
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
module add python
conda activate YOUR_ENVIRONMENT
# TODO: replace with your code to stage data to $TMPDIR

# extract an archive, e.g. a dataset, to $TMPDIR;
# use this if your dataset contains several thousand files:
tar xf "$WORK/dataset.tar" -C "$TMPDIR"

# copy data to $TMPDIR;
# use this if you use archive data formats, e.g. hdf5 or similar,
# and your dataset contains < 100 files:
# cp -r "$WORK/your-datasets" "$TMPDIR"
python3 train.py --dataset-path "$TMPDIR" --workdir "$TMPDIR" ...
# OPTIONAL: copy new/updated data to $WORK before it gets deleted
cp -r "$TMPDIR/results" "$WORK"
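The `-C` option matters when extracting: `tar` extracts *into* the directory given after `-C`. A minimal, self-contained sketch of this stage-in pattern, using throwaway directories in place of `$WORK` and `$TMPDIR` (all file and directory names here are illustrative):

```shell
# create stand-ins for $WORK and $TMPDIR (illustrative only)
work=$(mktemp -d); tmp=$(mktemp -d)

# build a tiny example archive on the "slow" filesystem
mkdir -p "$work/dataset"
echo "sample" > "$work/dataset/a.txt"
tar cf "$work/dataset.tar" -C "$work" dataset

# stage in: extract INTO the fast node-local directory
tar xf "$work/dataset.tar" -C "$tmp"

staged=$(cat "$tmp/dataset/a.txt")
echo "$staged"   # -> sample

rm -rf "$work" "$tmp"
```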
Share staged data with concurrently running jobs on the same node#
This allows you to stage data to `$TMPDIR` only once per node and share this data with all of your jobs that are running concurrently on this node.

There is no reliable way to schedule your jobs concurrently onto the same node. If by chance your jobs are scheduled onto the same node, this approach saves you from copying the data multiple times. If the jobs are running on separate nodes, it has no benefit, but also does not cause any overhead.

Do not force your jobs onto a specific node. Slurm is not obliged to run them at the same time; forcing a node will just cause much longer waiting times.
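For instance, you might submit several jobs of the same class in a loop and simply let Slurm place them; the script name `train_job.sh` and the `--seed` option below are made-up examples:

```shell
# submit several jobs of the same class; Slurm may or may not co-locate them
submitted=0
for seed in 1 2 3; do
    echo "sbatch train_job.sh --seed $seed"   # remove 'echo' to actually submit
    submitted=$((submitted + 1))
done
echo "submitted $submitted jobs"
```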
Prerequisites:

- You stage data to `$TMPDIR`.
- You access data on `$TMPDIR` read-only.
- A certain class of your jobs uses the same data.
Result:

If a new job is started on a node where one of your jobs is already running, the new job does not have to stage any data and can use the already existing data from the node-local `$STAGING_DIR`.
Script template#
The following script template shares staged data via a directory called `$STAGING_DIR`. In the script you have to adjust the places marked with `TODO`.
What the script does:

- A job of a certain class `$JOB_CLASS` will stage its data into the `$STAGING_DIR` directory if no other job of the same class is already running.
- Additional jobs with the same `$JOB_CLASS` that are scheduled onto a node where a job of the same class is already running will use the already staged data from `$STAGING_DIR`.
Adjustments:

- Job class: `readonly JOB_CLASS="TODO"`
    - Replace `TODO` with a string that identifies a group of your jobs that share the same data.
    - For example, if you have jobs that use data set A and other jobs that use a different data set B, then for the purpose of data sharing the first jobs could have job class `A` and the latter jobs job class `B`.
- Actual data staging commands: `# TODO: place here the code to copy data to $STAGING_DIR`
    - Insert here the commands to copy data to `$STAGING_DIR`.
    - These are probably the commands you used before to copy data to `$TMPDIR`.
#!/bin/bash -l
#SBATCH --gres=gpu:<GPU>:<NGPUS>
#SBATCH --time=<TIME>
# ...
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
# TODO: give a specific name for each job class.
readonly JOB_CLASS="TODO"
# $STAGING_DIR: place shared data there
readonly STAGING_DIR="/tmp/$USER-$JOB_CLASS"
# create staging directory, abort if it fails
(umask 0077; mkdir -p "$STAGING_DIR") || { echo "ERROR: creating $STAGING_DIR failed"; exit 1; }
# only one job is allowed to stage data, if others run at the same time they
# have to wait to avoid a race
(
exec {FD}>"$STAGING_DIR/.lock"
flock "$FD"
# check if another job has staged data already
if [ ! -f "$STAGING_DIR/.complete" ]; then
# START OF STAGING
# -------------------------------------------------------
# TODO: place here the code to copy data to $STAGING_DIR
# -------------------------------------------------------
# END OF STAGING
: > "$STAGING_DIR/.complete"
fi
)
# BELOW THIS LINE DATA STAGED TO $STAGING_DIR CAN BE USED.
# run application, use data from $STAGING_DIR
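The locking logic above can be exercised locally. A self-contained sketch of the same flock-plus-marker-file pattern, simulating two jobs with a shell function and a throwaway directory in place of `/tmp/$USER-$JOB_CLASS` (names are illustrative; requires the util-linux `flock` utility):

```shell
STAGING_DIR=$(mktemp -d)   # stand-in for /tmp/$USER-$JOB_CLASS

stage() {   # simulates the staging section of one job
    (
        # open the lock file on a new descriptor and take the lock
        exec {FD}>"$STAGING_DIR/.lock"
        flock "$FD"
        # stage only if no other "job" has finished staging yet
        if [ ! -f "$STAGING_DIR/.complete" ]; then
            echo "staged by $1" > "$STAGING_DIR/data"
            : > "$STAGING_DIR/.complete"
        fi
    )
}

stage job1        # first caller stages the data
stage job2        # second caller sees .complete and skips staging
result=$(cat "$STAGING_DIR/data")
echo "$result"    # -> staged by job1
rm -rf "$STAGING_DIR"
```

Because the marker file is only created after staging finishes, a job that crashes mid-stage does not fool later jobs into using half-copied data.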