
Using node-local SSDs#

The clusters Alex, TinyFat, TinyGPU, and Woody have node-local SSDs that can be used as scratch space or as cache to increase I/O bandwidth, e.g. for training.

The node-local SSDs are accessible via the environment variable $TMPDIR.
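
For example, a job script can print the path and check the free space of the node-local SSD (a minimal sketch; the actual mount point and capacity depend on the cluster and node type):

echo "node-local scratch: $TMPDIR"
df -h "$TMPDIR"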

NFS vs node-local SSD

$TMPDIR on nodes with SSDs provides higher bandwidth and lower latency than $HOME, $HPCVAULT, and $WORK.

Staging data in means copying files to $TMPDIR from a slower filesystem, e.g. $WORK, at the beginning of a job and then using the data from there.

Staging data out means copying data from $TMPDIR to $WORK or any other filesystem. This is only necessary if you created or updated data on $TMPDIR during the job and want to keep it.

Staging data in and out#

The following example script stages data in to $TMPDIR at the beginning of the job and optionally stages data out at the end of the job:

#!/bin/bash -l
#SBATCH --gres=gpu:<GPU>:<NGPUS>
#SBATCH --time=<TIME>
# ...
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV

module add python
conda activate YOUR_ENVIRONMENT

# TODO: replace with your code to copy data to `$TMPDIR`
cp -r "$WORK/your-datasets" "$TMPDIR"
# in case you have to extract an archive, e.g. a dataset, use:
# tar xf "$WORK/dataset.tar" -C "$TMPDIR"

python3 train.py --dataset-path "$TMPDIR" --workdir "$TMPDIR" ...

# OPTIONAL: copy new/updated data to $WORK before it gets deleted
cp -r "$TMPDIR/results" "$WORK"

Share staged data with concurrently running jobs on the same node#

This allows you to stage data to $TMPDIR only once per node and share it with all of your jobs that are running concurrently on that node.

There is no reliable way to schedule your jobs concurrently on the same node

If by chance your jobs are scheduled onto the same node, this approach saves you from copying the data multiple times.

If the jobs are running on separate nodes, this has no benefit, but it also does not cause any overhead.

Do not force your jobs onto a specific node. Slurm is not obliged to run them at the same time; forcing a node will only cause much longer waiting times.

Prerequisites:

  • You stage data to $TMPDIR.
  • You access data on $TMPDIR read-only.
  • A certain class of your jobs uses the same data.

Result:

If a new job is started on a node where one of your jobs is already running, the new job does not have to stage any data and can use the data already present in the node-local $STAGING_DIR.

Script template#

The following script template shares staged data via a directory called $STAGING_DIR.

In the script you have to adjust the places marked with TODO.

What the script does:

  • a job of a certain class $JOB_CLASS stages its data into the $STAGING_DIR directory if no other job of the same class is already running
  • additional jobs with the same $JOB_CLASS that are scheduled onto a node where a job of the same class is already running use the already staged data from $STAGING_DIR

Adjustments:

  • job class: readonly JOB_CLASS="TODO"
    • Replace TODO with a string that identifies a group of your jobs that share the same data.
    • For example, if you have jobs that use data set A and other jobs that use a different data set B, then for the purpose of data sharing the first jobs could have job class A and the latter jobs job class B.
  • actual data staging commands: # TODO: place here the code to copy data to $STAGING_DIR
    • Insert here the commands to copy data to $STAGING_DIR.
    • These are probably the commands you used before to copy data to $TMPDIR; an example sketch is included at the TODO inside the template below.
Template for sharing staged data
#!/bin/bash -l
#SBATCH --gres=gpu:<GPU>:<NGPUS>
#SBATCH --time=<TIME>
# ...
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV

# TODO: give a specific name for each job class.
readonly JOB_CLASS="TODO"

# $STAGING_DIR: place shared data there
readonly STAGING_DIR="/tmp/$USER-$JOB_CLASS"

# create staging directory, abort if it fails
(umask 0077; mkdir -p "$STAGING_DIR") || { echo "ERROR: creating $STAGING_DIR failed"; exit 1; }

# only one job is allowed to stage data; if other jobs run at the same time
# they have to wait to avoid a race
(
  exec {FD}>"$STAGING_DIR/.lock"
  flock "$FD"

  # check if another job has staged data already
  if [ ! -f "$STAGING_DIR/.complete" ]; then
    # START OF STAGING

    # -------------------------------------------------------
    # TODO: place here the code to copy data to $STAGING_DIR
    # -------------------------------------------------------
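    # Example (sketch, assuming your data lives under "$WORK/your-datasets"
    # as in the staging example above; adjust to your own data):
    # cp -r "$WORK/your-datasets" "$STAGING_DIR"
    # or extract an archive directly into the staging directory:
    # tar xf "$WORK/dataset.tar" -C "$STAGING_DIR"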


    # END OF STAGING 
    : > "$STAGING_DIR/.complete"
  fi
)

# BELOW THIS LINE DATA STAGED TO $STAGING_DIR CAN BE USED.


# run application, use data from $STAGING_DIR
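# Example (sketch, mirroring the training command from the first script;
# replace with your own application):
python3 train.py --dataset-path "$STAGING_DIR" --workdir "$TMPDIR" ...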