# ML I/O issues
Most I/O issues can be avoided by using `$TMPDIR`.

During a job you can use the node-local `$TMPDIR` of GPU nodes. These nodes have a local SSD which provides higher bandwidth and lower latency than you get from `$HOME`, `$HPCVAULT`, and `$WORK`. See Using node-local `$TMPDIR` for examples.
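Staging data to `$TMPDIR` at job start can be sketched as follows. This is a minimal sketch, not a complete job script: the dataset name is a placeholder, and the fallback assignments only exist so the snippet also runs outside a batch job, where `$WORK` and `$TMPDIR` are not set by the cluster environment.

```shell
#!/bin/bash
# WORK and TMPDIR are normally set by the cluster environment;
# the fallbacks below only make this sketch runnable elsewhere.
WORK="${WORK:-$PWD/work-demo}"
TMPDIR="${TMPDIR:-$(mktemp -d)}"

# Create a tiny placeholder archive so the sketch is self-contained;
# on the cluster, your real dataset archive already lives on $WORK.
mkdir -p "$WORK/dataset"
echo "sample" > "$WORK/dataset/sample0.txt"
tar cf "$WORK/dataset.tar" -C "$WORK" dataset

# Stage once to the node-local SSD at job start ...
cp "$WORK/dataset.tar" "$TMPDIR/"
tar xf "$TMPDIR/dataset.tar" -C "$TMPDIR"

# ... then point your training code at the fast local copy, e.g.:
# python train.py --data-dir "$TMPDIR/dataset"
ls "$TMPDIR/dataset"
```

The key point is that the shared filesystem is touched only once, for a single large sequential read; all subsequent accesses go to the local SSD.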
## Datasets containing massive amounts of files

Some datasets contain massive amounts of files and are not suited to be stored unpacked on `$HPCVAULT` or `$WORK`. Instead, keep them packed in an archive like `tar` or `zip` and only unpack them to `$TMPDIR` when your job starts. See Using node-local `$TMPDIR` for examples. Depending on your data, compression may or may not pay off. If in doubt, do your own benchmarks.
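Such a benchmark can be as simple as packing a sample of your data both ways and comparing time and size. The snippet below is a sketch using a synthetic, highly compressible placeholder file; real results depend entirely on your data (already-compressed images, for example, gain little from gzip):

```shell
# Pack a sample both ways and compare wall time and archive size.
# "dataset/" is a synthetic placeholder; use a sample of your real data.
tmp=$(mktemp -d); cd "$tmp"
mkdir dataset
yes "some repetitive sample data" | head -n 100000 > dataset/text.txt

time tar cf  sample.tar    dataset   # plain tar: fast, no CPU cost
time tar czf sample.tar.gz dataset   # gzip: smaller, costs CPU time

ls -l sample.tar sample.tar.gz       # compare archive sizes
```

Weigh the CPU time spent compressing and decompressing against the reduced transfer volume from the shared filesystem.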
Another alternative is our NVMe Lustre storage system, which can be used as a temporary high-performance workspace on Alex. This service is currently in beta; we welcome any experience reports.
## Causes for slow filesystems
The servers providing `$HOME`, `$HPCVAULT`, and `$WORK` are shared by all users. Hence, generating heavy load on these servers impacts all users. You might experience this, for example, when logging in to our systems takes exceptionally long or you have to wait minutes until your Python/conda environment is loaded.

High load often occurs accidentally during training when massive amounts of files are used. Operations that can lead to this situation include:
- accessing massive amounts of files
- querying metadata (like creation or modification time) of lots of files
- opening/closing lots of files
- listing directories with lots of files
- creating lots of small files
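Each of these operations translates into separate metadata requests against the shared servers. As a rough illustration (the file count here is arbitrary), a single archive replaces thousands of per-file operations with one sequential read:

```shell
# Illustration: many small files vs. one archive.
tmp=$(mktemp -d); cd "$tmp"
mkdir many
for i in $(seq 1 1000); do echo "$i" > "many/f$i"; done

# Reading 1000 files issues at least 1000 open/close pairs plus
# metadata lookups -- painful on a shared filesystem ...
cat many/* > /dev/null

# ... while one archive is a single large sequential read.
tar cf many.tar many
tar xOf many.tar > /dev/null   # -O streams contents to stdout
```

On a local disk both variants feel instant; on a loaded shared filesystem, the per-file pattern is what causes the slowdowns described above.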
See Using node-local `$TMPDIR` for examples of how to use `$TMPDIR` instead.