ML I/O issues#

Most I/O issues can be avoided by using $TMPDIR

During a job you can use the node-local $TMPDIR on the GPU nodes. These nodes have a local SSD that provides higher bandwidth and lower latency than $HOME, $HPCVAULT, and $WORK.

See Using node-local $TMPDIR for examples.
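As a minimal sketch of this pattern (the dataset name and its location under $WORK are assumptions, not fixed paths), a job can stage its input to $TMPDIR once at startup and read only from the local SSD afterwards:

```python
import os
import shutil

# Inside a job, $TMPDIR points to the node-local SSD.
tmpdir = os.environ["TMPDIR"]

# Hypothetical dataset location on the shared $WORK filesystem.
src = os.path.join(os.environ["WORK"], "datasets", "my_dataset")
dst = os.path.join(tmpdir, "my_dataset")

# Copy once at job start; all later reads hit the local SSD instead
# of the shared file servers. For datasets with massive amounts of
# files, prefer staging a packed archive (see the next section).
shutil.copytree(src, dst)
```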

Datasets containing massive amounts of files#

Some datasets consist of massive amounts of files and should not be stored unpacked on $HPCVAULT or $WORK.

Instead, keep them packed in an archive like tar or zip and only unpack them to $TMPDIR when your job starts. See Using node-local $TMPDIR for examples.

Depending on your data, compression may or may not pay off. If in doubt, run your own benchmarks.
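As an illustration (archive name and layout are assumptions), unpacking such an archive to $TMPDIR at the start of a job could look like this; choosing `"r:"` (uncompressed) versus `"r:gz"` (gzip) is exactly the compression trade-off to benchmark:

```python
import os
import tarfile

tmpdir = os.environ["TMPDIR"]

# Hypothetical packed dataset kept on $WORK.
archive = os.path.join(os.environ["WORK"], "datasets", "my_dataset.tar")

# Use "r:" for an uncompressed tar, "r:gz" for a gzip-compressed one;
# benchmark both variants for your data.
with tarfile.open(archive, "r:") as tar:
    tar.extractall(path=tmpdir)

# Assumes the archive contains a top-level "my_dataset" directory.
data_dir = os.path.join(tmpdir, "my_dataset")
```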

Causes for slow filesystems#

The servers providing $HOME, $HPCVAULT, and $WORK are shared by all users. Hence, generating heavy load on these servers impacts all users.

You might experience this, for example, when logging in to our systems takes exceptionally long or loading your Python/conda environment takes minutes.

High load is often generated accidentally during training when massive amounts of files are used. Operations that can lead to this situation include:

  • accessing massive amounts of files
  • querying metadata (like creation or modification time) of lots of files
  • opening/closing lots of files
  • listing directories with lots of files
  • creating lots of small files

See Using node-local $TMPDIR for examples on how to use $TMPDIR instead.
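If staging to $TMPDIR is not possible, one way to sidestep these metadata-heavy patterns (a sketch, not a prescribed API; the path is hypothetical) is to read samples sequentially from a single uncompressed tar archive instead of opening lots of individual files:

```python
import tarfile

# A single open() on one large file replaces per-sample
# open/stat/close calls on the shared file servers.
with tarfile.open("/path/to/my_dataset.tar", "r:") as tar:
    for member in tar:  # iterates the archive's index, no directory listings
        if not member.isfile():
            continue
        data = tar.extractfile(member).read()
        # ... hand `data` to your preprocessing/training pipeline
```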

Further information#