# ML I/O issues
Most I/O issues can be avoided by using `$TMPDIR`.

During a job you can use the node-local `$TMPDIR` of GPU nodes. These nodes have a local SSD which provides higher bandwidth and lower latency than you get from `$HOME`, `$HPCVAULT`, and `$WORK`. See Using node-local `$TMPDIR` for examples.
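Staging data to `$TMPDIR` at job start can be sketched as follows. This is a minimal sketch, not a complete job script: the dataset name is a placeholder, and the fallback assignments only exist so the snippet also runs outside a batch job, where `$WORK` and `$TMPDIR` are not set by the cluster environment.

```shell
#!/bin/bash
# WORK and TMPDIR are normally set by the cluster environment;
# the fallbacks below only make this sketch runnable elsewhere.
WORK="${WORK:-$PWD/work-demo}"
TMPDIR="${TMPDIR:-$(mktemp -d)}"

# Create a tiny placeholder archive so the sketch is self-contained;
# on the cluster, your real dataset archive already lives on $WORK.
mkdir -p "$WORK/dataset"
echo "sample" > "$WORK/dataset/sample0.txt"
tar cf "$WORK/dataset.tar" -C "$WORK" dataset

# Stage once to the node-local SSD at job start ...
cp "$WORK/dataset.tar" "$TMPDIR/"
tar xf "$TMPDIR/dataset.tar" -C "$TMPDIR"

# ... then point your training code at the fast local copy, e.g.:
# python train.py --data-dir "$TMPDIR/dataset"
ls "$TMPDIR/dataset"
```

The key point is that the shared filesystem is touched only once, for a single large sequential read; all subsequent accesses go to the local SSD.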
## Datasets containing massive amounts of files

Some datasets contain massive amounts of files and are not suited to be stored unpacked on `$HPCVAULT` or `$WORK`. Instead, keep them packed in an archive like `tar` or `zip` and only unpack them to `$TMPDIR` when your job starts. See Using node-local `$TMPDIR` for examples. Depending on your data, compression may or may not pay off. If in doubt, do your own benchmarks.
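Such a benchmark can be as simple as packing a sample of your data both ways and comparing time and size. The snippet below is a sketch using a synthetic, highly compressible placeholder file; real results depend entirely on your data (already-compressed images, for example, gain little from gzip):

```shell
# Pack a sample both ways and compare wall time and archive size.
# "dataset/" is a synthetic placeholder; use a sample of your real data.
tmp=$(mktemp -d); cd "$tmp"
mkdir dataset
yes "some repetitive sample data" | head -n 100000 > dataset/text.txt

time tar cf  sample.tar    dataset   # plain tar: fast, no CPU cost
time tar czf sample.tar.gz dataset   # gzip: smaller, costs CPU time

ls -l sample.tar sample.tar.gz       # compare archive sizes
```

Weigh the CPU time spent compressing and decompressing against the reduced transfer volume from the shared filesystem.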
Another alternative is our NVMe Lustre storage system, which can be used as a temporary high-performance workspace on Alex. This service is currently in beta; we welcome any experience reports.
## Causes for slow filesystems
The servers providing `$HOME`, `$HPCVAULT`, and `$WORK` are shared by all users. Hence, generating heavy load on these servers impacts all users. You might experience this, for example, when logging in to our systems takes exceptionally long or you have to wait minutes until your Python/conda environment is loaded.

High load often occurs accidentally during training when massive amounts of files are used. Operations that can lead to this situation include:
- accessing massive amounts of files
- querying metadata (like creation or modification time) of lots of files
- opening/closing lots of files
- listing directories with lots of files
- creating lots of small files
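Each of these operations translates into separate metadata requests against the shared servers. As a rough illustration (the file count here is arbitrary), a single archive replaces thousands of per-file operations with one sequential read:

```shell
# Illustration: many small files vs. one archive.
tmp=$(mktemp -d); cd "$tmp"
mkdir many
for i in $(seq 1 1000); do echo "$i" > "many/f$i"; done

# Reading 1000 files issues at least 1000 open/close pairs plus
# metadata lookups -- painful on a shared filesystem ...
cat many/* > /dev/null

# ... while one archive is a single large sequential read.
tar cf many.tar many
tar xOf many.tar > /dev/null   # -O streams contents to stdout
```

On a local disk both variants feel instant; on a loaded shared filesystem, the per-file pattern is what causes the slowdowns described above.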
See Using node-local `$TMPDIR` for examples of how to use `$TMPDIR` instead.