FAQ#
General information#
What should I use as acknowledgment in publications?
Formulations can be found under Acknowledgment.
I have questions regarding HPC export controls. Who can advise me?
Advice is available for FAU members at exportkontrolle@fau.de.
External users should contact the relevant department at their home institution.
If you are unsure who is responsible for you, you can also contact us.
Accessing HPC#
How can I get access to HPC systems?
Depending on your status, there are different ways to get an HPC account. More information is available under Getting an account.
How can I access the Alex and Fritz clusters?
FAU staff and students who already have an HPC account can request access to Alex here: https://hpc.fau.de/tier3-access-to-alex/ and access to Fritz here: https://hpc.fau.de/tier3-access-to-fritz/.
Access is restricted to projects whose demands exceed what is feasible on TinyGPU or Woody/Meggie but still stay below the NHR thresholds. You have to demonstrate this and provide a short description of what you want to do there.
External scientists have to submit an NHR proposal to get access.
SSH#
SSH is asking for a password, but I do not have one.
HPC accounts that were created through the HPC portal do not have a password. Authentication is done through SSH keys only.
The following steps are necessary before you can log into the clusters:
- generate an SSH key pair via OpenSSH/command line or via MobaXTerm (Windows)
- upload your public SSH key to the HPC portal
- configure your SSH connection to use SSH keys for authentication with OpenSSH/command line or via MobaXTerm (Windows)
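A minimal OpenSSH sketch of the first and third step (the host name csnhr.nhr.fau.de is only an example for the dialog server; adjust it to the system you actually connect to, and see the linked guides for MobaXTerm):

```bash
# step 1: generate an ed25519 key pair (stored as ~/.ssh/id_ed25519[.pub] by default)
ssh-keygen -t ed25519
# step 2: upload ~/.ssh/id_ed25519.pub in the HPC portal
# step 3: point SSH to the key, e.g. in ~/.ssh/config:
#   Host csnhr.nhr.fau.de
#       IdentityFile ~/.ssh/id_ed25519
```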
How can I access the cluster frontends?
Almost all HPC systems at NHR@FAU use private IPv4 addresses that can only be accessed directly from within the FAU network. There are several options for outside users to connect. More information is available in Access to NHR@FAU systems.
Whichever option you choose, you need to use SSH to connect to the clusters. Documentation on how to set up SSH is available for OpenSSH/command line usage and MobaXTerm (Windows).
I managed to log in to csnhr (with an SSH key) but get asked for a password / permission denied when continuing to a cluster frontend
The dialog server csnhr does not have access to your private SSH key and therefore SSH key-based authentication fails when connecting to one of the cluster frontends.
There are a couple of solutions to mitigate this:
- Use the proxy jump feature of SSH to connect directly from your computer to the cluster frontends; the connection is then automatically tunneled through csnhr. We provide templates for SSH and a guide for MobaXTerm (Windows); see also the sketch below.
- Create an additional SSH key pair on csnhr and add the corresponding SSH public key to the HPC portal.
- Use an SSH agent on your local computer and allow it to forward its connection.
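A hedged ~/.ssh/config sketch of the proxy jump variant (host and user names are examples; the official templates linked above are authoritative):

```
# ~/.ssh/config on your local machine
Host csnhr
    HostName csnhr.nhr.fau.de
    User your_hpc_account

Host fritz
    HostName fritz.nhr.fau.de
    User your_hpc_account
    ProxyJump csnhr
```

With this in place, "ssh fritz" connects to the Fritz frontend and is tunneled through csnhr automatically.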
Debugging SSH connection issues
See SSH Troubleshooting for several options.
How can I access an application on a cluster node through port forwarding?
Some applications running on cluster nodes provide a web application, e.g. Jupyter Notebooks. To access these applications directly from your computer's browser, port forwarding is required. See connecting to cluster nodes for the necessary configuration.
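A hedged OpenSSH example, run on your local machine (the frontend name, the node name a0123, and port 8888 are placeholders):

```bash
# forward local port 8888 to port 8888 on compute node a0123 via the frontend
ssh -L 8888:a0123:8888 your_hpc_account@fritz.nhr.fau.de
# then open http://localhost:8888 in your local browser
```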
HPC Portal#
I received an invite mail from the HPC-Portal but there is no account data visible
We can only match invitations that have been sent to the email address that is transmitted via SSO.
To check this, log in to the HPC portal and click on your SSO name in the upper right corner. Go to "Profile". The transmitted email address is visible below "Personal data".
If this address does not match the one from your invitation, please ask for the invitation to be resent to the correct email address.
What is the password for my HPC account? How can I change my password?
For HPC accounts that are managed through the HPC portal, there is no password. Access to the HPC systems is by SSH keys only, which have to be uploaded to the portal. More information is available on generating SSH keys and uploading SSH keys.
My recently updated SSH key (from the HPC portal) is not accepted
Updated SSH keys take a couple of hours, typically 2-4, to be propagated to all HPC systems. As the clusters are synchronized at different points in time, it may happen that one system already knows the updated key while others don't.
My HPC account has just been created but I cannot log in or Slurm rejects my jobs
After account creation it takes until the next morning before the account becomes usable, i.e., until all file system folders have been created and the Slurm database on the clusters has been updated. Thus, please be patient.
How can I access ClusterCockpit / monitoring.nhr.fau.de when I don't have a password?
For HPC portal users (i.e., who have accounts without a password), the job-specific monitoring of ClusterCockpit is only accessible via the HPC portal.
Please login to the HPC portal and follow the link to ClusterCockpit in your account details to generate a valid ClusterCockpit session. Sessions are valid for several hours/days.
I manage a Tier3-project in the HPC portal. Which of the project categories is the correct one for the new account I want to add?
A description of the different project categories can be found here.
Batch system Slurm#
Why is my job not starting after I have submitted it to the queue?
The batch system automatically assigns a priority to each waiting job. This priority value depends on certain parameters like waiting time, partition, user group, and recently used CPU/GPU time (a.k.a. fairshare).
If you have been submitting many jobs lately and have thus used a large amount of compute time, your priority will decrease so that other users' jobs can run, too.
How can I run my job on the cluster?
All computationally intensive work has to be submitted to the cluster nodes via the batch system (Slurm), which handles the resource distribution among different users and jobs according to a priority scheme.
For general information about how to use the batch system, see Slurm batch system.
Example batch scripts can be found here:
- general job script examples
- cluster-specific job scripts in the respective cluster documentation
- job scripts for some specific applications
Slurm options get ignored when given as sbatch command line arguments
The syntax of sbatch is:
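```
sbatch [options] job_script [arguments passed to job_script]
```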
Thus, options for sbatch have to be given before the batch script. Arguments given after the batch script are used as arguments for the script and not for sbatch.
How can I request an interactive job on a cluster?
Interactive jobs can be requested by using salloc and specifying the respective options on the command line.
They are useful for testing or debugging of your application.
More information on interactive jobs is available here:
- general documentation on salloc
- cluster-specific example in the respective cluster documentation
Settings from the calling shell (e.g. loaded module paths) will automatically be inherited by the interactive job. To avoid issues in your interactive job, purge all loaded modules via module purge before issuing the salloc command.
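A minimal sketch of an interactive allocation (node count and time are examples; partition or GPU options are cluster-specific, see the respective cluster documentation):

```bash
module purge
salloc --nodes=1 --time=01:00:00
```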
How can I attach to a running Slurm job?
See our documentation under attach to a running job.
Attaching to a running job can be used as an alternative to connecting to the node via SSH.
If you have multiple GPU jobs running on the same compute node, attaching via srun is the only way to see the correct GPU for a specific job.
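A hedged sketch (the job ID 123456 is a placeholder; the --overlap option requires a reasonably recent Slurm version):

```bash
# open an interactive shell inside the allocation of the running job 123456
srun --jobid=123456 --overlap --pty /bin/bash
```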
Error module: command not found
If the module command cannot be found, this usually means that you did not invoke the bash shell with the option -l (lower case L) for a login shell.
Thus, job scripts, etc. should always start with the following line:
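```bash
#!/bin/bash -l
# the -l option starts a login shell, which makes the module command available
```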
How can I request a specific type of A100 GPU?
For the A100 GPUs with 80 GB memory, use -C a100_80 in the Slurm script. Alternatively, use -C a100_40 for the A100 GPUs with 40 GB memory.
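A hedged job-script sketch (the --gres line is an example request for a single GPU and is an assumption; adjust the count and check the cluster documentation for the exact GPU request syntax):

```bash
#SBATCH --gres=gpu:a100:1
#SBATCH -C a100_80
```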
Software#
The software I need is not installed. What can I do now?
All software on NHR@FAU systems, e.g. (commercial) applications, compilers, and libraries, is provided using environment modules. These modules are used to set up a custom environment when working interactively or inside batch jobs.
For available software see:
Most software is centrally installed using Spack. By default, only a subset of packages installed via Spack is shown. To see all installed packages, load the 000-all-spack-pkgs module.
You can install software yourself by using the user-spack functionality.
Containers, e.g. Docker, are supported via Apptainer.
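A hedged example of running a Docker image through Apptainer (the image name and command are placeholders):

```bash
# convert a Docker Hub image into an Apptainer image file and run a command in it
apptainer pull ubuntu.sif docker://ubuntu:22.04
apptainer exec ubuntu.sif cat /etc/os-release
```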
Why does my application give an http/https timeout?
By default compute nodes cannot access the internet directly. This will result in connection timeouts.
To circumvent this you have to configure a proxy server. Enter the following commands either in an interactive job or add them to your job script:
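A sketch of the proxy settings (the proxy address proxy:80 is an assumption; check our cluster documentation for the address valid on our systems):

```bash
export http_proxy=http://proxy:80
export https_proxy=http://proxy:80
```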
Some applications may expect the variables in capital letters instead:
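```bash
# uppercase variants, pointing to the same proxy as above
export HTTP_PROXY=$http_proxy
export HTTPS_PROXY=$https_proxy
```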
Why is my application not using the GPU?
If you are using PyTorch or TensorFlow, your installation might not support GPUs. See installation of PyTorch and TensorFlow for the correct procedure.
Otherwise, CUDA or CUDA-related libraries could be missing. Ensure you have the required module(s) in the correct version loaded, like cuda, cudnn, or tensorrt.
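A hedged example (the available versions differ per cluster; loading without a version picks the default):

```bash
module avail cuda          # list the CUDA versions installed on this cluster
module load cuda cudnn     # load the default CUDA and cuDNN modules
```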
How to fix conda error NoWriteEnvsDirError or NoWritePkgsDirError
Some conda configuration is missing. After loading the python module, execute the command matching your error (see the sketch below):
- for error NoWriteEnvsDirError: add a writable directory for conda environments
- for error NoWritePkgsDirError: add a writable directory for the conda package cache
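A sketch of the two commands, using the paths that the conda info check below expects:

```bash
# for NoWriteEnvsDirError:
conda config --add envs_dirs $WORK/software/private/conda/envs
# for NoWritePkgsDirError:
conda config --add pkgs_dirs $WORK/software/private/conda/pkgs
```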
Check with conda info that the path $WORK/software/private/conda/envs or $WORK/software/private/conda/pkgs is included in the output.
File systems / data storage#
Why is my data taking up twice as much space on the file systems?
Currently, all data on /home/hpc and /home/vault is replicated to two different disc arrays. Unfortunately, due to the way this is implemented, everything you store is at the moment counted towards your quota usage twice. For example, if you store 1 GB of data on $HOME or $HPCVAULT, you will currently use 2 GB of your quota. We have temporarily doubled all quotas to accommodate for that.
See our post on quota for more details.
How can I share data between HPC accounts?
See sharing data for details.
How can I use node-local storage $TMPDIR on Alex or TinyGPU to increase job performance?
Each node has at least 1.8 TB of local SSD capacity for temporary files under $TMPDIR.
$TMPDIR will be deleted automatically when the job ends. Data to be kept must be copied to one of our network filesystems.
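A hedged job-script sketch of staging data through $TMPDIR (all paths and the application name are placeholders):

```bash
cp -r "$WORK/myproject/input" "$TMPDIR/"   # stage input onto the node-local SSD
cd "$TMPDIR"
./my_application input                     # do the I/O-heavy work locally
cp -r results "$WORK/myproject/"           # copy results back before the job ends
```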
I have to analyze over 2 million files in my job. What can I do?
Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.
If supported by the application, use containerized formats (e.g. HDF5) or file-based databases. Otherwise, pack your files into an archive (e.g. tar with optional compression) and use node-local storage that is accessible via $TMPDIR.
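A hedged sketch of the archive approach (archive and directory names are placeholders):

```bash
tar xf "$WORK/dataset.tar" -C "$TMPDIR"                # unpack the many small files locally
# ... analyze the files under $TMPDIR ...
tar czf "$WORK/results.tar.gz" -C "$TMPDIR" results    # pack results before the job ends
```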
Why should I care about file systems?
Not only may efficient file operations speed up your own code (if file I/O is what you must do); they will also reduce the burden on shared file servers and thus leave more performance headroom for other users of the resource. Hence, it is a matter of thoughtfulness to optimize file accesses even if your performance gain is marginal.
What is file system metadata?
Metadata comprises all the bookkeeping information in a file system: file sizes, permissions, modification and access times, etc. A workload that, e.g., opens and closes files in rapid succession leads to frequent metadata accesses, putting a lot of strain on any file server infrastructure. This is why a small number of users with inappropriate workload can slow down file operations to a crawl for everyone. Note also that especially parallel file systems are ill-suited for metadata-heavy operations.
What is a parallel file system?
In a parallel file system (PFS), data is distributed not only across several disks but also multiple servers in order to increase the data access bandwidth. Most PFS's are connected to the high-speed network of a cluster, and aggregated bandwidths in the TByte/s range are not uncommon. High bandwidth can, however, only be obtained with large files and streaming access.
For information on how to use a parallel file system on our clusters, please read our documentation on Parallel file system $FASTTMP.
Why the need for several file systems?
Different file systems have different features; for example, a central NFS server has massive bytes for the buck but limited data bandwidth, while a parallel file system is much faster but smaller and usually available to one cluster only. A node-local SSD, on the other hand, has the advantage of very low latency but it cannot be accessed from outside a compute node.
For further information see File systems.
Where can I store my data?
See File systems for an overview of available storage locations and their properties.
Why do I get errors when reading files on the cluster that I transferred from my workstation?
Text files are commonly stored in an operating-system-specific format. While they look the same, they differ in non-printable characters, e.g., the newline code sequence. For further information on this, see Wikipedia. All frontend nodes provide the tools to convert between these line ending code sequences:
Source | Destination | Tool
---|---|---
Windows | Linux | dos2unix
MacOS | Linux | mac2unix
Linux | Windows | unix2dos
Linux | MacOS | unix2mac
Choose the suitable converter for your environment. Each tool can either update the file in place or create a new file, as shown in the sketch below.
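Hedged examples using dos2unix (file names are placeholders; the other converters work analogously):

```bash
dos2unix file.txt                  # update file.txt in place
dos2unix -n input.txt output.txt   # keep input.txt and write the converted copy to output.txt
```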
Hardware#
What is thread or process affinity?
Modern multicore systems have a strong topology, i.e., groups of hardware threads share different resources such as cores, caches, and memory interfaces. Many performance features of parallel programs depend on where their threads and processes are running in the machine. This makes it vital to bind these threads and processes to hardware threads so that variability is reduced and resources are balanced.
See also our documentation on OpenMP thread binding and MPI process binding.
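A hedged example of explicit binding (the best settings depend on the application and the node topology; the application name is a placeholder):

```bash
export OMP_PLACES=cores            # pin each OpenMP thread to one core
export OMP_PROC_BIND=close         # keep threads close to their parent thread
srun --cpu-bind=cores ./my_app     # let Slurm bind MPI processes to cores
```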
What is SMT or hyperthreading?
Simultaneous multi-threading (SMT) allows a CPU core to run more than one software thread at the same time. These "hardware threads" a.k.a. "virtual cores" share almost all resources. The purpose of this feature is to make better use of the execution units within the core. It is rather hard to predict the benefit of SMT for real applications, so the best strategy is to try it using a well-designed, realistic benchmark case.
SMT is disabled on almost all NHR@FAU systems.
Advanced Topics#
Efficient RDMA/InfiniBand communication in Apptainer/Singularity containers
Make sure to include rdma-core, libibverbs1, etc. in your image.
Check (debug) the output (see Debugging NCCL below). For NCCL as used by PyTorch, etc., check for an error such as NCCL INFO Failed to open libibverbs.so[.1].
When running on our nodes directly (without a container), we take care that the required libraries, etc. are installed.
Debugging NCCL
The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and Networking. NCCL has an extensive set of environment variables to tune for specific usage: NCCL user guide
Setting the environment variable NCCL_DEBUG=INFO will result in massive debug output; NCCL_DEBUG=WARN will show a limited set of warnings.
- The message NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found seems to be uncritical.
- NCCL INFO Failed to open libibverbs.so[.1] indicates that you are probably running a container which does not include rdma-core and libibverbs1; thus, InfiniBand/RDMA cannot be used.
- On the A100 nodes of Alex you should see a line similar to NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_3:1/IB [RO]; OOB ib0:10.28.##.###<0> when InfiniBand/RDMA is properly detected; mlx5_1:1/RoCE (the 25 GbE device) might also appear. We are not sure if the 25 GbE RoCE device does any harm. Usage of only the InfiniBand/RDMA devices can be enforced by setting the environment variable export NCCL_IB_HCA="=mlx5_0:1,mlx5_3:1" - the device names must match the mlx5.../IB output from the NCCL INFO NET/IB line!
- On the A40 nodes of Alex, there is only the 25 GbE RoCE device available, resulting in a line similar to NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE - THIS OUTPUT HAS NOT YET BEEN VERIFIED!