FAQ#

General information#

What should I use as acknowledgment in publications?

Formulations can be found under Acknowledgment.

I have questions regarding HPC export controls. Who can advise me?

Advice is available for FAU members at exportkontrolle@fau.de.

External users should contact the relevant department at their home institution.

If you are unsure who is responsible for you, you can also contact us.

Accessing HPC#

How can I get access to HPC systems?

Depending on your status, there are different ways to get an HPC account. More information is available under Getting an account.

How can I access the Alex and Fritz clusters?

FAU staff and students who already have an HPC account can request access to Alex here: https://hpc.fau.de/tier3-access-to-alex/ and access to Fritz here: https://hpc.fau.de/tier3-access-to-fritz/.

Access is restricted to projects with extended demands that are not feasible on TinyGPU or Woody/Meggie but are still below the NHR thresholds. You have to demonstrate this and provide a short description of what you want to do.

External scientists have to submit an NHR proposal to get access.

SSH#

SSH is asking for a password, but I do not have one.

HPC accounts that were created through the HPC portal do not have a password. Authentication is done through SSH keys only.

The following steps are necessary before you can log into the clusters:

  • Generate an SSH key pair on your local computer.
  • Upload the SSH public key to the HPC portal.
  • Wait a few hours until the key has been propagated to all HPC systems.

How can I access the cluster frontends?

Almost all HPC systems at NHR@FAU use private IPv4 addresses that can only be accessed directly from within the FAU network. There are several options for outside users to connect. More information is available in Access to NHR@FAU systems.

Whichever option you choose, you need to use SSH to connect to the clusters. Documentation on how to set up SSH is available for OpenSSH/command line usage and for MobaXTerm (Windows).

I managed to log in to csnhr (with an SSH key) but get asked for a password / permission denied when continuing to a cluster frontend

The dialog server csnhr does not have access to your (private) SSH key and therefore SSH key-based authentication fails when you continue to one of the cluster frontends.

There are a couple of solutions to mitigate this:

  • Use the proxy jump feature of SSH to directly connect from your computer to the cluster frontends. The connection is then automatically tunneled through csnhr. We provide templates for SSH and a guide for MobaXTerm (Windows); see the command sketch after this list.
  • Create an additional SSH key pair on csnhr and add the corresponding SSH public key to the HPC portal.
  • Use an SSH agent on your local computer and allow it to forward its connection.
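As a minimal sketch of the proxy jump approach (the user name is a placeholder; the host names assume the usual dialog server and the Fritz frontend, so adjust them to your target cluster):

# connect to the Fritz frontend, automatically tunneled through csnhr
ssh -J <user>@csnhr.nhr.fau.de <user>@fritz.nhr.fau.de

The same can be configured permanently with a ProxyJump entry in your ~/.ssh/config; see the SSH templates in our documentation.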
Debugging SSH connection issues

See SSH Troubleshooting for several options.

How can I access an application on a cluster node through port forwarding?

Some applications running on cluster nodes provide a web application, e.g. Jupyter Notebooks. To access these applications directly from your computer's browser, port forwarding is required. See connecting to cluster nodes for the necessary configuration.
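As a hedged example (port number, node name, user, and frontend are placeholders; the exact setup is described in the linked documentation):

# forward local port 8888 to port 8888 on the compute node running the notebook
ssh -L 8888:<nodename>:8888 <user>@<frontend>

The application is then reachable in your local browser at http://localhost:8888.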

HPC Portal#

I received an invitation email from the HPC portal but no account data is visible

We can only match invitations that have been sent to the email address that is transmitted via SSO.

To check this, log in to the HPC portal and click on your SSO name in the upper right corner. Go to "Profile". The transmitted email address is visible below "Personal data".

If this address does not match the one from your invitation, please ask for the invitation to be resent to the correct email address.

What is the password for my HPC account? How can I change my password?

For HPC accounts that are managed through the HPC portal, there is no password. Access to the HPC systems is by SSH keys only, which have to be uploaded to the portal. More information is available on generating SSH keys and uploading SSH keys.
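For example, a key pair can be generated locally with OpenSSH (the ed25519 key type is a common choice, not a requirement):

# generates a private key (keep it on your computer) and a .pub public key;
# the public key is what you upload to the HPC portal
ssh-keygen -t ed25519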

My just updated SSH key (from the HPC portal) is not accepted

Updated SSH keys always take a couple of hours to be propagated to all HPC systems, typically 2-4 hours. As the clusters are synchronized at different points in time, it may happen that one system already knows the updated key while others do not.

My HPC account has just been created but I cannot login or Slurm rejects my jobs

After account creation, it takes until the next morning for the account to become usable, i.e., for all file system folders to be created and the Slurm database on the clusters to be updated. Please be patient.

How can I access ClusterCockpit / monitoring.nhr.fau.de when I don't have a password?

For HPC portal users (i.e., who have accounts without a password), the job-specific monitoring of ClusterCockpit is only accessible via the HPC portal.

Please log in to the HPC portal and follow the link to ClusterCockpit in your account details to generate a valid ClusterCockpit session. Sessions are valid for several hours/days.

I manage a Tier3-project in the HPC portal. Which of the project categories is the correct one for the new account I want to add?

A description of the different project categories can be found here.

Batch system Slurm#

Why is my job not starting after I have submitted it to the queue?

The batch system automatically assigns a priority to each waiting job. This priority value depends on certain parameters like waiting time, partition, user group, and recently used CPU/GPU time (a.k.a. fairshare).

If you have been submitting many jobs lately and thus used a large amount of compute time, your priority will decrease so that other users' jobs can run, too.
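The standard Slurm commands can help to inspect a waiting job, for example (the job ID is a placeholder):

# show the priority components (fairshare, age, ...) of a waiting job
sprio -j <jobid>
# show Slurm's current estimate of the start time
squeue -j <jobid> --start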

How can I run my job on the cluster?

All computationally intensive work has to be submitted to the cluster nodes via the batch system (Slurm), which handles the resource distribution among different users and jobs according to a priority scheme.

For general information about how to use the batch system, see Slurm batch system.

Example batch scripts can be found in our documentation.

Slurm options get ignored when given as sbatch command line arguments

The syntax of sbatch is:

sbatch [OPTIONS(0)...] [ : [OPTIONS(N)...]] script [args ...]

Thus, options for sbatch have to be given before the batch script. Arguments given after the batch script are used as arguments for the script and not for sbatch.
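For example (the option values are arbitrary placeholders):

# correct: options are placed before the job script and are processed by sbatch
sbatch --time=01:00:00 --nodes=1 job.sh
# wrong: --time and --nodes are passed to job.sh as script arguments and ignored by sbatch
sbatch job.sh --time=01:00:00 --nodes=1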

How can I request an interactive job on a cluster?

Interactive jobs can be requested by using salloc and specifying the respective options on the command line.

They are useful for testing or debugging of your application.

More information on interactive jobs is available in our documentation.

Settings from the calling shell (e.g. loaded module paths) will automatically be inherited by the interactive job. To avoid issues in your interactive job, purge all loaded modules via module purge before issuing the salloc command.
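A minimal sketch (the requested resources are placeholders; add further options such as the partition as needed):

# avoid inheriting loaded modules into the interactive job
module purge
# request one node interactively for one hour
salloc --nodes=1 --time=01:00:00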

How can I attach to a running Slurm job?

See our documentation under attach to a running job.

Attaching to a running job can be used as an alternative to connecting to the node via SSH.

If you have multiple GPU jobs running on the same compute node, attaching via srun is the only way to see the correct GPU for a specific job.
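A typical way to attach with srun looks like the following sketch (the job ID is a placeholder; the --overlap option is needed with recent Slurm versions so the new step may share the job's resources):

# open an interactive shell inside the allocation of the running job
srun --jobid=<jobid> --overlap --pty /bin/bash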

Error module: command not found

If the module command cannot be found that usually means that you did not invoke the bash shell with the option -l (lower case L) for a login shell.

Thus, job scripts, etc. should always start with

#!/bin/bash -l
How can I request a specific type of A100 GPU?

For the A100 GPU with 80 GB memory, use -C a100_80 in the Slurm script. Alternatively, use -C a100_40 for the A100 GPUs with 40 GB memory.
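In a job script this could look like the following sketch (it assumes the gpu:a100 GRES name used on Alex; adjust the number of GPUs to your needs):

#!/bin/bash -l
# request one A100 GPU and restrict the job to the 80 GB variant
#SBATCH --gres=gpu:a100:1
#SBATCH -C a100_80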

Software#

The software I need is not installed. What can I do now?

All software on NHR@FAU systems, e.g. (commercial) applications, compilers, and libraries, is provided using environment modules. These modules are used to set up a custom environment when working interactively or inside batch jobs.

For an overview of available software, see our documentation.

Most software is centrally installed using Spack. By default, only a subset of packages installed via Spack is shown. To see all installed packages, load the 000-all-spack-pkgs module. You can install software yourself by using the user-spack functionality.
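For example, to make all Spack-installed packages visible and list the available modules:

# show all Spack-installed packages in addition to the default subset
module load 000-all-spack-pkgs
module avail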

Containers, e.g. Docker, are supported via Apptainer.
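As a short, hedged example of running a Docker image through Apptainer (the image name and command are placeholders):

# convert a Docker image into a local Apptainer image file
apptainer pull myimage.sif docker://ubuntu:22.04
# run a command inside the container
apptainer exec myimage.sif cat /etc/os-release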

Why does my application give an http/https timeout?

By default, compute nodes cannot access the internet directly. This will result in connection timeouts.

To circumvent this, you have to configure a proxy server. Enter the following commands in an interactive job or add them to your job script:

export http_proxy=http://proxy:80
export https_proxy=http://proxy:80

Some applications may expect the variables in capital letters instead:

export HTTP_PROXY=http://proxy:80
export HTTPS_PROXY=http://proxy:80
Why is my application not using the GPU?

If you are using PyTorch or TensorFlow, your installation might not support GPUs. See installation of PyTorch and TensorFlow for the correct procedure.

Otherwise, CUDA or CUDA related libraries could be missing. Ensure you have the required module(s) in the correct version loaded, like cuda, cudnn, or tensorrt.

How to fix conda error NoWriteEnvsDirError or NoWritePkgsDirError

Your conda configuration is missing a writable directory for environments or packages.

After loading the python module:

  • for error NoWriteEnvsDirError execute:

    conda config --add envs_dirs $WORK/software/private/conda/envs
    
  • for error NoWritePkgsDirError execute:

    conda config --add pkgs_dirs $WORK/software/private/conda/pkgs
    

Check with conda info that the path $WORK/software/private/conda/envs or $WORK/software/private/conda/pkgs is included in the output.

File systems / data storage#

Why is my data taking up twice as much space on the file systems?

Currently, all data on /home/hpc and /home/vault is replicated to two different disc arrays. Unfortunately, due to the way this is implemented, everything you store is currently counted towards your quota twice. For example, if you store 1 GB of data on $HOME or $HPCVAULT, you will currently use 2 GB of your quota. We have temporarily doubled all quotas to compensate for this.

See our post on quota for more details.

How can I share data between HPC accounts?

See sharing data for details.
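One possible approach sketched here uses POSIX ACLs (account name and path are placeholders; please follow the linked documentation for the recommended procedure on our file systems):

# grant another HPC account read access to a directory tree
setfacl -R -m u:<otheraccount>:rX /path/to/shared/dir
# verify the resulting access control list
getfacl /path/to/shared/dir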

How can I use node-local storage $TMPDIR on Alex or TinyGPU to increase job performance?

Each node has at least 1.8 TB of local SSD capacity for temporary files under $TMPDIR.

$TMPDIR will be deleted automatically when the job ends. Data to be kept must be copied to one of our network file systems.
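A typical job script pattern therefore stages data in at the start and copies results back at the end (paths are placeholders):

# stage the input data onto the node-local SSD
cp -r $WORK/mydataset $TMPDIR/
cd $TMPDIR
# ... run the application on the local copy ...
# copy the results back to a network file system before the job ends
cp -r results $WORK/myjob-results/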

I have to analyze over 2 million files in my job. What can I do?

Please go through the presentation we provide on Using File Systems Properly; there is also a video recording available.

If supported by the application, use containerized formats (e.g. HDF5) or file-based databases. Otherwise, pack your files into an archive (e.g. tar with optional compression) and use node-local storage that is accessible via $TMPDIR.
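For example, the archive can be unpacked into $TMPDIR at the beginning of the job and processed there (the archive name is a placeholder):

# unpack the archive onto the node-local SSD instead of the network file system
tar xf $WORK/mydata.tar -C $TMPDIR
# ... process the files under $TMPDIR ...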

Why should I care about file systems?

Not only may efficient file operations speed up your own code (if file I/O is what you must do); they will also reduce the burden on shared file servers and thus leave more performance headroom for other users of the resource. Hence, it is a matter of thoughtfulness to optimize file accesses even if your performance gain is marginal.

What is file system metadata?

Metadata comprises all the bookkeeping information in a file system: file sizes, permissions, modification and access times, etc. A workload that, e.g., opens and closes files in rapid succession leads to frequent metadata accesses, putting a lot of strain on any file server infrastructure. This is why a small number of users with inappropriate workload can slow down file operations to a crawl for everyone. Note also that especially parallel file systems are ill-suited for metadata-heavy operations.

What is a parallel file system?

In a parallel file system (PFS), data is distributed not only across several disks but also multiple servers in order to increase the data access bandwidth. Most PFS's are connected to the high-speed network of a cluster, and aggregated bandwidths in the TByte/s range are not uncommon. High bandwidth can, however, only be obtained with large files and streaming access.

For information on how to use a parallel file system on our clusters, please read our documentation on Parallel file system $FASTTMP.

Why the need for several file systems?

Different file systems have different features; for example, a central NFS server offers a lot of capacity for the money but limited data bandwidth, while a parallel file system is much faster but smaller and usually available to one cluster only. A node-local SSD, on the other hand, has the advantage of very low latency, but it cannot be accessed from outside a compute node.

For further information see File systems.

Where can I store my data?

See File systems for an overview of available storage locations and their properties.

Why do I get errors when reading files on the cluster that I transferred from my workstation?

Text files are commonly stored in an operating-system-specific format. While they may look the same, they differ in non-printable characters, e.g., the line ending sequence. For further information on this, see Wikipedia. All frontend nodes provide tools to convert between these line ending sequences:

Source    Destination    Tool
Windows   Linux          dos2unix
MacOS     Linux          mac2unix
Linux     Windows        unix2dos
Linux     MacOS          unix2mac

Choose the suitable converter for your environment.

  • Update file in-place:
    dos2unix filename
    
  • Create a new file:
    dos2unix -n filename outfile
    

Hardware#

What is thread or process affinity?

Modern multicore systems have a strong topology, i.e., groups of hardware threads share different resources such as cores, caches, and memory interfaces. Many performance features of parallel programs depend on where their threads and processes are running in the machine. This makes it vital to bind these threads and processes to hardware threads so that variability is reduced and resources are balanced.

See also our documentation on OpenMP thread binding and MPI process binding.
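As a hedged sketch, OpenMP thread binding can be controlled with the standard environment variables, and MPI process binding with the launcher's binding options (the application name is a placeholder):

# pin OpenMP threads to physical cores, placing neighboring threads close together
export OMP_PLACES=cores
export OMP_PROC_BIND=close
# bind MPI ranks to cores when launching through Slurm
srun --cpu-bind=cores ./my_application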

What is SMT or hyperthreading?

Simultaneous multi-threading (SMT) allows a CPU core to run more than one software thread at the same time. These "hardware threads" a.k.a. "virtual cores" share almost all resources. The purpose of this feature is to make better use of the execution units within the core. It is rather hard to predict the benefit of SMT for real applications, so the best strategy is to try it using a well-designed, realistic benchmark case.

SMT is disabled on almost all NHR@FAU systems.

Advanced Topics#

Efficient RDMA/Infiniband communication in Apptainer/Singularity containers

Make sure to include rdma-core, libibverbs1, etc. in your image.

Check (debug) the output; see Debugging NCCL below. For NCCL as used by PyTorch etc., check for an error such as NCCL INFO Failed to open libibverbs.so[.1].

When running directly on our nodes (without a container), we take care that the required libraries are installed.
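A quick, hedged way to check whether the required library is present inside your image (the image name is a placeholder):

# look for the InfiniBand verbs library in the container's library cache
apptainer exec myimage.sif ldconfig -p | grep libibverbs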

Debugging NCCL

The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and Networking. NCCL has an extensive set of environment variables to tune for specific usage: NCCL user guide

Setting the environment variable NCCL_DEBUG=INFO will result in massive debug information; NCCL_DEBUG=WARN will show a limited set of warnings.
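For example, in your job script:

# verbose NCCL debug output
export NCCL_DEBUG=INFO
# or, for a limited set of warnings only:
# export NCCL_DEBUG=WARN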

  • The message NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found seems to be harmless.
  • NCCL INFO Failed to open libibverbs.so[.1] indicates that you are probably running a container which does not include rdma-core and libibverbs1; thus, Infiniband/RDMA cannot be used.
  • on the A100 nodes of Alex you should see a line similar to NCCL INFO NET/IB : Using [0]mlx5_0:1/IB [1]mlx5_3:1/IB [RO]; OOB ib0:10.28.##.###<0> when Infiniband/RDMA is properly detected; mlx5_1:1/RoCE (the 25 GbE device) might also appear. We are not sure if the 25 GbE RoCE device does any harm. The usage of only the Infiniband/RDMA devices can be enforced by setting the environment variable export NCCL_IB_HCA="=mlx5_0:1,mlx5_3:1" - the device names must match the mlx5.../IB output from the NCCL INFO NET/IB line!
  • on the A40 nodes of Alex, there is only the 25 GbE RoCE device available resulting in a line similar to NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE - THIS OUTPUT HAS NOT YET BEEN VERIFIED!