Formulations can be found under Acknowledgment.
Depending on your status, there are different ways to get an HPC account. More information is available under Getting an account.
Access is restricted to projects whose demands exceed what is feasible on TinyGPU or Woody/Meggie but remain below the NHR thresholds. You have to demonstrate this and provide a short description of what you want to do there.
External scientists have to submit an NHR proposal to get access.
HPC accounts that were created through the HPC portal do not have a password. Authentication is done through SSH keys only.
The following steps are necessary before you can log into the clusters:
Almost all HPC systems at NHR@FAU use private IPv4 addresses that can only be accessed directly from within the FAU network. There are several options for outside users to connect. More information is available in Access to NHR@FAU systems.
I managed to log in to csnhr (with an SSH key) but get asked for a password / permission denied when continuing to a cluster frontend
The dialog server csnhr does not know your SSH (private) key and therefore fails to do SSH key-based authentication when connecting to one of the cluster frontends.
There are a couple of solutions to mitigate this:
- Use the proxy jump feature of SSH to connect directly from your computer to the cluster frontends. The connection is then automatically tunneled through csnhr. We provide templates for SSH and a guide for MobaXTerm (Windows).
- Create an additional SSH key pair on csnhr and add the corresponding SSH public key to the HPC portal.
- Use an SSH agent on your local computer and allow it to forward its connection.
See SSH Troubleshooting for several options.
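The proxy jump option from the list above can be sketched as follows. This is only an illustration: the account name and the fully qualified host names are placeholders and not taken from this page. -J is the OpenSSH command-line form of the ProxyJump option. The script only prints the command so that it can be checked before use.

```shell
#!/bin/sh
# Placeholder values (assumptions, replace with your own account and hosts).
HPC_USER="ab12cdef"                 # hypothetical HPC account name
JUMP_HOST="csnhr.example.org"       # stands in for the dialog server csnhr
FRONTEND="frontend.example.org"     # stands in for a cluster frontend

# Assemble an ssh command that tunnels through the dialog server via -J.
# Both hops are authenticated with the SSH key on the local computer,
# so no private key has to be stored on the dialog server.
proxy_jump_cmd() {
    printf 'ssh -J %s@%s %s@%s\n' "$1" "$2" "$1" "$3"
}

PROXY_CMD=$(proxy_jump_cmd "$HPC_USER" "$JUMP_HOST" "$FRONTEND")
echo "$PROXY_CMD"
```

Running the printed command on your local computer opens a session on the frontend in one step.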
Some applications running on cluster nodes provide a web application, e.g. Jupyter Notebooks. To access these applications directly from your computer's browser, port forwarding is required. See connecting to cluster nodes for the necessary configuration.
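As a rough sketch of what such a port forwarding can look like with plain OpenSSH: the account name, host name, node name and port below are hypothetical placeholders, and the configuration recommended for the clusters may differ. The script only prints the command.

```shell
#!/bin/sh
# Placeholder values (assumptions, not taken from the documentation).
HPC_USER="ab12cdef"                 # hypothetical HPC account name
FRONTEND="frontend.example.org"     # stands in for a cluster frontend
NODE="node0123"                     # hypothetical compute node running the web application
PORT=8888                           # hypothetical port of the web application

# -L LOCAL:HOST:REMOTE forwards a local port to HOST:REMOTE as seen from the frontend.
# -N opens the tunnel without starting a remote command.
port_forward_cmd() {
    printf 'ssh -N -L %s:%s:%s %s@%s\n' "$4" "$3" "$4" "$1" "$2"
}

FORWARD_CMD=$(port_forward_cmd "$HPC_USER" "$FRONTEND" "$NODE" "$PORT")
echo "$FORWARD_CMD"
```

While the tunnel is open, the web application would be reachable in the local browser under localhost and the chosen port.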
We can only match invitations that have been sent to the email address that is transmitted via SSO.
To check this, log in to the HPC portal and click on your SSO name in the upper right corner. Go to "Profile". The transmitted email address is visible below "Personal data".
If this address does not match the one from your invitation, please ask for the invitation to be resent to the correct email address.
For HPC accounts that are managed through the HPC portal, there is no password. Access to the HPC systems is by SSH keys only, which have to be uploaded to the portal. More information is available on generating SSH keys and uploading SSH keys.
Updated SSH keys typically take 2-4 hours to be propagated to all HPC systems. As the clusters are synchronized at different points in time, one system may already know the updated key while others do not.
After account creation, the account only becomes usable the next morning, i.e., once all file system folders have been created and the Slurm database on the clusters has been updated. Please be patient.
For HPC portal users (i.e., users whose accounts have no password), the job-specific monitoring of ClusterCockpit is only accessible via the HPC portal.
Please log in to the HPC portal and follow the link to ClusterCockpit in your account details to generate a valid ClusterCockpit session. Sessions are valid for several hours/days.
I manage a Tier3-project in the HPC portal. Which of the project categories is the correct one for the new account I want to add?
A description of the different project categories can be found here.
Batch system Slurm
All computationally intensive work has to be submitted to the cluster nodes via the batch system (Slurm), which handles the resource distribution among different users and jobs according to a priority scheme.
For general information about how to use the batch system, see Slurm batch system.
Example batch scripts can be found here:
The syntax of sbatch is: sbatch [options] <job script> [arguments for the job script]
Thus, options for sbatch have to be given before the batch script. Arguments given after the batch script are used as arguments for the script and not for sbatch.
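The ordering rule can be illustrated without a cluster. The sketch below writes a small stand-in job script and runs it directly with sh instead of submitting it, to show what the script receives. The file names and the --input argument are made up for the example; --time is a regular sbatch option.

```shell
#!/bin/sh
# Write a tiny stand-in job script that only reports the arguments it receives.
JOB_SCRIPT=$(mktemp)
cat > "$JOB_SCRIPT" <<'EOF'
#!/bin/bash -l
echo "script arguments: $*"
EOF

# On a cluster, the submission would be ordered like this:
#   sbatch --time=00:10:00 "$JOB_SCRIPT" --input data.txt
# --time comes before the script and is consumed by sbatch.
# --input data.txt comes after the script and is handed to the script.
# The second part is emulated here by running the script directly.
SEEN=$(sh "$JOB_SCRIPT" --input data.txt)
echo "$SEEN"

rm -f "$JOB_SCRIPT"
```

Placing --time=00:10:00 after the script name would turn it into a script argument, and sbatch would not apply it.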
Interactive jobs can be requested by using salloc and specifying the respective options on the command line.
They are useful for testing or debugging of your application.
More information on interactive jobs is available here:
Settings from the calling shell (e.g. loaded module paths) will automatically be inherited by the interactive job. To avoid issues in your interactive job, purge all loaded modules via module purge before issuing the salloc command.
See our documentation under attach to a running job.
Attaching to a running job can be used as an alternative to connecting to the node via SSH.
If you have multiple GPU jobs running on the same compute node, attaching via srun is the only way to see the correct GPU for a specific job.
If the module command cannot be found, that usually means that you did not invoke the bash shell with the option -l (lower case L) for a login shell.
Thus, job scripts, etc. should always start with a line that invokes bash with this option, i.e. #!/bin/bash -l
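The effect of -l can be checked with a short test. This is a sketch that assumes bash is installed. It relies on the bash builtin shopt -q login_shell, which succeeds only inside a login shell. On many systems the module command is defined by the profile files that only a login shell reads, which is an assumption about the setup and not a statement about a specific cluster.

```shell
#!/bin/sh
# Start bash once with -l and once without, and ask each instance
# whether it considers itself a login shell.
if bash -l -c 'shopt -q login_shell' >/dev/null 2>&1; then
    WITH_L="login"
else
    WITH_L="non-login"
fi

if bash -c 'shopt -q login_shell' >/dev/null 2>&1; then
    WITHOUT_L="login"
else
    WITHOUT_L="non-login"
fi

echo "bash -l: $WITH_L"
echo "bash   : $WITHOUT_L"
```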
For the A100 GPU with 80 GB memory, use -C a100_80 in the Slurm script. Alternatively, use -C a100_40 for the A100 GPUs with 40 GB memory.
All software on NHR@FAU systems, e.g. (commercial) applications, compilers and libraries, is provided using environment modules. These modules are used to set up a custom environment when working interactively or inside batch jobs.
For available software see:
Most software is centrally installed using Spack. By default, only a subset of packages installed via Spack is shown. To see all installed packages, load the
You can install software yourself by using the user-spack functionality.
Containers, e.g. Docker, are supported via Apptainer.
By default compute nodes cannot access the internet directly. This will result in connection timeouts.
To circumvent this, you have to configure a proxy server. Either enter the following commands in an interactive job or add them to your job script:
Some applications may expect the variables in capital letters instead:
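A minimal sketch of both variants is given here. The proxy address is a placeholder and must be replaced with the proxy documented for the cluster. The variable names http_proxy, https_proxy, HTTP_PROXY and HTTPS_PROXY are the conventional ones that many tools read.

```shell
#!/bin/sh
# Placeholder proxy address (assumption): replace with the proxy of your cluster.
PROXY_URL="http://proxy.example.org:80"

# Lower-case variants, read by many command-line tools.
export http_proxy="$PROXY_URL"
export https_proxy="$PROXY_URL"

# Upper-case variants for applications that only check these names.
export HTTP_PROXY="$PROXY_URL"
export HTTPS_PROXY="$PROXY_URL"

echo "http_proxy=$http_proxy HTTPS_PROXY=$HTTPS_PROXY"
```

Setting both spellings is harmless and avoids having to find out which one a given application reads.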
Otherwise, CUDA or CUDA-related libraries could be missing. Ensure you have the required module(s) in the correct version loaded, like
You are missing some configuration of conda.
After loading the
conda info that the path
is included in the output.
File systems / data storage
See sharing data for details.
Each node has at least 1.8 TB of local SSD capacity for temporary files under
If supported by the application, use containerized formats (e.g. HDF5) or file-based databases. Otherwise, pack your files into an archive (e.g. tar with optional compression) and use node-local storage that is accessible via
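The archive approach can be sketched as follows. All directories are temporary stand-ins created for the example: SRC_DIR plays the role of a data set with many small files, and LOCAL_DIR plays the role of the node-local directory of a job. Using ${TMPDIR:-/tmp} as the base for the stand-ins is an assumption of this sketch.

```shell
#!/bin/sh
# Stand-in locations (assumptions for this example only).
SRC_DIR=$(mktemp -d)
LOCAL_DIR=$(mktemp -d "${TMPDIR:-/tmp}/node-local.XXXXXX")
ARCHIVE=$(mktemp "${TMPDIR:-/tmp}/dataset.XXXXXX")

# Create a few small example files.
for i in 1 2 3 4 5; do
    echo "sample $i" > "$SRC_DIR/file_$i.txt"
done

# Pack once: the shared file system then only has to serve one larger file.
tar -czf "$ARCHIVE" -C "$SRC_DIR" .

# Inside the job: unpack onto node-local storage and work on the copies there.
tar -xzf "$ARCHIVE" -C "$LOCAL_DIR"

N_FILES=$(find "$LOCAL_DIR" -type f -name 'file_*.txt' | wc -l | tr -d ' ')
echo "unpacked $N_FILES files to node-local storage"

# Clean up the stand-in directories.
rm -rf "$SRC_DIR" "$LOCAL_DIR" "$ARCHIVE"
```

In a real job the archive would stay on the shared file system and only the unpacking step would run on the compute node.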
Not only may efficient file operations speed up your own code (if file I/O is what you must do); they will also reduce the burden on shared file servers and thus leave more performance headroom for other users of the resource. Hence, it is a matter of thoughtfulness to optimize file accesses even if your performance gain is marginal.
Metadata comprises all the bookkeeping information in a file system: file sizes, permissions, modification and access times, etc. A workload that, e.g., opens and closes files in rapid succession leads to frequent metadata accesses, putting a lot of strain on any file server infrastructure. This is why a small number of users with inappropriate workloads can slow down file operations to a crawl for everyone. Note also that parallel file systems in particular are ill-suited for metadata-heavy operations.
In a parallel file system (PFS), data is distributed not only across several disks but also across multiple servers in order to increase the data access bandwidth. Most PFSs are connected to the high-speed network of a cluster, and aggregated bandwidths in the TByte/s range are not uncommon. High bandwidth can, however, only be obtained with large files and streaming access.
For information on how to use a parallel file system on our clusters, please read our documentation on Parallel file system.
Different file systems have different features; for example, a central NFS server offers a lot of capacity for the money but limited data bandwidth, while a parallel file system is much faster but smaller and usually available to one cluster only. A node-local SSD, on the other hand, has the advantage of very low latency but it cannot be accessed from outside a compute node.
For further information see File systems.
See File systems for an overview of available storage locations and their properties.
Modern multicore systems have a strong topology, i.e., groups of hardware threads share different resources such as cores, caches, and memory interfaces. Many performance features of parallel programs depend on where their threads and processes are running in the machine. This makes it vital to bind these threads and processes to hardware threads so that variability is reduced and resources are balanced.
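For OpenMP programs, one common way to request such binding is through the standard OpenMP environment variables. The following is a sketch only: the thread count and the application name are hypothetical, and the best settings depend on the application and the node topology.

```shell
#!/bin/sh
# Standard OpenMP environment variables (OpenMP 4.0 and later).
export OMP_NUM_THREADS=4      # hypothetical thread count
export OMP_PLACES=cores       # each place corresponds to one physical core
export OMP_PROC_BIND=close    # bind threads to places close to the primary thread

echo "threads=$OMP_NUM_THREADS places=$OMP_PLACES bind=$OMP_PROC_BIND"

# The application would be started afterwards, e.g.:
#   ./my_openmp_app          (hypothetical binary)
```

With these settings the OpenMP runtime keeps each thread on its assigned core instead of letting the operating system migrate it.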
Simultaneous multi-threading (SMT) allows a CPU core to run more than one software thread at the same time. These "hardware threads" a.k.a. "virtual cores" share almost all resources. The purpose of this feature is to make better use of the execution units within the core. It is rather hard to predict the benefit of SMT for real applications, so the best strategy is to try it using a well-designed, realistic benchmark case.
SMT is disabled on almost all NHR@FAU systems.