Helma#

This page is currently under construction.

Helma is currently a GPU cluster with NVIDIA H100 GPGPUs and AMD Genoa ("Zen4") host CPUs. The cluster is currently in friendly user mode and only accessible upon invitation.

GPU type (memory)	# nodes (# GPUs)	CPUs and # cores per node	main memory per node	node-local SSD	Slurm partition
4 x Nvidia H100 (94 GB HBM2e)	96 (384)	2 x AMD EPYC 9554 ("Genoa", "Zen4"), 2 x 64 cores @3.1 GHz	768 GB	15 TB	h100
4 x Nvidia H200 (141 GB HBM3e)	96 (384)	2 x AMD EPYC 9554 ("Genoa", "Zen4"), 2 x 64 cores @3.1 GHz	768 GB	15 TB	h200

For more information on "Helma", please read our press release.

Accessing Helma#

FAU HPC accounts do not have access to Helma by default. Access is limited to experienced users. Click here if you want to apply for access to Helma.

See configuring connection settings or SSH in general for configuring your SSH connection.

If successfully configured, Helma can be accessed via SSH by:

ssh helma.nhr.fau.de

You will then be redirected to one of the login nodes.

Software#

Helma runs AlmaLinux 9 that is binary compatible with Red Hat Enterprise Linux 9.

All software on NHR@FAU systems, e.g. (commercial) applications, compilers and libraries, is provided using environment modules. These modules are used to setup a custom environment when working interactively or inside batch jobs.

For available software see:

Most software is centrally installed using Spack. By default, only a subset of packages installed via Spack is shown. To see all installed packages, load the 000-all-spack-pkgs module. You can install software yourself by using the user-spack functionality.

Containers, e.g. Docker, are supported via Apptainer.

Best practices, known issues#

Machine learning frameworks:

Debugger:

GDB

Python, conda, conda environments#

Through the python module, an Conda installation is available. See our Python documentation for usage, initialization, and working with conda environments.

Info

Due to restriction in inodes on Helma, it is highly recommended to use apptainer to run your jobs.

MKL#

Intel MKL might not deliver optimal performance on Helma's AMD processors. In previous versions of Intel MKL, setting the environment variables MKL_DEBUG_CPU_TYPE=5 and MKL_CBWR=AUTO improved performance on AMD processors.

This no longer works with recent MKL versions, see also https://github.com/tensorflow/tensorflow/issues/49113 and https://danieldk.eu/Posts/2020-08-31-MKL-Zen.html. NHR@FAU does not promote these workarounds; however, if you nevertheless follow them by setting LD_PRELOAD do not forget to still set MKL_CBWR=AUTO.

Compiler#

For a general overview about compilers, optimizations flags, and targeting a certain CPU micro-architecture see the compiler documentation.

Filesystems#

In contrast to other clusters only $HOME, hnvme workspaces, and $TMPDIR are mounted on Helma compute nodes. Frontends have all HPC filesystems mounted, an overview can be found here.

Node local NVMe SSD `$TMPDIR`#

Data stored on $TMPDIR will be deleted when the job ends.

Each cluster node has a local NVMe SSD that is reachable under $TMPDIR.

For more information on how to use $TMPDIR see:

general documentation of $TMPDIR,
staging data, e.g. to speed up training,
sharing data among jobs on a node.

The SSD's capacity is 15 TB for h100 partition nodes. The storage space is shared among all jobs on a node. Hence, you might not have access to the full capacity of the SSD.

Fast NVMe storage#

Helma is connected to our NVMe Lustre (hnvme) storage, see workspaces for using it.

Note: Currently, only the /hnvme workspaces are mounted on the compute nodes, with /anvme workspaces no longer being mounted. Please verify the location of your workspace using the ws_list command. When creating a workspace on Helma it is automatically created under /hnvme.

Batch processing#

Resources are controlled through the batch system Slurm. The front ends should only be used for compiling.

For each batch job you have to specify the number of GPUs you want to use. With each GPU you get a corresponding share of the host's resources like CPU cores and memory.

For single-node jobs, the compute nodes are shared with jobs potentially from other people. However, requested GPUs and associated resource on the host are always granted exclusively.

Multi-node jobs are only available on request for NHR projects by contacting hpc-support@fau.de. Your application must be able to use multiple nodes and efficiently utilize the available GPU. Nodes in multi-node jobs are allocated exclusively and not shared with other jobs.

Available partitions and their properties:

partition	min – max walltime	GPU type (GPU memory)	min – max GPUs	CPU cores per GPU	host memory per GPU	Slurm options
`preempt`(1)	0 – 24:00:00	Nvidia H100( 94 GB HBMe)	1 – 4	32	192 GB	`--gres=gpu:h100:#` (2)
`h100`	0 – 24:00:00	Nvidia H100( 94 GB HBMe)	1 – 4	32	192 GB	`--gres=gpu:h100:#` (2)
`h200`	0 – 24:00:00	Nvidia H200( 141 GB HBMe)	1 – 4	32	192 GB	`--gres=gpu:h200:#` (2)

(1) If no partition is specified, jobs are placed into the preempt partition: - such jobs can run on any node - are limited to a single node (i.e. up to 4 GPUs) - they are guaranteed to run for at least 2h; after that they may be signaled with 900s grace time to be killed (for freeing resources for “regular” partitions)

(2) Replace # with the number of GPUs you want to request.

Interactive job#

Interactive jobs can be requested by using salloc and specifying the respective options on the command line.

The environment from the calling shell, like loaded modules, will be inherited by the interactive job.

Interactive job (single GPU)#

The following will allocate an interactive shell on a node and request one H100 GPU (--gres=gpu:h100:1) for one hour (--time=1:0:0):

salloc --gres=gpu:h100:1 --time=1:00:00

Batch job script examples#

The following examples show general batch scripts. For the following applications we have templates for Alex, which should work for Helma at least if only a single GPU is used:

Python (single GPU)#

In this example, we allocate 1 H100 GPU for 6 hours and the corresponding share of CPUs and host main memory.

When the job is started, we load the Python module and activate the conda environment we use for our Python script. After that we can execute the Python script.

#!/bin/bash -l
#
#SBATCH --gres=gpu:h100:1
#SBATCH --time=6:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV

module load python
conda activate environment-for-script

python3 train.py

MPI-parallel job (single-node)#

In this example, the executable will be run using 32 MPI processes for a total job walltime of 6 hours. The job allocates 1 H100 GPU and the corresponding share of CPUs and main memory automatically.

#!/bin/bash -l
#
#SBATCH --ntasks=32
#SBATCH --gres=gpu:h100:1
#SBATCH --time=6:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV

srun ./application

Hybrid MPI/OpenMP job (single-node)#

Warning

In recent Slurm versions, the value of --cpus-per-task is no longer automatically propagated to srun, leading to errors in the application start. This value has to be set manually via the variable SRUN_CPUS_PER_TASK.

In this example, 1 H100 GPU is allocated. The executable will be run using 2 MPI processes with 16 OpenMP threads each for a total job walltime of 6 hours. 32 cores are allocated in total and each OpenMP thread is running on a physical core.

#!/bin/bash -l

#SBATCH --ntasks=2
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:h100:1
#SBATCH --time=6:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV

# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# for Slurm version >22.05: cpus-per-task has to be set again for srun
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

srun ./hybrid_application

Multi-node Job#

In this case, your application has to be able to use more than one node and its corresponding GPUs at the same time. The nodes will be allocated exclusively for your job, i.e. you get access to all GPUs, CPUs and RAM of the node automatically.

Adjust the options --nodes and--ntasks-per-node to your application.

#!/bin/bash -l
#
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=32
#SBATCH --gres=gpu:h100:4
#SBATCH --partition=h100
#SBATCH --time=1:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV

srun ./application

Attach to a running job#

See the general documentation on batch processing.

Further information#

Performance#

On 384 Nvidia H100/94GB GPGPUs, a LINPACK performance of 16.94 PFlop/s has been measured in October 2024.

Node details#

partition	`h100`	`h200`	`cpu`
no. of nodes	96	96	?
processors	2 x AMD EPYC 9554 "Genoa"	2 x AMD EPYC 9554 "Genoa"
Microarchitecture	Zen4	Zen4
no. of cores	2 x 64 = 128	2 x 64 = 128	2x 192 = 384
L3 cache	2 x 256 MB = 512 MB	2 x 256 MB = 512 MB
memory	768 GB	768 GB
memory type	DDR5	DDR5
NPS	4	4
NUMA LDs	8	8
Infiniband HCAs	4 x HDR200	4 x HDR200
network	25 GbE	25 GbE	25 GbE
local NVMe SSD	15 TB	15 TB
Nvidia GPUs	H100	H200	-/-
no. of GPGPUs	4	4	-/-
VRAM	96	141	-/-
VRAM type	HBM2e	HBM3e	-/-
VRAM bandwidth
interconnect	HGX board with NVLink	HGX board with NVLink	-/-

Processor details#

Processor used:

partition	`h100`, `h200`, `preempt`
processor	AMD EPYC 9554 "Genoa"
Microarchitecture	Zen4
no. of cores (SMT threads)	64 (128)
SMT	disabled (1)
max. Boost frequency	3.75 GHz
base frequency	3.1 GHz
total L3 cache	256 MB
memory type	DDR5 @ 4,800 MT/s
memory channels	8
NPS setting	4
theo. socket memory bandwidth	460.8 GB/s
default TDP	320W

(1) For security reasons SMT is disabled on Helma.

Name#

The cluster is named after Wilhelmine, Margravine of Brandenburg-Bayreuth (1709–1758). Together with her husband Friedrich (1711–1763) she founded the University of Erlangen in 1743.

Financing#

The Helma-cluster has been financed by:

State of Bavaria through its High-Tech Agenda (HTA),
NHR funding of federal and state authorities (BMBF and Bavarian State Ministry of Science and the Arts, respectively),
Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU),
University of Technology Nürnberg (UTN),
University of Applied Sciences Hof, Katholische Universität Eichstätt-Ingolstadt and University of Applied Sciences Landshut.