Alex#
Alex is a GPU cluster with Nvidia A40 and A100 GPGPUs and AMD Milan ("Zen3") processors. The cluster is accessible for NHR users and on request for Tier3 users.
GPU type (memory) | # nodes (# GPUs) | CPUs and # cores per node | main memory per node | node-local SSD | Slurm partition |
---|---|---|---|---|---|
8 x Nvidia A40 (48 GB GDDR6) | 44 (352) | 2 x AMD EPYC 7713 ("Milan", "Zen3"), 2 x 64 cores @2.0 GHz | 512 GB | 7 TB | a40 |
8 x Nvidia A100 (40 GB HBM2) | 20 (160) | 2 x AMD EPYC 7713 ("Milan", "Zen3"), 2 x 64 cores @2.0 GHz | 1 TB | 14 TB | a100 |
8 x Nvidia A100 (80 GB HBM2) | 18 (144) | 2 x AMD EPYC 7713 ("Milan", "Zen3"), 2 x 64 cores @2.0 GHz | 2 TB | 14 TB | a100 |
The login nodes alex[12] have 2 x AMD EPYC 7713 ("Milan", "Zen3") processors, but no GPUs. See Further information for more technical details about the cluster.
Accessing Alex#
FAU HPC accounts do not have access to Alex by default. Request access by filling out the form at https://hpc.fau.de/tier3-access-to-alex/.
See configuring connection settings or SSH in general for configuring your SSH connection.
If successfully configured, Alex can be accessed via SSH by:
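For example (a minimal sketch; the host name `alex.nhr.fau.de` is the one assumed here, check the SSH documentation for the exact host name, and replace `yourusername` with your HPC account):

```bash
# connect to Alex; host name and user name are placeholders to adapt
ssh yourusername@alex.nhr.fau.de
```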
You will then be redirected to one of the login nodes.
Software#
Alex runs AlmaLinux 8, which is binary compatible with Red Hat Enterprise Linux 8.
All software on NHR@FAU systems, e.g. (commercial) applications, compilers, and libraries, is provided via environment modules. These modules are used to set up a custom environment when working interactively or inside batch jobs.
For available software see:
Most software is centrally installed using Spack. By default, only a subset of the packages installed via Spack is shown. To see all installed packages, load the `000-all-spack-pkgs` module.
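For example (a minimal sketch using the module name from the text above):

```bash
module avail                     # list the default subset of modules
module load 000-all-spack-pkgs   # make all Spack-installed packages visible
module avail                     # now shows the full list of packages
```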
You can install software yourself by using the user-spack functionality.
Containers, e.g. Docker, are supported via Apptainer.
Best practices, known issues#
Specific applications:
Machine learning frameworks:
Debugger:
Python, conda, conda environments#
Through the `python` module, a Conda installation is available. See our Python documentation for usage, initialization, and working with conda environments.
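A minimal sketch (the environment name `myenv` and the package list are placeholders; see the Python documentation for how conda has to be initialized on our systems):

```bash
module load python
conda create -n myenv python numpy   # one-time environment creation
conda activate myenv
```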
MKL#
Intel MKL might not deliver optimal performance on Alex's AMD processors.
In previous versions of Intel MKL, setting the environment variables `MKL_DEBUG_CPU_TYPE=5` and `MKL_CBWR=AUTO` improved performance on AMD processors. This no longer works with recent MKL versions, see also https://github.com/tensorflow/tensorflow/issues/49113 and https://danieldk.eu/Posts/2020-08-31-MKL-Zen.html. NHR@FAU does not promote these workarounds; however, if you nevertheless follow them by setting `LD_PRELOAD`, do not forget to still set `MKL_CBWR=AUTO`.
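For reference, the historical settings mentioned above look like this (a sketch only; as stated, they are no longer effective with recent MKL versions and are not promoted by NHR@FAU):

```bash
export MKL_DEBUG_CPU_TYPE=5   # ignored by recent MKL versions
export MKL_CBWR=AUTO          # keep this set even when using the LD_PRELOAD route
```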
Compiler#
For a general overview of compilers, optimization flags, and targeting a certain CPU micro-architecture, see the compiler documentation.
CPU#
Alex's front-ends and nodes have CPUs of the Zen3 micro-architecture.
For Intel oneAPI and Classic compilers use `-mavx2 -mfma` on Alex
When using Intel oneAPI or Classic compilers on Alex, we recommend using `-mavx2 -mfma` instead of `-march=native` or `-xHost`, as the latter two options might generate non-optimal code for AMD CPUs.
The following table shows the compiler flags for targeting Alex's Zen3 CPUs:
target CPU | GCC/LLVM | Intel oneAPI/Classic | NVHPC |
---|---|---|---|
auto detect | `-march=native` | not recommended | `-tp=native` |
Zen3 | `-march=znver3` | `-mavx2 -mfma` | `-tp=zen3` |
If your GCC/LLVM compilers are too old to recognize `znver3` and you cannot switch to a newer compiler, you can try `znver2` or `znver1` as target architecture.
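A minimal sketch of compiling for Zen3 with the flags from the table above (source and binary names are placeholders):

```bash
# GCC / LLVM
gcc -O3 -march=znver3 -o app app.c

# Intel oneAPI (icx): prefer explicit SIMD flags over -march=native / -xHost on AMD
icx -O3 -mavx2 -mfma -o app app.c

# NVHPC
nvc -O3 -tp=zen3 -o app app.c
```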
GPU#
With `nvcc` you can target a specific GPU with the `-gencode` flag. It is possible to specify `-gencode` multiple times, generating code for multiple targets in one binary. For more details, see the Nvidia CUDA compilers documentation.
card | compute capability | functional architecture (FA) | virtual architecture (VA) | NVCC flags |
---|---|---|---|---|
A100 | 8.0 | `sm_80` | `compute_80` | `-gencode arch=compute_80,code=sm_80` |
A40 | 8.6 | `sm_86` | `compute_86` | `-gencode arch=compute_86,code=sm_86` |
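For example, to build one binary that contains code for both GPU models on Alex (the source file name is a placeholder):

```bash
nvcc -O3 \
     -gencode arch=compute_80,code=sm_80 \
     -gencode arch=compute_86,code=sm_86 \
     -o app app.cu
```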
Multi-Process Service (MPS daemon)#
The Multi-Process Service (MPS) is an alternative, binary-compatible implementation of the CUDA Application Programming Interface (API).
The MPS runtime architecture is designed to transparently enable co-operative multi-process CUDA applications, typically MPI jobs. This can benefit performance when the GPU compute capacity is underutilized by a single application process.
Using MPS requires some preparation; the procedure is documented under Multi-Process Service.
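For orientation, the MPS control daemon is usually started and stopped as shown below (a generic CUDA sketch only; follow the Multi-Process Service documentation for the exact steps required on Alex):

```bash
nvidia-cuda-mps-control -d            # start the MPS control daemon on the node
# ... run your co-operative CUDA/MPI processes ...
echo quit | nvidia-cuda-mps-control   # shut the daemon down again
```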
Filesystems#
On all front ends and nodes the filesystems `$HOME`, `$HPCVAULT`, and `$WORK` are mounted.
For details see the filesystems documentation.
Node-local NVMe SSD $TMPDIR#
Data stored on `$TMPDIR` will be deleted when the job ends.
Each cluster node has a local NVMe SSD that is reachable under `$TMPDIR`.
For more information on how to use `$TMPDIR` see:
- general documentation of `$TMPDIR`,
- staging data, e.g. to speed up training,
- sharing data among jobs on a node.
The SSD's capacity is 7 TB for `a40` partition nodes and 14 TB for `a100` partition nodes.
The storage space is shared among all jobs on a node.
Hence, you might not have access to the full capacity of the SSD.
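A minimal sketch of staging data onto the node-local SSD inside a job (the directory and script names are placeholders):

```bash
# copy the input data from $WORK to the node-local SSD once at job start
cp -a "$WORK/dataset" "$TMPDIR/"

# let the application read from the fast local copy
python3 train.py --data "$TMPDIR/dataset"
```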
Fast NVMe storage#
Alex is connected to our NVMe Lustre (anvme) storage, see workspaces for using it.
Batch processing#
Resources are controlled through the batch system Slurm. The front ends should only be used for compiling.
For each batch job you have to specify the number of GPUs you want to use. With each GPU you get a corresponding share of the host's resources like CPU cores and memory.
For single-node jobs, the compute nodes may be shared with jobs from other people. However, requested GPUs and the associated resources on the host are always granted exclusively.
Multi-node jobs are only available on request for NHR projects by contacting hpc-support@fau.de. Your application must be able to use multiple nodes and efficiently utilize the available GPUs. Nodes in multi-node jobs are allocated exclusively and not shared with other jobs.
Available partitions and their properties:
partition | min – max walltime | GPU type (GPU memory) | min – max GPUs | CPU cores per GPU | host memory per GPU | Slurm options |
---|---|---|---|---|---|---|
`a40` | 0 – 24:00:00 | Nvidia A40 (48 GB GDDR6) | 1 – 8 | 16 | 60 GB | `--gres=gpu:a40:#` (1) |
`a100` | 0 – 24:00:00 | Nvidia A100 (40 GB HBM2) | 1 – 8 | 16 | 120 GB | `--gres=gpu:a100:#` (1) |
`a100` | 0 – 24:00:00 | Nvidia A100 (80 GB HBM2) | 1 – 8 | 16 | 120 GB | `--gres=gpu:a100:# -C a100_80` (1) |
(1) Replace `#` with the number of GPUs you want to request.
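For example (a sketch; `job.sh` is a placeholder for your batch script):

```bash
sbatch --gres=gpu:a40:2 job.sh              # two A40 GPUs
sbatch --gres=gpu:a100:1 -C a100_80 job.sh  # one 80 GB A100
```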
Interactive job#
Interactive jobs can be requested by using `salloc` and specifying the respective options on the command line. The environment from the calling shell, like loaded modules, will be inherited by the interactive job.
Interactive job (single GPU)#
The following will allocate an interactive shell on a node and request one A40 GPU (`--gres=gpu:a40:1`) for one hour (`--time=1:0:0`):
Batch job script examples#
The examples below show general batch scripts. For the following applications we provide dedicated templates for Alex:
Python (single GPU)#
In this example, we allocate 1 A40 GPU for 6 hours and the corresponding share of CPUs and host main memory.
When the job is started, we load the Python module and activate the conda environment we use for our Python script. After that we can execute the Python script.
#!/bin/bash -l
#
#SBATCH --gres=gpu:a40:1
#SBATCH --time=6:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
module load python
conda activate environment-for-script
python3 train.py
MPI parallel job (single-node)#
In this example, the executable will be run using 16 MPI processes for a total job walltime of 6 hours. The job allocates 1 A40 GPU and the corresponding share of CPUs and main memory automatically.
#!/bin/bash -l
#
#SBATCH --ntasks=16
#SBATCH --gres=gpu:a40:1
#SBATCH --time=6:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
srun ./application
Hybrid MPI/OpenMP job (single-node)#
Warning
In recent Slurm versions, the value of `--cpus-per-task` is no longer automatically propagated to `srun`, leading to errors at application start. This value has to be set manually via the variable `SRUN_CPUS_PER_TASK`.
In this example, 1 A100 GPU is allocated. The executable will be run using 2 MPI processes with 8 OpenMP threads each for a total job walltime of 6 hours. 16 cores are allocated in total and each OpenMP thread is running on a physical core.
#!/bin/bash -l
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:a100:1
#SBATCH --time=6:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# for Slurm version >22.05: cpus-per-task has to be set again for srun
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
srun ./hybrid_application
Multi-node Job (available on demand for NHR projects)#
In this case, your application has to be able to use more than one node and its corresponding GPUs at the same time. The nodes will be allocated exclusively for your job, i.e. you get access to all GPUs, CPUs and RAM of the node automatically.
Adjust the options `--nodes` and `--ntasks-per-node` to your application.
#!/bin/bash -l
#
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --gres=gpu:a100:8
#SBATCH --qos=a100multi
#SBATCH --time=1:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
srun ./application
Attach to a running job#
See the general documentation on batch processing.
Further information#
Performance#
- On 160 Nvidia A100/40GB GPGPUs, a LINPACK performance of 1.73 PFlop/s has been measured in January 2022.
- On 160 Nvidia A100/40GB plus 96 Nvidia A100/80GB GPGPUs (i.e. 32 nodes), a LINPACK performance of 2.938 PFlop/s has been measured in May 2022 resulting in place 184 of the June 2022 Top500 list and place 17 in the Green500 of June 2022.
- On 160 Nvidia A100/40GB plus 120 Nvidia A100/80GB GPGPUs (i.e. 35 nodes), a LINPACK performance of 3.24 PFlop/s has been measured in Oct. 2022 resulting in place 174 of the Nov. 2022 Top500 list and place 33 in the Green500 of Nov. 2022.
Node details#
partition | a40 |
a100 (40 GB) |
a100 (80 GB) |
---|---|---|---|
no. of nodes | 44 | 20 | 18 |
processors | 2 x AMD EPYC 7713 "Milan" | 2 x AMD EPYC 7713 "Milan" | 2 x AMD EPYC 7713 "Milan" |
Microarchitecture | Zen3 | Zen3 | Zen3 |
no. of cores | 2 x 64 = 128 | 2 x 64 = 128 | 2 x 64 = 128 |
L3 cache | 2 x 256 MB = 512 MB | 2 x 256 MB = 512 MB | 2 x 256 MB = 512 MB |
memory | 512 GB | 1 TB | 2 TB |
memory type | DDR4 | DDR4 | DDR4 |
NPS | 4 | 1 | 1 |
NUMA LDs | 8 | 8 | 8 |
Infiniband HCAs | – | 2 x HDR200 | 2 x HDR200 |
network | 25 GbE | 25 GbE | 25 GbE |
local NVMe SSD | 7 TB | 14 TB | 14 TB |
GPGPUs | |||
Nvidia GPUs | A40 | A100 | A100 |
no. of GPGPUs | 8 | 8 | 8 |
memory | 48 GB | 40 GB | 80 GB |
memory type | GDDR6 | HBM2 | HBM2 |
memory bandwidth | 696 GB/s | 1,555 GB/s | 2,039 GB/s |
interconnect | – | HGX board with NVLink | HGX board with NVLink |
The Nvidia A40 GPGPUs have a very high single-precision floating-point performance (even higher than an A100) and are much less expensive than Nvidia A100 GPGPUs. Workloads that only require single-precision floating-point operations, like many molecular dynamics applications, should therefore target the Nvidia A40 GPGPUs.
Processor details#
Processor used:
partition | a40 , a100 |
---|---|
processor | AMD EPYC 7713 "Milan" |
Microarchitecture | Zen3 |
no. of cores (SMT threads) | 64 (128) |
SMT | disabled (1) |
max. Boost frequency | 3.675 GHz |
base frequency | 2.0 GHz |
total L3 cache | 256 MB |
memory type | DDR4 @ 3,200 MHz |
memory channels | 8 |
NPS setting | 4 |
theo. socket memory bandwidth | 204.8 GB/s |
default TDP | 225W |
(1) For security reasons SMT is disabled on Alex.
Nvidia A40 and A100 details#
Nvidia GPU | A40 | A100 (SXM) |
---|---|---|
architecture | Ampere | Ampere |
compute capability | 8.6 | 8.0 |
functional architecture | sm_86 | sm_80 |
virtual architecture | compute_86 | compute_80 |
memory | 48 GB | 40 GB / 80 GB |
memory type | GDDR6 | HBM2 |
ECC | enabled | disabled |
Memory bandwidth | 696 GB/s | 1,555 GB/s / 2,039 GB/s |
Interconnect interface | PCIe Gen4 31.5 GB/s (bidirectional) | NVLink: 600GB/s |
CUDA Cores (Ampere generation) | 10,752 (84 SMs) | 6,912 (108 SMs) |
RT Cores (2nd generation) | 84 | - |
Tensor Cores (3rd generation) | 336 | 432 |
FP64 TFLOPS (non-Tensor) | 0.5 | 9.7 |
FP64 Tensor TFLOPS | - | 19.5 |
Peak FP32 TFLOPS (non-Tensor) | 37.4 | 19.5 |
Peak TF32 Tensor TFLOPS | 74.8 | 156 |
Peak FP16 Tensor TFLOPS with FP16 Accumulate | 149.7 | 312 |
Peak BF16 Tensor TFLOPS with FP32 Accumulate | 149.7 | 312 |
RT Core performance TFLOPS | 73.1 | ? |
Peak INT8 Tensor TOPS | 299.3 | 624 |
Peak INT 4 Tensor TOPS | 598.7 | 1248 |
Max power consumption | 300 W | 400 W |
Price | $$ | $$$$ |
A40 data taken from https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a40/proviz-print-nvidia-a40-datasheet-us-nvidia-1469711-r8-web.pdf (11/2021).
A100 data taken from https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf (11/2021).
Nvidia A40 GPGPU nodes#
All eight A40 GPGPUs of a node are connected to two PCIe switches. Thus, there is only limited bandwidth to the host system and also between the GPGPUs.
According to Nvidia, devices of compute capability 8.6 have 2x more FP32 operations per cycle per SM than devices of compute capability 8.0. While a binary compiled for 8.0 will run as is on 8.6, it is recommended to compile explicitly for 8.6 to benefit from the increased FP32 throughput.
GPGPU Topology of `a40` nodes
Output of `nvidia-smi topo -m` on a node from the `a40` partition. The nodes have NPS=4.
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS SYS SYS SYS SYS SYS SYS SYS SYS 48-63 3 N/A
GPU1 SYS X SYS SYS SYS SYS SYS SYS SYS SYS 32-47 2 N/A
GPU2 SYS SYS X SYS SYS SYS SYS SYS SYS SYS 16-31 1 N/A
GPU3 SYS SYS SYS X SYS SYS SYS SYS SYS SYS 0-15 0 N/A
GPU4 SYS SYS SYS SYS X SYS SYS SYS SYS SYS 112-127 7 N/A
GPU5 SYS SYS SYS SYS SYS X SYS SYS SYS SYS 96-111 6 N/A
GPU6 SYS SYS SYS SYS SYS SYS X SYS PHB PHB 80-95 5 N/A
GPU7 SYS SYS SYS SYS SYS SYS SYS X SYS SYS 64-79 4 N/A
NIC0 SYS SYS SYS SYS SYS SYS PHB SYS X PIX
NIC1 SYS SYS SYS SYS SYS SYS PHB SYS PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
Nvidia A100 GPGPU nodes#
All four or eight A100 GPGPUs of a node are directly connected with each other through an NVSwitch providing 600 GB/s GPU-to-GPU bandwidth for each GPGPU.
GPGPU Topology of `a100` nodes
Output of `nvidia-smi topo -m` on a node from the `a100` partition. The nodes have NPS=4.
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 NV12 NV12 NV12 NV12 NV12 NV12 PXB SYS SYS SYS 48-63 3 N/A
GPU1 NV12 X NV12 NV12 NV12 NV12 NV12 NV12 PXB SYS SYS SYS 48-63 3 N/A
GPU2 NV12 NV12 X NV12 NV12 NV12 NV12 NV12 SYS PXB PXB SYS 16-31 1 N/A
GPU3 NV12 NV12 NV12 X NV12 NV12 NV12 NV12 SYS PXB PXB SYS 16-31 1 N/A
GPU4 NV12 NV12 NV12 NV12 X NV12 NV12 NV12 SYS SYS SYS SYS 112-127 7 N/A
GPU5 NV12 NV12 NV12 NV12 NV12 X NV12 NV12 SYS SYS SYS SYS 112-127 7 N/A
GPU6 NV12 NV12 NV12 NV12 NV12 NV12 X NV12 SYS SYS SYS PXB 80-95 5 N/A
GPU7 NV12 NV12 NV12 NV12 NV12 NV12 NV12 X SYS SYS SYS PXB 80-95 5 N/A
NIC0 PXB PXB SYS SYS SYS SYS SYS SYS X SYS SYS SYS
NIC1 SYS SYS PXB PXB SYS SYS SYS SYS SYS X PIX SYS
NIC2 SYS SYS PXB PXB SYS SYS SYS SYS SYS PIX X SYS
NIC3 SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
Name#
The name "Alex" is a play with the name of FAU's early benefactor Alexander, Margrave of Brandenburg-Ansbach (1736-1806).
Financing#
Alex has been financed by:
- German Research Foundation (DFG) as part of INST 90/1171-1 (440719683),
- NHR funding of federal and state authorities (BMBF and Bavarian State Ministry of Science and the Arts, respectively),
- seven A100 nodes are dedicated to HS Coburg as part of the BMBF proposal "HPC4AAI" within the call "KI-Nachwuchs@FH",
- one A100 node is financed by and dedicated to an external group from Erlangen,
- and financial support of FAU to strengthen HPC activities.