Fritz#
Fritz is a parallel CPU cluster with Intel Ice Lake and Sapphire Rapids processors. It uses an InfiniBand (IB) network and has a Lustre-based parallel filesystem accessible under $FASTTMP.
The cluster is accessible for NHR users and on request for Tier3 users.
# nodes | CPUs and # cores per node | main memory per node | Slurm partition |
---|---|---|---|
992 | 2 x Intel Xeon Platinum 8360Y ("Ice Lake"), 2 x 36 cores @2.4 GHz | 256 GB | singlenode, multinode |
48 | 2 x Intel Xeon Platinum 8470 ("Sapphire Rapids"), 2 x 52 cores @2.0 GHz | 1 TB | spr1tb |
16 | 2 x Intel Xeon Platinum 8470 ("Sapphire Rapids"), 2 x 52 cores @2.0 GHz | 2 TB | spr2tb |
The login nodes fritz[1-4] have 2 x Intel Xeon Platinum 8360Y ("Ice Lake") processors with 512 GB main memory.
See Further information for more technical details about the cluster.
The remote visualization node fviz1 has 2 x Intel Xeon Platinum 8360Y ("Ice Lake") processors, 1 TB main memory, one Nvidia A16 GPU, and 30 TB of local NVMe SSD storage.
See Remote visualization for using fviz1.
Accessing Fritz#
FAU HPC accounts do not have access to Fritz by default. Request access by filling out the form at https://hpc.fau.de/tier3-access-to-fritz/.
See configuring connection settings or SSH in general for configuring your SSH connection.
If successfully configured, Fritz can be accessed via SSH by:
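For example, assuming the host name fritz.nhr.fau.de (taken here as an assumption; use the host or alias from your SSH configuration) and your HPC account name in place of yourusername:
ssh yourusername@fritz.nhr.fau.de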
You will then be redirected to one of the login nodes.
Software#
Fritz runs AlmaLinux 8 that is binary compatible with Red Hat Enterprise Linux 8.
All software on NHR@FAU systems, e.g. (commercial) applications, compilers, and libraries, is provided using environment modules. These modules are used to set up a custom environment when working interactively or inside batch jobs.
For available software see:
Most software is centrally installed using Spack. By default, only a subset of the packages installed via Spack is shown. To see all installed packages, load the 000-all-spack-pkgs module.
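A minimal sketch of working with the modules (the only module name taken from this page is 000-all-spack-pkgs; everything else is generic):
module avail                     # list the default subset of modules
module load 000-all-spack-pkgs   # expose all Spack-installed packages
module avail                     # now also shows the additional packages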
You can install software yourself by using the user-spack functionality.
Containers, e.g. Docker, are supported via Apptainer.
Best practices, known issues#
Specific applications:
Machine learning frameworks:
Debugger:
Python, conda, conda environments#
Through the python module, a Conda installation is available. See our Python documentation for usage, initialization, and working with conda environments.
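A minimal sketch, assuming the python module from above; the environment name myenv and the Python version are placeholders, and activation may require the initialization steps described in the Python documentation:
module load python
conda create -n myenv python=3.11   # myenv is a hypothetical environment name
conda activate myenv                # may require prior conda initialization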
Compiler#
For a general overview of compilers, optimization flags, and targeting one or multiple CPU micro-architectures, see the compiler documentation.
Fritz has nodes with two different CPU micro-architectures:
- Intel Ice Lake Server
- Intel Sapphire Rapids (--partition=spr1tb or --partition=spr2tb)
The frontend nodes have Ice Lake Server CPUs.
Code compiled exclusively for Sapphire Rapids CPUs cannot run on Ice Lake Server CPUs.
For best performance compile your binary for the targeted nodes. If you want to use binaries optimized for both Ice Lake and Sapphire Rapids nodes, either
- provide two instances of your application, each compiled for the corresponding architecture, or
- see the documentation on targeting multiple architectures.
To compile code that runs on all nodes in Fritz, compile on the front ends and either:
- let the compiler use the host system as the optimization target (see the compiler documentation for flags), or
- set Ice Lake Server as the target architecture.
The following table shows the compiler flags for targeting a certain CPU micro-architecture:
CPU | GCC, LLVM, Intel OneAPI/Classic |
---|---|
all nodes | -march=icelake-server |
Ice Lake Server | -march=icelake-server |
Sapphire Rapids | -march=sapphirerapids |
Older versions of the compilers might not know about sapphirerapids or icelake-server. If you cannot switch to a newer version, you can try targeting older architectures, in decreasing order: sapphirerapids, icelake-server, skylake-avx512, or consult your compiler's documentation.
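As an illustration with GCC, using the flags from the table above (source and binary names are placeholders):
gcc -O3 -march=icelake-server -o app.icelake app.c   # runs on all Fritz nodes
gcc -O3 -march=sapphirerapids -o app.spr app.c       # optimized for the Sapphire Rapids partitions only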
Filesystems#
On all front ends and nodes the filesystems $HOME, $HPCVAULT, and $WORK are mounted. For details see the filesystems documentation.
Node-local job-specific RAM disk $TMPDIR#
Data stored on $TMPDIR will be deleted when the job ends.
Each cluster node has a local job-specific RAM disk, reachable under $TMPDIR. All data you store there reduces the RAM available to your application on that node.
Using $TMPDIR can be beneficial if you have a lot of fine-grained I/O, e.g. frequently written log files. In this case, I/O can be performed on files located in $TMPDIR that are copied to $WORK at the end of the job.
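A minimal sketch of this pattern inside a job script (directory and file names are examples only):
cd "$TMPDIR"
./application > app.log                   # frequent small writes go to the node-local RAM disk
mkdir -p "$WORK/job_$SLURM_JOB_ID"
cp app.log "$WORK/job_$SLURM_JOB_ID/"     # copy results to $WORK before the job ends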
Parallel filesystem $FASTTMP#
The parallel filesystem $FASTTMP is mounted on all front ends and cluster nodes of Fritz. It is also accessible from Alex, but only with a lower bandwidth.
Parallel filesystem | |
---|---|
Mount point | /lustre/$GROUP/$USER/ |
Access via | $FASTTMP |
Purpose | High performance parallel I/O; short-term storage |
Capacity | 3.5 PB |
Technology | Lustre-based parallel filesystem |
Backup | No |
Data lifetime | high-watermark deletion |
Quota | number of files |
High watermark deletion
$FASTTMP is for high-performance short-term storage only. When the fill level of the filesystem exceeds a certain limit (e.g. 80%), a high-watermark deletion is run, starting with the oldest and largest files.
Intended I/O usage
$FASTTMP supports parallel I/O via the MPI-I/O functions and can be accessed with an aggregate bandwidth of > 20 GB/s (inside Fritz only). Use $FASTTMP only for large files. Ideally, the files are written by many nodes simultaneously, e.g. for checkpointing with MPI-IO.
$FASTTMP is not made for handling large numbers of small files.
Parallel filesystems achieve their speed by writing to multiple servers at the same time. Files are distributed over the servers at the granularity of blocks. On $FASTTMP, a block has a size of 1 MB. Files smaller than 1 MB reside on a single server only. The additional overhead of the parallel filesystem makes access to small files slower than on traditional NFS servers. For that reason, we have set a limit on the number of files you can store there.
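To check your current usage against the file-count quota, the standard Lustre quota query should work; treat the exact invocation as an assumption and consult the filesystems documentation if it does not:
lfs quota -u $USER $FASTTMP    # shows block usage and number of files (inodes)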
Batch processing#
Resources are controlled through the batch system Slurm. Do not run your applications on the front ends; they should only be used for compiling.
The granularity of batch allocations is complete nodes, which are allocated exclusively.
Available partitions and their properties:
partition | min – max walltime | # nodes per job | CPU cores per node | memory per node | Slurm options |
---|---|---|---|---|---|
singlenode (default) | 0 – 24:00:00 | 1 | 72 | 256 GB | |
multinode | 0 – 24:00:00 | 2 – 64 | 72 | 256 GB | |
spr1tb | 0 – 24:00:00 | 1 – 8 | 104 | 1 TB | -p spr1tb |
spr2tb | 0 – 24:00:00 | 1 – 2 | 104 | 2 TB | -p spr2tb |
big (1) | 0 – 24:00:00 | 65 – 256 | 72 | 256 GB | -p big |
(1) Available on request only.
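Batch jobs are submitted with sbatch; a minimal sketch (the script name is a placeholder):
sbatch job_script.sh              # default partition; multinode is chosen automatically for multi-node jobs
sbatch -p spr1tb job_script.sh    # explicitly request a Sapphire Rapids partition
squeue -u $USER                   # check the status of your jobs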
Interactive job#
Interactive jobs can be requested by using salloc and specifying the respective options on the command line. The environment from the calling shell, like loaded modules, will be inherited by the interactive job.
Interactive single-node job#
The following will allocate one Ice Lake node (-N 1) from the partition singlenode (the default partition) for one hour (--time=01:00:00) and provide you with an interactive shell on this node:
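A sketch of the corresponding command:
salloc -N 1 --time=01:00:00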
To allocate an interactive job with the same properties but on a Sapphire Rapids node, you also have to specify the partition --partition=spr1tb or --partition=spr2tb:
Interactive multi-node job#
The following will allocate four Ice Lake nodes (-N 4) from the multinode partition (automatically chosen when more than one node is requested) for one hour (--time=01:00:00) and provide you with an interactive shell on the first of these nodes:
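A sketch of the corresponding command:
salloc -N 4 --time=01:00:00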
To allocate an interactive job with the same properties but on Sapphire Rapids nodes, you also have to change the partition to spr1tb or spr2tb:
Batch job script examples#
MPI parallel job (single-node)#
In this example, the executable will be run on 1 node with 72 MPI processes per node.
#!/bin/bash -l
#
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=72
#SBATCH --time=2:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
srun ./application
OpenMP job (single-node)#
For more efficient computation, OpenMP threads should be pinned to the compute cores. This can be achieved with the following environment variables: OMP_PLACES=cores, OMP_PROC_BIND=true. For more information, see e.g. the HPC Wiki.
In this example, the executable will be run using 72 OpenMP threads (i.e. one per physical core) for a total job walltime of 2 hours.
#!/bin/bash -l
#
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=72
#SBATCH --time=2:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./application
Hybrid OpenMP/MPI job (single-node)#
Warning
In recent Slurm versions, the value of --cpus-per-task is no longer automatically propagated to srun, leading to errors at application start. This value has to be set manually via the variable SRUN_CPUS_PER_TASK.
In this example, the executable will be run on one node using 2 MPI processes with 36 OpenMP threads each (i.e. one per physical core) for a total job walltime of 1 hour.
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=36
#SBATCH --time=1:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# for Slurm version >22.05: cpus-per-task has to be set again for srun
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
srun ./hybrid_application
MPI parallel job (multi-node)#
In this example, the executable will be run on 4 nodes with 72 MPI processes per node.
#!/bin/bash -l
#
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=72
#SBATCH --time=2:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
srun ./application
Hybrid OpenMP/MPI job (multi-node)#
Warning
In recent Slurm versions, the value of --cpus-per-task is no longer automatically propagated to srun, leading to errors at application start. This value has to be set manually via the variable SRUN_CPUS_PER_TASK.
In this example, the executable will be run on 4 nodes using 2 MPI processes per node with 36 OpenMP threads each (i.e. one per physical core) for a total job walltime of 2 hours.
#!/bin/bash -l
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=36
#SBATCH --time=2:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# for Slurm version >22.05: cpus-per-task has to be set again for srun
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
srun ./hybrid_application
Attach to a running job#
See the general documentation on batch processing.
Remote visualization#
fviz1 can be used for remote visualization with VirtualGL. On the node you can use visualization tools and access large datasets directly from the mounted filesystems $HOME, $HPCVAULT, $WORK, and $FASTTMP.
fviz1 contains one Nvidia A16 GPU, partitioned into 4 virtual GPUs, allowing up to 4 users to use it at the same time. Each virtual GPU has 16 GB of memory. You cannot request more than one virtual GPU at the same time.
The GPU(s) in fviz1 are unsuitable for machine learning applications, and it is not permitted to use them for that purpose.
The visualization node and this documentation are still work in progress. You should expect to experience problems. Feel free to report them.
Prerequisites:
- You can connect to Fritz via SSH.
- You can submit jobs on Fritz.
- You need a VNC client/viewer on your local machine. We recommend TurboVNC, but other VNC clients/viewers may also work (though probably with worse performance).
- You need to be able to reach one port on fviz1 over the network, either by being directly within the university network, by using VPN, or by using an SSH tunnel.
For remote visualization, VirtualGL is used. On the visualization node, an X server is started that has access to the hardware acceleration of the GPUs. The resulting display output is grabbed by a VNC server and transported to the client through the VNC protocol. This means that you will need a VNC client/viewer on your end to display what the visualization node sends to you.
Access to the visualization node is through the batch system. We have prepared a few scripts to make requesting the visualization node and running the necessary job script easier: you should be able to simply call /apps/virtualgl/submitvirtualgljob.sh --time=hours:minutes:seconds, where hours/minutes/seconds specify the time limit for the job, e.g. /apps/virtualgl/submitvirtualgljob.sh --time=2:0:0 for a two-hour job.
This will queue a job that will start the server side of VirtualGL, wait for it to run, and then display instructions on how to connect to it. Here is an example output from such a job - shortened and with relevant parts highlighted:
b999dc99@fritz:~$ /apps/virtualgl/submitvirtualgljob.sh --time=2:0:0
Job has been queued, with JobID 4711 - waiting for it to start...
..............................Job seems to have started!
### Starting TaskPrologue of job 4711 on fviz1 at Fri Nov 24 01:48:47 CET 2023
[...]
Desktop 'TurboVNC: fviz1.nhr.fau.de:10 (b999dc99)' started on display fviz1.nhr.fau.de:10
One-Time Password authentication enabled. Generating initial OTP ...
Full control one-time password: 123456789
Run '/opt/TurboVNC/bin/vncpasswd -o' from within the TurboVNC session or
'/opt/TurboVNC/bin/vncpasswd -o -display :10' from within this shell
to generate additional OTPs
[...]
When you're done, don't forget to properly end the VNC session, or cancel it with
scancel 4711
If you are within the university network or connected via VPN, you can simply tell your VNC client/viewer to connect to the display in the message (in the example above that would be fviz1.nhr.fau.de:10). If you are not, you need to create an SSH tunnel. For this we recommend using the dialog server csnhr. To calculate the port, add 5900 to the number after the colon; in the above example that number is 10, so the correct port would be 5910. Start an SSH tunnel with something like ssh -L 5910:fviz1.nhr.fau.de:5910 yourusername@csnhr.nhr.fau.de. You can then direct your VNC client/viewer to connect to the display localhost:10.
When asked for the password, enter the password from the line Full control one-time password (in the example above, that would be 123456789).
You should now see a remote desktop running on fviz1. Any applications you start there should see a graphics card with 3D acceleration available and be generally usable (although you should not expect miracles). If things do not run smoothly, make sure to use TurboVNC as a client.
When you're done, please exit the remote desktop properly by clicking log out on the remote desktop, and don't just close the VNC window, as that would leave the desktop running on the server until the time limit is reached, blocking the GPU for use by other users.
Further information#
Performance#
Measured LINPACK performance of
- 1.84 PFlop/s on 512 nodes in April 2022,
- 2.233 PFlop/s on 612 nodes in May 2022 resulting in place 323 of the June 2022 Top500 list, and
- 3.578 PFlop/s on 986 nodes in November 2022 resulting in place 151 of the November 2022 Top500 list.
Nodes and processor details#
Nodes overview:
partition | singlenode, multinode | spr1tb | spr2tb |
---|---|---|---|
no. of nodes | 992 | 48 | 16 |
processors | 2 x Intel Xeon Platinum 8360Y | 2 x Intel Xeon Platinum 8470 | 2 x Intel Xeon Platinum 8470 |
Microarchitecture | Ice Lake | Sapphire Rapids | Sapphire Rapids |
no. of cores | 2 x 36 = 72 | 2 x 52 = 104 | 2 x 52 = 104 |
base frequency | 2.4 GHz | 2.0 GHz | 2.0 GHz |
L3 cache | 2 x 54 MB = 108 MB | 2 x 105 MB = 210 MB | 2 x 105 MB = 210 MB |
memory | 256 GB | 1 TB | 2 TB |
NUMA LDs | 4 | 8 | 8 |
All nodes have SMT disabled and sub-NUMA clustering enabled.
Processors used:
partition | singlenode, multinode | spr1tb, spr2tb |
---|---|---|
processor | Intel Xeon Platinum 8360Y | Intel Xeon Platinum 8470 |
Microarchitecture | Ice Lake | Sapphire Rapids |
cores (SMT threads) | 36 (72) | 52 (104) |
SMT | disabled | disabled |
max Turbo frequency (1) | 3.5 GHz | 3.8 GHz |
base frequency (1) | 2.4 GHz | 2.0 GHz |
L1 data cache per core | 48 KB | 48 KB |
L2 cache per core | 1280 KB | 2048 KB |
last level cache (L3) | 54 MB | 105 MB |
memory channels/type per node | 16 x 16 GB DDR4-3200 | 16 x 64/128 GB DDR5-4800 |
no. of UPI links | 3 | 4 |
TDP | 250 W | 350 W |
instruction set extensions | SSE, AVX, AVX2, AVX-512 | SSE, AVX, AVX2, AVX-512 |
no. of AVX-512 FMA units | 2 | 2 |
Intel ARK | link | link |
(1) Turbo and base frequency can be lower when executing AVX(2) and AVX-512 instructions.
Network topology#
Fritz uses blocking HDR100 InfiniBand with up to 100 GBit/s bandwidth per link and direction. There are islands with 64 nodes (i.e. 4,608 cores). The blocking factor between islands is 1:4.
Fritz uses unmanaged 40-port Mellanox HDR switches. 8 HDR200 links per edge switch are connected to the spine level. Using splitter cables, 64 compute nodes are connected with HDR100 to each edge switch. This results in a 1:4 blocking of the fat tree. Each island with 64 nodes has a total of 4,608 cores. Slurm is aware of the topology, but minimizing the number of switches per job does not have a high priority.
Direct liquid cooling (DLC) of the compute nodes#
Ice Lake nodes:
- Introducing the Intel Server System D50TNP Family
- Intel Server D50TNP Family Integration and Service Guide
- Intel Server D50TNP family - Configuration Guide
- Intel Server D50TNP family - Technical Product Specification
Sapphire Rapids nodes:
- Intel Server System D50DNP1MHCPLC Compute Module
- Intel Server D50DNP Family Integration and Service Guide 1.0
- Intel Server D50DNP Family Technical Product Specification 1.0
Name#
The name "Fritz" is a play with the name of FAU's founder Friedrich, Margrave of Brandenburg-Bayreuth (1711-1763).
Financing#
Fritz has been financed by:
- German Research Foundation (DFG) as part of INST 90/1171-1 (440719683)
- NHR funding of federal and state authorities (BMBF and Bavarian State Ministry of Science and the Arts, respectively)
- eight of the Sapphire Rapids nodes are dedicated to HS Coburg as part of the BMBF proposal "HPC4AAI" within the call "KI-Nachwuchs@FH"
- financial support of FAU to strengthen HPC activities