
Meggie#

The FAU's Meggie cluster (manufacturer: Megware) is a high-performance compute resource with high-speed interconnect. It is intended for distributed-memory (MPI) or hybrid parallel programs with medium to high communication requirements.

  • 728 compute nodes, each with two Intel Xeon E5-2630v4 "Broadwell" chips (10 cores per chip) running at 2.2 GHz with 25 MB Shared Cache per chip and 64 GB of RAM.
  • 2 frontend nodes with the same CPUs as the compute nodes but 128 GB of RAM.
  • Lustre-based parallel filesystem with a capacity of almost 1 PB and an aggregated parallel I/O bandwidth of > 9000 MB/s.
  • Intel Omni-Path interconnect with up to 100 GBit/s bandwidth per link and direction.
  • Measured LINPACK performance of ~481 TFlop/s.

The name "Meggie" is a play with the name of the manufacturer.

Meggie is designed for running parallel programs that use significantly more than one node. Jobs using less than one full node are not supported by RRZE and may be killed without notice.

Access to the machine#

Meggie is a Tier3 resource serving FAU's basic needs. Therefore, NHR accounts are not enabled by default.

See configuring connection settings or SSH in general for how to set up your SSH connection.

If successfully configured, Meggie can be accessed via SSH by:

ssh meggie.rrze.fau.de

You will then be redirected to one of the login nodes.
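For example, a minimal entry in ~/.ssh/config could look like this; the host alias and the user name are placeholders you need to adapt:

Host meggie
    HostName meggie.rrze.fau.de
    User <your_hpc_account>

With such an entry, ssh meggie is sufficient to connect.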

Software environment#

Meggie runs AlmaLinux 8, which is binary compatible with Red Hat Enterprise Linux 8.

All software on NHR@FAU systems, e.g. (commercial) applications, compilers, and libraries, is provided using environment modules. These modules are used to set up a custom environment when working interactively or inside batch jobs.

Available software can be listed directly on the cluster with module avail.

Most software is centrally installed using Spack. By default, only a subset of packages installed via Spack is shown. To see all installed packages, load the 000-all-spack-pkgs module. You can install software yourself by using the user-spack functionality.
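For example, on a login node (package names other than 000-all-spack-pkgs are placeholders):

module avail                     # list the default subset of provided modules
module load 000-all-spack-pkgs   # make all Spack-installed packages visible
module avail                     # now also lists the packages hidden by default
module load <package>/<version>  # load a specific package into the environment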

Containers, e.g. Docker, are supported via Apptainer.
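A typical workflow converts a Docker image into a local Apptainer image and runs commands inside it (the image used here is only an example):

apptainer pull ubuntu.sif docker://ubuntu:22.04   # fetch a Docker image and convert it to a SIF file
apptainer exec ubuntu.sif cat /etc/os-release     # run a command inside the container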

Gromacs#

We provide Gromacs versions with and without PLUMED. Gromacs (and PLUMED) are built using Spack.

Gromacs often delivers the most economical performance when GPGPUs are used; therefore, the Alex cluster might be a better choice.

When running on Meggie, it is mandatory in most cases to optimize the number of PME processes experimentally. This can be done with gmx tune_pme; note that tune_pme has to be run through a non-MPI gmx binary, which then launches the MPI-parallel mdrun benchmarks.

Do not start gmx mdrun with the option -v. The verbose output only creates unnecessarily large Slurm stdout files, and your jobs will suffer when the NFS servers are under high load. Watching the stdout continuously is also of very limited use, since the job is expected to run for the specified number of steps anyway.
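As a rough sketch, a single-node Gromacs job could look like the following; the module name, the gmx_mpi binary name, the input file, the walltime, and the -npme value are assumptions or placeholders that you should adapt (check module avail for the exact module name):

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=20
#SBATCH --time=06:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV
module load gromacs

# -npme sets the number of separate PME ranks; vary it experimentally as described above.
# Do not add -v (see the note above).
srun gmx_mpi mdrun -s topol.tpr -npme 4 -maxh 5.9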

Parallel file system $FASTTMP#

Due to numerous hardware and software problems, the parallel file system $FASTTMP on Meggie was shut down in December 2022.

Batch processing#

Compute resources are controlled through the batch system Slurm. The front ends may only be used for compiling and very short serial test runs.

The granularity of batch allocations is complete nodes, i.e. nodes are never shared. The following partitions are available on this cluster:

Partition | min -- max walltime | min -- max nodes | availability   | comments
devel     | 0 -- 01:00:00       | 1 -- 8           | all users      | higher priority
work      | 0 -- 24:00:00       | 1 -- 64          | all users      | default partition
big       | 0 -- 24:00:00       | 1 -- 256         | special users  | available on request only

To take advantage of the higher priority devel partition, you have to explicitly specify this in your job script via --partition=devel.
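For example, the following job script header requests two nodes in the devel partition for 30 minutes (the resource values are placeholders):

#!/bin/bash -l
#SBATCH --partition=devel
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=20
#SBATCH --time=00:30:00
#SBATCH --export=NONE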

Eligible jobs in the devel and work partitions will automatically take advantage of the nodes reserved for short running jobs.

Interactive jobs#

Interactive job (single-node)#

Interactive jobs can be requested by using salloc instead of sbatch and specifying the respective options on the command line.

The following will give you an interactive shell on one node for one hour:

salloc -N 1 --time=01:00:00

Settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!

Interactive job (multi-node)#

Interactive jobs can be requested by using salloc instead of sbatch and specifying the respective options on the command line.

The following will give you four nodes with an interactive shell on the first node for one hour:

salloc -N 4 --time=01:00:00

Settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!

Batch jobs#

MPI parallel job (single-node)#

In this example, the executable will be run on one node, using 20 MPI processes, i.e. one per physical core.

#!/bin/bash -l
#SBATCH --nodes=1 
#SBATCH --ntasks-per-node=20
#SBATCH --time=01:00:00
#SBATCH --export=NONE

# --export=NONE sets SLURM_EXPORT_ENV=NONE inside the job; unset it so that srun
# passes the environment prepared below (e.g. loaded modules) on to the tasks
unset SLURM_EXPORT_ENV
module load XXX 

srun ./mpi_application

OpenMP job (single-node)#

In this example, the executable will be run using 20 OpenMP threads (i.e. one per physical core) for a total job walltime of 1 hour.

For more efficient computation, OpenMP threads should be pinned to the compute cores. This can be achieved by the following environment variables: OMP_PLACES=cores, OMP_PROC_BIND=true. For more information, see e.g. the HPC Wiki.

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --time=01:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV 
module load XXX 

# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK 
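# pin one OpenMP thread to each physical core, as recommended above
export OMP_PLACES=cores
export OMP_PROC_BIND=true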
./openmp_application

Hybrid OpenMP/MPI job (single-node)#

In this example, the executable will be run using 2 MPI processes with 10 OpenMP threads (i.e. one per physical core) for a total job walltime of 1 hour.

For more efficient computation, OpenMP threads should be pinned to the compute cores. This can be achieved by the following environment variables: OMP_PLACES=cores, OMP_PROC_BIND=true. For more information, see e.g. the HPC Wiki.

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=10
#SBATCH --time=1:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV 
module load XXX 

# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK 
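# pin one OpenMP thread to each physical core, as recommended above
export OMP_PLACES=cores
export OMP_PROC_BIND=true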
srun ./hybrid_application

MPI parallel job (multi-node)#

In this example, the executable will be run on four nodes, using 20 MPI processes per node, i.e. one per physical core.

#!/bin/bash -l
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=20
#SBATCH --time=01:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV 
module load XXX 

srun ./mpi_application

Hybrid OpenMP/MPI job (multi-node)#

In this example, the executable will be run on four nodes, using 2 MPI processes per node and 10 OpenMP threads per process (i.e. one per physical core), for a total job walltime of 1 hour.

For more efficient computation, OpenMP threads should be pinned to the compute cores. This can be achieved by the following environment variables: OMP_PLACES=cores, OMP_PROC_BIND=true. For more information, see e.g. the HPC Wiki.

#!/bin/bash -l
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=10
#SBATCH --time=01:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV 
module load XXX 

# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK 
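# pin one OpenMP thread to each physical core, as recommended above
export OMP_PLACES=cores
export OMP_PROC_BIND=true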
srun ./hybrid_application

Further Information#

Intel Xeon E5-2630v4 "Broadwell" Processor#

Clock speed                        | Base: 2.2 GHz, Turbo (1 core): 3.1 GHz, Turbo (all cores): 2.4 GHz
Number of cores                    | 10 per socket
L1 cache                           | 32 KiB per core (private)
L2 cache                           | 256 KiB per core (private)
L3 cache                           | 2.5 MiB per core (shared by all cores)
Peak performance @ base frequency  | 35.2 GFlop/s per core (16 flops/cy)
Supported SIMD extension           | AVX2 with FMA
STREAM triad bandwidth per socket  | 53.5 GB/s (standard stores; corrected for write-allocate transfers)

Full technical details about the Xeon E5-2630v4 processor.

Intel Omni-Path Interconnect#

Omni-Path is essentially Intel's proprietary implementation of InfiniBand, created after they acquired the InfiniBand part of QLogic. It shares most of the features and shortcomings of QLogic-based InfiniBand networks.

Each node in Meggie has a 100 GBit Omni-Path card and is connected to a 100 GBit switch. However, the backbone of the network is not fully non-blocking: on each leaf switch, 32 of the 48 ports are used for compute nodes and 16 ports for the uplink, so the backbone is oversubscribed by a factor of two. As a result, if the nodes of your job are not all connected to the same leaf switch, you may notice significant performance fluctuations due to the oversubscribed network. The batch system tries to place jobs on the same leaf switch if possible, but for obvious reasons that is not always feasible, and for jobs using more than 32 nodes it is outright impossible.