Woody#
The Woody cluster consists of nodes with different generations of Intel CPUs and is used for throughput computing.
Hostnames | # nodes | CPUs and # cores per node | Main memory per node | node-local SSD | Slurm partition | Slurm constraint |
---|---|---|---|---|---|---|
w12xx, w13xx | 64 | 1 x Intel Xeon E3-1240 v5 ("Skylake"), 4 cores @3.5 GHz | 32 GB | 1 TB | work | sl |
w14xx, w15xx | 112 | 1 x Intel Xeon E3-1240 v6 ("Kaby Lake"), 4 cores @3.7 GHz | 32 GB | 900 GB | work | kl |
w22xx - w25xx (1) | 110 | 2 x Intel Xeon Gold 6326 ("Ice Lake"), 2 x 16 cores @2.9 GHz | 256 GB | 1.8 TB | work | icx |
(1) 40 nodes are financed by and dedicated to ECAP. 40 nodes are financed by and belong to a project from the Physics department.
Accessing Woody#
Woody is a Tier3 resource serving FAU's basic needs. Therefore, NHR accounts are not enabled by default.
See configuring connection settings or SSH in general for configuring your SSH connection.
If successfully configured, Woody can be accessed via SSH by:
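For example, assuming woody.nhr.fau.de as the SSH host name (check the connection settings documentation for the exact address configured for your account):
ssh woody.nhr.fau.de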
You will then be redirected to one of the login nodes.
Software#
Woody runs AlmaLinux 8, which is binary compatible with Red Hat Enterprise Linux 8.
All software on NHR@FAU systems, e.g. (commercial) applications, compilers, and libraries, is provided using environment modules. These modules are used to set up a custom environment when working interactively or inside batch jobs.
For available software see:
Most software is centrally installed using Spack. By default, only a subset of packages installed via Spack is shown. To see all installed packages, load the 000-all-spack-pkgs module.
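For example, to make all Spack-installed packages visible to the module system:
module load 000-all-spack-pkgs
module avail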
You can install software yourself by using the user-spack functionality.
Containers, e.g. Docker, are supported via Apptainer.
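A minimal sketch of running a container image through Apptainer (the Docker image name is only an example):
apptainer pull docker://almalinux:8
apptainer exec almalinux_8.sif cat /etc/os-release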
Python, conda, conda environments#
Through the python module, a Conda installation is available. See our Python documentation for usage, initialization, and working with conda environments.
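For example, to make conda available in a shell (the module name is taken from above; consult the Python documentation before creating or activating environments):
module load python
conda env list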
Compiler#
For a general overview of compilers, optimization flags, and targeting a certain CPU micro-architecture, see the compiler documentation.
The CPU types on the frontend node and in the partitions are different. The compilation flags should be adjusted according to the partition you plan to run your code on. See the following table for details.
Only nodes with Slurm constraint icx support AVX512.
- Code compiled with AVX512 support will not run on other nodes and will result in Illegal Instruction errors.
- The frontend nodes support AVX512. Code compiled on these nodes with -march=native or -xHost will therefore require AVX512.
- Regarding AVX512 performance, see our optimization notices.
The following table shows the compiler flags for targeting Woody's CPUs:
Slurm constraint | microarchitecture | GCC/LLVM | Intel oneAPI/Classic |
---|---|---|---|
none | Skylake, Kaby Lake, Ice Lake | -march=skylake or -march=x86-64-v3 | -march=skylake (1) |
icx | Ice Lake | -march=icelake-server | -march=icelake-server |
sl | Skylake | -march=skylake | -march=skylake |
kl | Kaby Lake | -march=skylake | -march=skylake |
(1) See Targeting multiple architectures on how to generate optimized code for each micro-architecture.
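A build sketch based on the table above (source and binary names are placeholders):
# portable binary that runs on all Woody node types
gcc -O3 -march=skylake -o application application.c
# binary restricted to the Ice Lake (icx) nodes, enabling AVX512
gcc -O3 -march=icelake-server -o application_icx application.c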
Filesystems#
On all front ends and nodes the filesystems $HOME, $HPCVAULT, and $WORK are mounted.
For details see the filesystems documentation.
Node local SSD $TMPDIR#
Data stored on $TMPDIR will be deleted when the job ends.
Each cluster node has a local SSD that is reachable under $TMPDIR.
For more information on how to use $TMPDIR see:
- general documentation of $TMPDIR,
- staging data, e.g. to speed up training (a short sketch follows below),
- sharing data among jobs on a node.
The storage space of the SSD is shared among all jobs on a node. Hence, you might not have access to the full capacity of the SSD.
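A sketch of staging input data to the node-local SSD inside a job script (paths and file names are placeholders):
# copy input data to the fast node-local SSD and unpack it there
cp "$WORK/dataset.tar" "$TMPDIR"
cd "$TMPDIR"
tar xf dataset.tar
# run the application on the staged data
"$WORK/application" "$TMPDIR/dataset"
# $TMPDIR is deleted when the job ends, so copy results back to a permanent filesystem
cp results.dat "$WORK"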
Batch processing#
Resources are controlled through the batch system Slurm.
Only single-node jobs are allowed.
Compute nodes are shared; however, requested resources are always granted exclusively. The granularity of batch allocations is individual cores. For each requested core, 7.75 GB of main memory are allocated; a job requesting 4 cores, for example, is allotted 4 x 7.75 GB = 31 GB.
Partition | Slurm options | matching nodes | min – max walltime | min – max cores | memory per node | memory per core |
---|---|---|---|---|---|---|
work | (1) | nodes reserved for short jobs | 0 – 2:00:00 | 1 – 32 | 32-256 GB | 7.75 GB |
work (default) | | all | 0 – 24:00:00 | 1 – 32 | 32-256 GB | 7.75 GB |
work | --constraint=sl | w12xx, w13xx | 0 – 24:00:00 | 1 – 4 | 32 GB | 7.75 GB |
work | --constraint=kl | w14xx, w15xx | 0 – 24:00:00 | 1 – 4 | 32 GB | 7.75 GB |
work | --constraint=icx | w2xxx | 0 – 24:00:00 | 1 – 32 | 256 GB | 7.75 GB |
(1) Nodes of all three CPU types are reserved for short jobs. Use the --constraint flag to select a specific CPU type.
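For example, to run a job on the Ice Lake nodes only (the script name is a placeholder):
sbatch --constraint=icx job_script.sh
The same constraint can also be set inside the job script with an #SBATCH --constraint=icx line.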
Interactive job#
Interactive job (single-core)#
Interactive jobs can be requested by using salloc instead of sbatch and specifying the respective options on the command line.
The environment from the calling shell, like loaded modules, will be inherited by the interactive job.
The following will give you an interactive shell on one node with one core dedicated to you for one hour:
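salloc --ntasks=1 --time=01:00:00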
Batch job script examples#
Serial job (single-core)#
In this example, the executable will be run using a single core for a total job walltime of 1 hour.
#!/bin/bash -l
#
#SBATCH --ntasks=1
#SBATCH --time=1:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
./application
MPI parallel job (single-node)#
In this example, the executable will be run on 1 node with 4 MPI processes per node.
#!/bin/bash -l
#
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --time=2:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
srun ./application
OpenMP job (single-node)#
For more efficient computation, OpenMP threads should be pinned to the compute cores. This can be achieved by setting the following environment variables: OMP_PLACES=cores and OMP_PROC_BIND=true. For more information, see e.g. the HPC Wiki.
In this example, the executable will be run using 4 OpenMP threads (i.e. one per physical core) for a total job walltime of 2 hours.
#!/bin/bash -l
#
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=2:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
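# pin OpenMP threads to physical cores (cf. the note above)
export OMP_PLACES=cores
export OMP_PROC_BIND=true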
./application
Hybrid OpenMP/MPI job (single-node)#
Warning
In recent Slurm versions, the value of --cpus-per-task is no longer automatically propagated to srun, which can lead to errors at application start. The value has to be set manually via the variable SRUN_CPUS_PER_TASK.
In this example, the executable will be run on one node using 2 MPI processes with 16 OpenMP threads each (i.e. one per physical core) for a total job walltime of 1 hour.
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=16
#SBATCH --time=1:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# for Slurm version >22.05: cpus-per-task has to be set again for srun
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
srun ./hybrid_application
Attach to a running job#
See the general documentation on batch processing.
Further information#
Performance characteristics of the node types: LINPACK (depending on the instruction set used), memory bandwidth, and per-core resources:
Nodes with Slurm constraint | --constraint=sl (Intel Xeon E3-1240 v5, "Skylake") | --constraint=kl (Intel Xeon E3-1240 v6, "Kaby Lake") | --constraint=icx (Intel Xeon Gold 6326, "Ice Lake") |
---|---|---|---|
single core HPL with AVX512 | n/a | n/a | 92.8 GFlop/s |
single core HPL with AVX2 | 58.6 GFlop/s | 61.4 GFlop/s | 49.2 GFlop/s |
single core HPL with AVX | 30.2 GFlop/s | 31.6 GFlop/s | 25.6 GFlop/s |
single core HPL with SSE4.2 | 15.3 GFlop/s | 16.0 GFlop/s | 13.7 GFlop/s |
single core memory bandwidth (stream triad) | 20.2 GB/s (25.8 GB/s with NT stores) | 21.9 GB/s (27.6 GB/s with NT stores) | 15.8 GB/s (21.0 GB/s with NT stores) |
throughput HPL per 4 cores with AVX512 | n/a | n/a | 231 GFlop/s |
throughput HPL per 4 cores with AVX2 | 206 GFlop/s | 219 GFlop/s | 145 GFlop/s |
throughput HPL per 4 cores with AVX | 112 GFlop/s | 118 GFlop/s | 90 GFlop/s |
throughput HPL per 4 cores with SSE4.2 | 57 GFlop/s | 60 GFlop/s | 48 GFlop/s |
throughput memory bandwidth per 4 cores (stream triad) | 21.2 GB/s (28.7 GB/s with NT stores) | 21.9 GB/s (27.6 GB/s with NT stores) | 29 GB/s (31.0 GB/s with NT stores) |
theoretical Ethernet throughput per 4 cores | 1 GBit/s | 1 GBit/s | 3 GBit/s |
HDD / SSD space per 4 cores | 900 GB | 870 GB | 210 GB |
The performance of the new Ice Lake nodes is only better in the High Performance LINPACK (HPL) if AVX512 instructions are used! When a node is saturated in throughput mode, the memory bandwidth per (four) cores is only slightly higher.