Woody#
The Woody cluster consists of nodes with different generations of Intel CPUs and is used for throughput computing.
Hostnames | # nodes | CPUs and # cores per node | Main memory per node | node-local SSD | Slurm partition | Slurm constraint |
---|---|---|---|---|---|---|
w12xx, w13xx | 64 | 1 x Intel Xeon E3-1240 v5 ("Skylake"), 4 cores @3.5 GHz | 32 GB | 1 TB | work | sl |
w14xx, w15xx | 112 | 1 x Intel Xeon E3-1240 v6 ("Kaby Lake"), 4 cores @3.7 GHz | 32 GB | 900 GB | work | kl |
w22xx - w25xx (1) | 110 | 2 x Intel Xeon Gold 6326 ("Ice Lake"), 2 x 16 cores @2.9 GHz | 256 GB | 1.8 TB | work | icx |
(1) 40 nodes are financed by and dedicated to ECAP. 40 nodes are financed by and belong to a project from the Physics department.
Accessing Woody#
Woody is a Tier3 resource serving FAU's basic needs. Therefore, NHR accounts are not enabled by default.
See configuring connection settings or SSH in general for configuring your SSH connection.
If successfully configured, Woody can be accessed via SSH by:
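For example, assuming woody.nhr.fau.de as the SSH host name (check the connection settings documentation for the exact address configured for your account):
ssh woody.nhr.fau.de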
You will then be redirected to one of the login nodes.
Software#
Woody runs AlmaLinux 8, which is binary compatible with Red Hat Enterprise Linux 8.
All software on NHR@FAU systems, e.g. (commercial) applications, compilers, and libraries, is provided using environment modules. These modules are used to set up a custom environment when working interactively or inside batch jobs.
For available software see:
Most software is centrally installed using Spack. By default, only a subset of packages installed via Spack is shown. To see all installed packages, load the 000-all-spack-pkgs module.
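For example, to make all Spack-installed packages visible to the module system:
module load 000-all-spack-pkgs
module avail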
You can install software yourself by using the user-spack functionality.
Containers, e.g. Docker, are supported via Apptainer.
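A minimal sketch of running a container image through Apptainer (the Docker image name is only an example):
apptainer pull docker://almalinux:8
apptainer exec almalinux_8.sif cat /etc/os-release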
Python, conda, conda environments#
Through the python module, a Conda installation is available. See our Python documentation for usage, initialization, and working with conda environments.
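For example, to make conda available in a shell (the module name is taken from above; consult the Python documentation before creating or activating environments):
module load python
conda env list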
Compiler#
For a general overview of compilers, optimization flags, and targeting a certain CPU micro-architecture, see the compiler documentation.
The CPU types on the frontend node and in the partitions are different. The compilation flags should be adjusted according to the partition you plan to run your code on. See the following table for details.
Only nodes with Slurm constraint icx support AVX512.
- Code compiled with AVX512 support will not run on other nodes and will result in Illegal Instruction errors.
- The frontend nodes support AVX512. Code compiled on these nodes with -march=native or -xHost will therefore require AVX512.
- Regarding AVX512 performance, see our optimization notices.
The following table shows the compiler flags for targeting Woody's CPUs:
Slurm constraint | microarchitecture | GCC/LLVM | Intel oneAPI/Classic |
---|---|---|---|
none | Skylake, Kaby Lake, Ice Lake | -march=skylake or -march=x86-64-v3 | -march=skylake (1) |
icx | Ice Lake | -march=icelake-server | -march=icelake-server |
sl | Skylake | -march=skylake | -march=skylake |
kl | Kaby Lake | -march=skylake | -march=skylake |
(1) See Targeting multiple architectures on how to generate optimized code for each micro-architecture.
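A build sketch based on the table above (source and binary names are placeholders):
# portable binary that runs on all Woody node types
gcc -O3 -march=skylake -o application application.c
# binary restricted to the Ice Lake (icx) nodes, enabling AVX512
gcc -O3 -march=icelake-server -o application_icx application.c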
Filesystems#
On all front ends and nodes the filesystems $HOME, $HPCVAULT, and $WORK are mounted.
For details see the filesystems documentation.
Node local SSD $TMPDIR#
Data stored on $TMPDIR will be deleted when the job ends.
Each cluster node has a local SSD that is reachable under $TMPDIR.
For more information on how to use $TMPDIR see:
- general documentation of $TMPDIR,
- staging data, e.g. to speed up training (a short sketch follows below),
- sharing data among jobs on a node.
The storage space of the SSD is shared among all jobs on a node. Hence, you might not have access to the full capacity of the SSD.
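A sketch of staging input data to the node-local SSD inside a job script (paths and file names are placeholders):
# copy input data to the fast node-local SSD and unpack it there
cp "$WORK/dataset.tar" "$TMPDIR"
cd "$TMPDIR"
tar xf dataset.tar
# run the application on the staged data
"$WORK/application" "$TMPDIR/dataset"
# $TMPDIR is deleted when the job ends, so copy results back to a permanent filesystem
cp results.dat "$WORK"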
Batch processing#
Resources are controlled through the batch system Slurm.
Only single-node jobs are allowed.
Compute nodes are shared; however, requested resources are always granted exclusively. The granularity of batch allocations is individual cores. For each requested core, 7.75 GB of main memory are allocated; a job requesting 4 cores, for example, is allotted 4 x 7.75 GB = 31 GB.
Partition | Slurm options | matching nodes | min – max walltime | min – max cores | memory per node | memory per core |
---|---|---|---|---|---|---|
work | (1) | nodes reserved for short jobs | 0 – 2:00:00 | 1 – 32 | 32-256 GB | 7.75 GB |
work (default) | | all | 0 – 24:00:00 | 1 – 32 | 32-256 GB | 7.75 GB |
work | --constraint=sl | w12xx, w13xx | 0 – 24:00:00 | 1 – 4 | 32 GB | 7.75 GB |
work | --constraint=kl | w14xx, w15xx | 0 – 24:00:00 | 1 – 4 | 32 GB | 7.75 GB |
work | --constraint=icx | w2xxx | 0 – 24:00:00 | 1 – 32 | 256 GB | 7.75 GB |
(1) Nodes of all three CPU types are reserved for short jobs. Use the --constraint flag to select a specific CPU type.
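For example, to run a job on the Ice Lake nodes only (the script name is a placeholder):
sbatch --constraint=icx job_script.sh
The same constraint can also be set inside the job script with an #SBATCH --constraint=icx line.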
Interactive job#
Interactive job (single-core)#
Interactive jobs can be requested by using salloc instead of sbatch and specifying the respective options on the command line.
The environment from the calling shell, like loaded modules, will be inherited by the interactive job.
The following will give you an interactive shell on one node with one core dedicated to you for one hour:
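salloc --ntasks=1 --time=01:00:00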
Batch job script examples#
Serial job (single-core)#
In this example, the executable will be run using a single core for a total job walltime of 1 hour.
#!/bin/bash -l
#
#SBATCH --ntasks=1
#SBATCH --time=1:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
./application
MPI parallel job (single-node)#
In this example, the executable will be run on 1 node with 4 MPI processes per node.
#!/bin/bash -l
#
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --time=2:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
srun ./application
OpenMP job (single-node)#
For more efficient computation, OpenMP threads should be pinned to the compute cores. This can be achieved by setting the following environment variables: OMP_PLACES=cores and OMP_PROC_BIND=true. For more information, see e.g. the HPC Wiki.
In this example, the executable will be run using 4 OpenMP threads (i.e. one per physical core) for a total job walltime of 2 hours.
#!/bin/bash -l
#
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --time=2:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
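# pin OpenMP threads to physical cores (cf. the note above)
export OMP_PLACES=cores
export OMP_PROC_BIND=true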
./application
Hybrid OpenMP/MPI job (single-node)#
Warning
In recent Slurm versions, the value of --cpus-per-task is no longer automatically propagated to srun, which can lead to errors at application start. The value has to be set manually via the variable SRUN_CPUS_PER_TASK.
In this example, the executable will be run on one node using 2 MPI processes with 16 OpenMP threads each (i.e. one per physical core) for a total job walltime of 1 hour.
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=16
#SBATCH --time=1:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# for Slurm version >22.05: cpus-per-task has to be set again for srun
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
srun ./hybrid_application
Attach to a running job#
See the general documentation on batch processing.
Further information#
Performance characteristics of the node types: LINPACK (depending on the instruction set used), memory bandwidth, and per-core resources:
Nodes with Slurm constraint | --constraint=sl (Intel Xeon E3-1240 v5, "Skylake") | --constraint=kl (Intel Xeon E3-1240 v6, "Kaby Lake") | --constraint=icx (Intel Xeon Gold 6326, "Ice Lake") |
---|---|---|---|
single core HPL with AVX512 | n/a | n/a | 92.8 GFlop/s |
single core HPL with AVX2 | 58.6 GFlop/s | 61.4 GFlop/s | 49.2 GFlop/s |
single core HPL with AVX | 30.2 GFlop/s | 31.6 GFlop/s | 25.6 GFlop/s |
single core HPL with SSE4.2 | 15.3 GFlop/s | 16.0 GFlop/s | 13.7 GFlop/s |
single core memory bandwidth (stream triad) | 20.2 GB/s (25.8 GB/s with NT stores) | 21.9 GB/s (27.6 GB/s with NT stores) | 15.8 GB/s (21.0 GB/s with NT stores) |
throughput HPL per 4 cores with AVX512 | n/a | n/a | 231 GFlop/s |
throughput HPL per 4 cores with AVX2 | 206 GFlop/s | 219 GFlop/s | 145 GFlop/s |
throughput HPL per 4 cores with AVX | 112 GFlop/s | 118 GFlop/s | 90 GFlop/s |
throughput HPL per 4 cores with SSE4.2 | 57 GFlop/s | 60 GFlop/s | 48 GFlop/s |
throughput memory bandwidth per 4 cores (stream triad) | 21.2 GB/s (28.7 GB/s with NT stores) | 21.9 GB/s (27.6 GB/s with NT stores) | 29 GB/s (31.0 GB/s with NT stores) |
theoretical Ethernet throughput per 4 cores | 1 GBit/s | 1 GBit/s | 3 GBit/s |
HDD / SSD space per 4 cores | 900 GB | 870 GB | 210 GB |
The performance of the new Ice Lake nodes is only better in the High Performance LINPACK (HPL) if AVX512 instructions are used! When a node is saturated in throughput mode, the memory bandwidth per (four) cores is only slightly higher.