
Compiler#

This page provides an overview of the installed compilers and their usage on our clusters.

Overview#

| compiler | GCC | LLVM | Intel Classic | Intel oneAPI | NVHPC | Nvidia CUDA |
|---|---|---|---|---|---|---|
| environment module | gcc | llvm | intel | intel | nvhpc | cuda |
| C | gcc | clang | icc | icx | nvc | nvcc |
| C++ | g++ | clang++ | icpc | icpx | nvc++ | nvcc |
| Fortran | gfortran | flang | ifort | ifx | nvfortran | |
| optimize for current host | -march=native | -march=native | Intel host: -march=native or -xHost, AMD host: -mfma -mavx2 | Intel host: -march=native or -xHost, AMD host: -mfma -mavx2 | -tp=native | GPU: -arch=native, CPU: -Xcompiler -march=native (1) |
| enable OpenMP | -fopenmp | -fopenmp | -qopenmp | -qopenmp or -fiopenmp (2) | -mp | t.b.d. |
| vendor documentation | GCC | Clang | Intel (3) | Intel (3) | Nvidia | Nvidia |

Notes:

  • (1) -Xcompiler passes the following option to the host compiler. Adapt -march=native to the flag your host compiler uses to automatically target the host's CPU type.

  • (2) The compiler also recognizes the deprecated option -fopenmp. However, -fopenmp enables the LLVM OpenMP runtime instead of the Intel OpenMP runtime (with possible Intel extensions).

  • (3) Intel does not provide stable links to the latest documentation, so you have to search for it yourself in the Documentation Library.
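For example, a compile with GCC that optimizes for the current host and enables OpenMP might look like this (the source and output file names are only placeholders):

gcc -O3 -march=native -fopenmp -o myapp myapp.c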

Missing compiler module or version#

If a compiler module is missing or the version you need is not available, you can either

  • install the compiler locally, e.g. compile it manually or install it via Spack (see the example below), or
  • contact us at hpc-support@fau.de and request an installation on the whole cluster (we will decide if it makes sense)
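A local installation via Spack might look like this (the compiler and version are only an example):

spack install gcc@13.2.0
spack load gcc@13.2.0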

Compiling for architectures supporting AVX-512#

The frequency of Intel processors can decrease when AVX(2) or AVX-512 instructions are used. Typically, AVX-512 floating-point instructions cause a larger frequency reduction than AVX(2) instructions.

For this reason, compilers might generate code that only uses the lower half of the AVX-512 registers (ymm part) instead of their full width (zmm). To force GCC, Clang, and the Intel Classic/oneAPI compilers to use the full zmm registers, use the flags:

  • GCC, Clang, Intel oneAPI: -mprefer-vector-width=512
  • Intel Classic: -qopt-zmm-usage=high
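For example, with GCC (the target architecture and file names are only placeholders):

gcc -O3 -march=icelake-server -mprefer-vector-width=512 -o myapp myapp.c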

In general, we cannot recommend which instruction set to target or whether using full-width registers is beneficial on an AVX-512-capable system. You have to benchmark your own application.

Optimization flags#

All compilers support the optimization level -O3, which typically enables extensive but safe optimizations.

With -Ofast, compilers typically enable floating-point optimizations that are not standard-compliant and can deliver more performance than -O3.

However, this can lead to errors, exceptions, or divergence of your simulations, depending on the numerical stability of your applications. If in doubt, consult the application's vendor/authors and the compiler's documentation.
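For example, with GCC (the file names are only placeholders):

gcc -O3 -o myapp myapp.c
gcc -Ofast -o myapp myapp.c

The second command additionally enables the non-standard-compliant floating-point optimizations described above.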

Targeting multiple architectures#

Targeting multiple CPU micro-architectures can be handy if an application should be optimized for a cluster whose partitions have different micro-architectures.

Intel

Intel compilers can generate multiple code paths optimized for different CPU micro-architectures in one binary. This increases the size of the binary.

For compilation specify:

  • -march=...: the oldest supported architecture
  • -ax: a comma-separated list of all newer architectures to be supported

For example:

icc -march=icelake-server -ax sapphirerapids ...

This produces a binary that requires at least the Ice Lake Server micro-architecture, but also contains an optimized code path for the Sapphire Rapids micro-architecture.

GCC and LLVM

Generating multiple code paths optimized for different micro-architectures is not (easily) possible with GCC and LLVM.

However, at compile time you can specify a minimum supported architecture with -march=... and instruct the compiler to tune for a newer one with -mtune=.... In the following example, the binary requires at least the Ice Lake Server micro-architecture and is tuned for the Sapphire Rapids micro-architecture.

gcc -march=icelake-server -mtune=sapphirerapids ...

GNU Compiler Collection#

module: gcc/<version>

gcc of the OS#

Without loading a gcc module, the gcc shipped with the OS is used, which is rather old.
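To get a newer version, load one of the installed gcc modules; replace <version> with a version listed by module avail gcc:

module load gcc/<version>
gcc --version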

Intel and Intel Classic compilers#

module: intel/<version>

oneAPI DPC++ compiler#

The intel module also provides the oneAPI DPC++ compiler dpcpp.
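A basic invocation of dpcpp might look like this (the file names are only placeholders):

dpcpp -O3 -o sycl-app sycl-app.cpp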

Compiling for AMD systems#

The Intel and Intel Classic compilers might not generate the best code when using -march=native or -xHost on an AMD system. On AMD systems supporting AVX2, we recommend using the flags:

-mavx2 -mfma
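For example, with the oneAPI C compiler (the file names are only placeholders):

icx -O3 -mavx2 -mfma -o myapp myapp.c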

Endianness conversion for Fortran#

Little-endian (LE) is used by x86-based processors. LE means the least-significant byte (LSB) of a multi-byte word is stored first. This format is used for unformatted Fortran data files.

To transparently import big-endian (BE) files, e.g. files produced on IBM Power or NEC SX systems, the Intel Fortran compilers can convert the endianness automatically during read and write operations; the behavior can even be set per Fortran unit.

Setting the environment variable F_UFMTENDIAN enables the conversion. Examples of possible values for F_UFMTENDIAN are:

| F_UFMTENDIAN | treat input/output as |
|---|---|
| big | BE |
| little | LE, the default |
| big:10,20 | LE, except for units 10 and 20 |
| "big;little:8" | BE, except for unit 8 |

To treat input and output as big-endian:

F_UFMTENDIAN=big ./fortran-application
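To treat all units as big-endian except for unit 8 (cf. the table above):

F_UFMTENDIAN="big;little:8" ./fortran-application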

NVHPC compilers#

module: nvhpc/<version>

The Nvidia HPC compilers (NVHPC) were formerly known as the PGI compilers.
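A basic compile with host optimization and OpenMP enabled might look like this (the file names are only placeholders; cf. the overview table above):

nvc -O3 -mp -tp=native -o myapp myapp.c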

Nvidia CUDA compilers#

module: cuda/<version>

The Nvidia CUDA compiler driver nvcc compiles CUDA C and C++ code.

For the host part, nvcc relies on an installed host compiler. By default, this is gcc, but other compilers are supported too; see the Nvidia documentation.
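A compile that also optimizes the host code for the current machine might look like this (the file names are only placeholders; cf. the overview table above):

nvcc -O3 -arch=native -Xcompiler -march=native -o cuda-app cuda-app.cu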

Nvidia GPU capabilities#

| card | compute capability | functional capability (FC) | virtual architecture (VA) |
|---|---|---|---|
| V100 | 7.0 | sm_70 | compute_70 |
| A100 | 8.0 | sm_80 | compute_80 |
| A40 | 8.6 | sm_86 | compute_86 |
| GeForce RTX 2080 Ti | 7.5 | sm_75 | compute_75 |
| GeForce RTX 3080 | 8.6 | sm_86 | compute_86 |

More information can be found in the Nvidia documentation.

Compiling for certain capabilities#

Capabilities can be specified via the -gencode flag:

-gencode arch=VA,code=[FC1[,FC2...]]

where VA is the virtual architecture and FC the functional capability.
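For example, to generate code only for A100 GPUs (compute capability 8.0, cf. the table above):

nvcc ... -gencode arch=compute_80,code=sm_80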

It is possible to specify multiple virtual architectures by repeating the -gencode flag. For example:

nvcc ... -gencode arch=compute_80,code=\"sm_80,sm_86\" -gencode arch=compute_70,code=sm_70

Multiple values for code must be wrapped in escaped quotes (\").