

TinyGPU is a cluster with different types of consumer and data center Nvidia GPUs.

| Hostnames | # nodes (# GPUs) | GPU type (memory) | CPUs and # cores per node | Main memory per node | Node-local SSD | Slurm partition |
|---|---|---|---|---|---|---|
| tg06x | 8 (32) | 4 x Nvidia RTX 2080 Ti (11 GB) | 2 x Intel Xeon Gold 6134 ("Skylake"), 2 x 16 cores/2 x 32 threads @3.2 GHz | 96 GB | 1.8 TB | work |
| tg07x | 4 (16) | 4 x Nvidia Tesla V100 (32 GB) | 2 x Intel Xeon Gold 6134 ("Skylake"), 2 x 16 cores/2 x 32 threads @3.2 GHz | 96 GB | 2.9 TB | v100 |
| tg08x | 7 (56) | 8 x Nvidia Geforce RTX3080 (10 GB) | 2 x Intel Xeon Gold 6226R ("Cascade Lake"), 2 x 32 cores/2 x 64 threads | 384 GB | 3.8 TB | work, rtx3080 |
| tg09x | 8 (32) | 4 x Nvidia A100 (40 GB) | 2 x AMD EPYC 7662 ("Rome", "Zen2"), 128 cores @2.0 GHz | 512 GB | 5.8 TB | a100 |

All nodes have been purchased by specific groups or special projects. These users have priority access and nodes may be reserved exclusively for them.

Access to the machines#

TinyGPU is only available to accounts that are part of the "Tier3 Grundversorgung", not to NHR project accounts.

See configuring connection settings or SSH in general for configuring your SSH connection.

If successfully configured, the shared frontend node for TinyGPU and TinyFat can be accessed via SSH by:



TinyGPU runs Ubuntu 20.04 LTS.

All software on NHR@FAU systems, e.g. (commercial) applications, compilers, and libraries, is provided using environment modules. These modules are used to set up a custom environment when working interactively or inside batch jobs.

For available software see:

Most software is centrally installed using Spack. By default, only a subset of packages installed via Spack is shown. To see all installed packages, load the 000-all-spack-pkgs module. You can install software yourself by using the user-spack functionality.
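As a rough illustration of the module workflow described above, the following sketch wraps the usual commands in a small helper. The function name list_all_spack_modules is made up for this example. module load and module avail are standard environment-modules commands, and 000-all-spack-pkgs is the module named above.

```shell
#!/bin/bash
# Hypothetical helper: the function name is an assumption for this sketch,
# not something provided by the cluster.
list_all_spack_modules() {
    # The module command only exists where environment modules are
    # initialized, e.g. in a login shell on the cluster.
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not available in this shell" >&2
        return 1
    fi
    # Make all Spack-installed packages visible, then list what is available.
    module load 000-all-spack-pkgs
    module avail
}
```

Calling list_all_spack_modules in a login shell on the front end should then show the full package list instead of the default subset.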

Containers, e.g. Docker, are supported via Apptainer.

Best practices, known issues#

Specific applications:

cuDNN is installed on all nodes and loading a module is not required.

Except for the a100 partition, all nodes have Intel processors supporting AVX512. Host software compiled specifically for Intel processors might not run on the a100 partition, since the nodes have AMD processors.

Python, conda, conda environments#

Through the python module, an Anaconda installation is available. See our Python documentation for usage, initialization, and working with conda environments.


For a general overview of compilers, optimization flags, and targeting a certain CPU or GPU (micro)architecture, see the compiler documentation.


The CPU types on the frontend node and in the partitions are different. If you plan to compile your host code for a certain CPU architecture, see the table below for the corresponding flags.


When compiling with the Intel oneAPI or Intel Classic compilers and using -march=native or -xHost on nodes of the a100 partition, the compiler might generate non-optimal code for the AMD CPUs.


Software compiled specifically for Intel processors might not run on the a100 partition, since the nodes have AMD CPUs.

The following table shows the compiler flags for targeting TinyGPU's CPUs:

| Partition | Microarchitecture | GCC/LLVM | Intel oneAPI/Classic |
|---|---|---|---|
| all | Zen2, Skylake, Cascade Lake | -mavx2 -mfma or -march=x86-64-v3 | -mavx2 -mfma |
| work | Skylake, Cascade Lake | -march=skylake-avx512 | -march=skylake-avx512 |
| rtx3080 | Cascade Lake | -march=cascadelake | -march=cascadelake |
| v100 | Skylake | -march=skylake-avx512 | -march=skylake-avx512 |
| a100 | Zen2 | -march=znver2 | -mavx2 -mfma |
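To make the table easier to use in build scripts, the mapping can be expressed as a small lookup. This is only a sketch: the function name gcc_arch_flag and the file name main.c are placeholders, and only the GCC/LLVM column is covered.

```shell
#!/bin/bash
# Hypothetical helper: map a TinyGPU partition name to the GCC/LLVM flag
# from the table above. The function name is an assumption for this sketch.
gcc_arch_flag() {
    case "$1" in
        work|v100) echo "-march=skylake-avx512" ;;
        rtx3080)   echo "-march=cascadelake" ;;
        a100)      echo "-march=znver2" ;;
        all)       echo "-march=x86-64-v3" ;;   # portable across all partitions
        *)         echo "unknown partition: $1" >&2; return 1 ;;
    esac
}

# Print the compile command for host code meant for the a100 partition.
# main.c is a placeholder file name.
echo "gcc -O2 $(gcc_arch_flag a100) -o app main.c"
```

Printing the command instead of running it keeps the sketch usable on machines without the target compiler. In a real build script the flag would be passed to the compiler directly.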


With nvcc you can target a specific GPU with the -gencode flag. It is possible to specify -gencode multiple times, generating code for multiple targets in one binary. For more details, see Nvidia CUDA compilers documentation.

| Card | Compute capability | Functional capability (FA) | Virtual architecture (VA) | NVCC flags |
|---|---|---|---|---|
| A100 | 8.0 | sm_80 | compute_80 | -gencode arch=compute_80,code=sm_80 |
| Geforce RTX 2080 Ti | 7.5 | sm_75 | compute_75 | -gencode arch=compute_75,code=sm_75 |
| Geforce RTX 3080 | 8.6 | sm_86 | compute_86 | -gencode arch=compute_86,code=sm_86 |
| V100 | 7.0 | sm_70 | compute_70 | -gencode arch=compute_70,code=sm_70 |
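Since -gencode may be repeated, a single binary can carry code for several of the GPU types listed above. The following sketch assembles the flags from a list of compute capabilities. The function name gencode_flags and the file name kernel.cu are placeholders for illustration.

```shell
#!/bin/bash
# Hypothetical helper: build one -gencode flag per compute capability,
# given without the dot (70 = V100, 75 = RTX 2080 Ti, 80 = A100, 86 = RTX 3080).
# The body runs in a subshell so the helper variables stay local.
gencode_flags() (
    flags=""
    for cc in "$@"; do
        flags="${flags:+$flags }-gencode arch=compute_${cc},code=sm_${cc}"
    done
    echo "$flags"
)

# Print the command for a binary that contains code for all four GPU types.
# kernel.cu is a placeholder file name.
echo "nvcc $(gencode_flags 70 75 80 86) -o kernel kernel.cu"
```

A binary built this way is larger and takes longer to compile, so it may be preferable to list only the GPU types a job will actually run on.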


On all front ends and nodes the filesystems $HOME, $HPCVAULT, and $WORK are mounted. For details see the filesystems documentation.

Node-local SSD $TMPDIR#

Data stored on $TMPDIR will be deleted when the job ends.

Each cluster node has a local SSD that is reachable under $TMPDIR.

For more information on how to use $TMPDIR see:

The storage space of the SSD is at least 1.8 TB and is shared among all jobs on a node. Hence, you might not have access to the full capacity of the SSD.
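Because data on $TMPDIR is deleted when the job ends, a common pattern is to copy input data to the SSD at job start and copy results back before the job finishes. The following is a minimal sketch of that pattern. The function names stage_in and stage_out and all directory names are assumptions for this example.

```shell
#!/bin/bash
# Hypothetical helpers for staging data through the node-local SSD.
# All directory names passed in are placeholders chosen by the caller.

# Copy an input directory into $TMPDIR and print the path of the local copy.
stage_in() (
    src="$1"
    dest="$TMPDIR/$(basename "$src")"
    cp -r "$src" "$dest" && echo "$dest"
)

# Copy a results directory from $TMPDIR to permanent storage
# before the job ends and $TMPDIR is cleaned up.
stage_out() (
    results="$1"
    target="$2"
    mkdir -p "$target" && cp -r "$results" "$target/"
)
```

Inside a job script this could be used as data=$(stage_in "$WORK/dataset") before the application starts and stage_out "$TMPDIR/results" "$WORK/run-$SLURM_JOB_ID" after it finishes, where dataset and results are placeholder directory names.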

Batch processing#

Resources are controlled through the batch system Slurm.

Slurm commands are suffixed with .tinygpu#

The front end node serves both the TinyGPU and the TinyFat cluster. To distinguish which cluster is targeted when a Slurm command is used, Slurm commands for TinyGPU have the .tinygpu suffix.

This means you should use:

  • srun.tinygpu instead of srun
  • salloc.tinygpu instead of salloc
  • sbatch.tinygpu instead of sbatch
  • sinfo.tinygpu instead of sinfo

These commands are equivalent to the un-suffixed Slurm commands with the option --clusters=tinygpu.

When resubmitting jobs from TinyGPU's compute nodes themselves, only use sbatch, i.e. without the .tinygpu suffix.
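For scripts that may run both on the front end and inside a job, the rule above can be captured in a small helper. This is a sketch under one assumption: a set SLURM_JOB_ID variable, which Slurm defines inside running jobs, is taken as the sign that the script runs on a compute node. The function name tinygpu_sbatch_cmd is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: print the sbatch command name to use for TinyGPU.
# Assumption: SLURM_JOB_ID is only set inside a running job, so its presence
# is used to detect that we are on a compute node.
tinygpu_sbatch_cmd() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "sbatch"            # on a compute node: no suffix
    else
        echo "sbatch.tinygpu"    # on the front end: suffixed command
    fi
}

# On the front end, the following two forms are equivalent
# (job.sh is a placeholder script name):
#   sbatch.tinygpu job.sh
#   sbatch --clusters=tinygpu job.sh
```

A job script that resubmits itself could then call "$(tinygpu_sbatch_cmd)" job.sh without hard-coding either variant.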


For each job you have to specify the type and number of GPUs you want to use. Additionally, some partitions also require the partition name itself to be specified.

With each GPU you automatically get a corresponding share of the host's resources like CPU cores and memory.

Compute nodes are shared; however, requested GPUs and host resources are always granted exclusively.

Jobs that do not request at least one GPU will be rejected by the scheduler.

Available partitions and their properties:

| Partition | min – max walltime | GPU type (GPU memory) | min – max GPUs | CPU cores per GPU (threads) | Host memory per GPU | Slurm options (1) |
|---|---|---|---|---|---|---|
| work (default) | 0 – 24:00:00 | Nvidia RTX 2080 Ti (11 GB RAM) / Nvidia Geforce RTX3080 (10 GB RAM) | 1 – 4 / 1 – 8 | 8 (16) | 22 GB | --gres=gpu:# or --gres=gpu:rtx3080:# or --gres=gpu:rtx2080ti:# |
| rtx3080 | 0 – 24:00:00 | Nvidia Geforce RTX3080 (10 GB RAM) | 1 – 8 | 8 (16) | 46 GB | --gres=gpu:# -p rtx3080 |
| a100 | 0 – 24:00:00 | Nvidia A100 SXM4/NVLink (40 GB RAM) | 1 – 4 | 32 (32) | 117 GB | --gres=gpu:a100:# -p a100 |
| v100 | 0 – 24:00:00 | Nvidia Tesla V100 (32 GB RAM) | 1 – 4 | 8 (16) | 22 GB | --gres=gpu:v100:# -p v100 |

(1) Replace # with the number of GPUs you want to request.
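The Slurm options column can be turned into a small lookup that also checks the GPU count against the limits in the table. This is only a sketch: the function name gpu_request_opts is hypothetical, and for the work partition it uses the generic --gres=gpu:# form with the larger limit of 8, which only RTX3080 nodes can satisfy.

```shell
#!/bin/bash
# Hypothetical helper: build the sbatch/salloc options from the table above
# for a given partition and GPU count. Runs in a subshell so that "exit"
# only ends the function and the variables stay local.
gpu_request_opts() (
    partition="$1"
    count="$2"
    case "$partition" in
        work)    max=8; opts="--gres=gpu:$count" ;;
        rtx3080) max=8; opts="--gres=gpu:$count -p rtx3080" ;;
        a100)    max=4; opts="--gres=gpu:a100:$count -p a100" ;;
        v100)    max=4; opts="--gres=gpu:v100:$count -p v100" ;;
        *) echo "unknown partition: $partition" >&2; exit 1 ;;
    esac
    # Reject counts outside the min/max range listed in the table.
    if [ "$count" -lt 1 ] || [ "$count" -gt "$max" ]; then
        echo "partition $partition allows 1 to $max GPUs" >&2
        exit 1
    fi
    echo "$opts"
)

# Print a submit command for two A100 GPUs (job.sh is a placeholder name).
echo "sbatch.tinygpu $(gpu_request_opts a100 2) job.sh"
```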

Interactive jobs#

Interactive jobs can be requested by using salloc.tinygpu instead of sbatch.tinygpu and specifying the respective options on the command line.

The environment from the calling shell, like loaded modules, will be inherited by the interactive job.

Interactive job (single GPU)#

The following will give you an interactive shell on one node with one GPU and the corresponding share of CPU cores and host memory dedicated to you for one hour:

salloc.tinygpu --gres=gpu:1 --time=01:00:00

Settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!

Batch job script examples#

The following examples show general batch scripts. For the following applications we have templates for TinyGPU:

Python (single GPU)#

In this example, we allocate 1 A100 GPU for 6 hours and the corresponding share of CPUs and host main memory.

When the job is started, we load the Python module and activate the conda environment we use for our Python script. After that we can execute the Python script.

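#!/bin/bash -l
#SBATCH --gres=gpu:a100:1
#SBATCH --partition=a100
#SBATCH --time=6:00:00
#SBATCH --export=NONE

# load the Python module and activate the conda environment for the script
module load python
conda activate environment-for-script

# run the Python script (script.py is a placeholder name)
python3 script.py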



MPI#

In this example, the executable will be run using 2 MPI processes for a total job walltime of 6 hours. The job allocates 1 RTX3080 GPU and automatically receives the corresponding share of CPUs and main memory.

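#!/bin/bash -l
#SBATCH --ntasks=2
#SBATCH --gres=gpu:rtx3080:1
#SBATCH --partition=rtx3080
#SBATCH --time=6:00:00
#SBATCH --export=NONE

# with --export=NONE, unset SLURM_EXPORT_ENV so that srun passes the job environment on to the tasks
unset SLURM_EXPORT_ENV

srun --mpi=pmi2 ./application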

Hybrid MPI/OpenMP#


In recent Slurm versions, the value of --cpus-per-task is no longer automatically propagated to srun, which can cause errors when the application starts. This value has to be set manually via the variable SRUN_CPUS_PER_TASK.

In this example, 1 A100 GPU is allocated. The executable will be run using 2 MPI processes with 16 OpenMP threads each for a total job walltime of 6 hours. 32 cores are allocated in total and each OpenMP thread is running on a physical core.

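#!/bin/bash -l

#SBATCH --ntasks=2
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:a100:1
#SBATCH --partition=a100
#SBATCH --time=6:00:00
#SBATCH --export=NONE

# with --export=NONE, unset SLURM_EXPORT_ENV so that srun passes the job environment on to the tasks
unset SLURM_EXPORT_ENV

# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# for Slurm version >22.05: cpus-per-task has to be set again for srun
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

srun --mpi=pmi2 ./hybrid_application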
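The core count in the hybrid example follows from simple arithmetic: the number of tasks times the CPUs per task must fit into the cores that come with the requested GPUs. The sketch below checks this before submitting. The function name fits_gpu_share is hypothetical, and the cores-per-GPU value has to be taken from the partition table.

```shell
#!/bin/bash
# Hypothetical helper: check that ntasks * cpus-per-task fits into the CPU
# cores that come with the requested GPUs. Runs in a subshell so the
# variables stay local. Returns 0 if the request fits, non-zero otherwise.
fits_gpu_share() (
    ntasks="$1"; cpus_per_task="$2"; gpus="$3"; cores_per_gpu="$4"
    needed=$((ntasks * cpus_per_task))
    available=$((gpus * cores_per_gpu))
    echo "needed=$needed available=$available"
    [ "$needed" -le "$available" ]
)

# The hybrid example: 2 tasks x 16 threads on 1 A100 GPU (32 cores per GPU).
fits_gpu_share 2 16 1 32
```

For the hybrid example this prints needed=32 available=32, so the request uses exactly the host share of one A100 GPU.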

Attach to a running job#

See the general documentation on batch processing.