TinyGPU#

TinyGPU is a cluster with different types of consumer and data center Nvidia GPUs.

Hostnames	# nodes (# GPUs)	GPU Type (memory)	CPUs and # cores per node	main memory per node	node-local SSD	Slurm partition
`tg06x`	8 (32)	4 x Nvidia RTX 2080 Ti (11 GB)	2 x Intel Xeon Gold 6134 ("Skylake"), 2 x 16 cores/2 x 32 threads @3.2 GHz	96 GB	1.8 TB	`work`
`tg07x`	4 (16)	4 x Nvidia Tesla V100 (32 GB)	2 x Intel Xeon Gold 6134 ("Skylake"), 2 x 16 cores/2 x 32 threads @3.2 GHz	96 GB	2.9 TB	`v100`
`tg08x`	7 (56)	8 x Nvidia GeForce RTX 3080 (10 GB)	2 x Intel Xeon Gold 6226R ("Cascade Lake"), 2 x 32 cores/2 x 64 threads	384 GB	3.8 TB	`work`, `rtx3080`
`tg09x`	8 (32)	4 x Nvidia A100 (40 GB)	2 x AMD EPYC 7662 ("Rome", "Zen2"), 128 cores @2.0 GHz	512 GB	5.8 TB	`a100`

All nodes have been purchased by specific groups or special projects. These users have priority access and nodes may be reserved exclusively for them.

Access to the machines#

TinyGPU is only available to accounts that are part of the "Tier3 Grundversorgung"; it is not available to NHR project accounts.

For instructions on setting up your SSH connection, see "configuring connection settings" or "SSH in general".

Once SSH is configured, the shared frontend node for TinyGPU and TinyFat can be reached with:

ssh tinyx.nhr.fau.de

Software#

TinyGPU runs Ubuntu 20.04 LTS.

All software on NHR@FAU systems, e.g. (commercial) applications, compilers, and libraries, is provided using environment modules. These modules are used to set up a custom environment when working interactively or inside batch jobs.

For available software see:

Most software is centrally installed using Spack. By default, only a subset of packages installed via Spack is shown. To see all installed packages, load the 000-all-spack-pkgs module. You can install software yourself by using the user-spack functionality.
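
As a minimal sketch of the module workflow described above, the following snippet lists the default modules, loads 000-all-spack-pkgs, and lists the modules again. The variable name ALL_SPACK_MODULE is only illustrative, and the guard assumes that the module command is available only in a shell on the cluster.

```shell
# Name of the module that exposes all Spack-installed packages (from the text above).
ALL_SPACK_MODULE="000-all-spack-pkgs"

# Run the module commands only where "module" exists, e.g. in a login shell on the cluster.
if command -v module >/dev/null 2>&1; then
    module avail                       # default subset of packages
    module load "$ALL_SPACK_MODULE"    # make all Spack-installed packages visible
    module avail                       # now lists the full set
else
    echo "module command not found (not on the cluster?)"
fi
```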

Containers, e.g. Docker, are supported via Apptainer.

Best practices, known issues#

Specific applications:

cuDNN is installed on all nodes and loading a module is not required.

All nodes except those in the a100 partition have Intel processors that support AVX-512. Host software compiled specifically for Intel processors might not run on the a100 partition, whose nodes have AMD processors.

Python, conda, conda environments#

Through the python module, a Conda installation is available. See our Python documentation for usage, initialization, and working with conda environments.

Compiler#

For a general overview of compilers, optimization flags, and targeting a specific CPU or GPU (micro)architecture, see the compiler documentation.

CPU#

The CPU types on the frontend node and in the partitions are different. If you plan to compile your host code for a certain CPU architecture, see the table below for the corresponding flags.

Note

When using the Intel oneAPI or Intel Classic compilers with -march=native or -xHost on nodes of the a100 partition, the compiler might generate suboptimal code for the AMD CPUs.

Note

Software compiled specifically for Intel processors might not run on the a100 partition, since the nodes have AMD CPUs.

The following table shows the compiler flags for targeting TinyGPU's CPUs:

partition	microarchitecture	GCC/LLVM	Intel oneAPI/Classic
all	Zen2, Skylake, Cascade Lake	`-mavx2 -mfma` or `-march=x86-64-v3`	`-mavx2 -mfma`
`work`	Skylake, Cascade Lake	`-march=skylake-avx512`	`-march=skylake-avx512`
`rtx3080`	Cascade Lake	`-march=cascadelake`	`-march=cascadelake`
`v100`	Skylake	`-march=skylake-avx512`	`-march=skylake-avx512`
`a100`	Zen2	`-march=znver2`	`-mavx2 -mfma`
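
As a rough illustration of the table above, the following sketch selects the GCC/LLVM flags for a given partition. The function name gcc_arch_flags and the file name host_code.c are hypothetical; the flags themselves are taken from the table. For the portable "all" case the sketch uses -mavx2 -mfma, because -march=x86-64-v3 is only understood by fairly recent compiler versions.

```shell
# Hypothetical helper: print the GCC/LLVM flags from the table above
# for a TinyGPU partition ("all" = code that runs on every node type).
gcc_arch_flags() {
    case "$1" in
        all)       echo "-mavx2 -mfma" ;;
        work|v100) echo "-march=skylake-avx512" ;;
        rtx3080)   echo "-march=cascadelake" ;;
        a100)      echo "-march=znver2" ;;
        *)         echo "unknown partition: $1" >&2; return 1 ;;
    esac
}

# Example: show the compile line for host code targeting the a100 partition.
echo "gcc -O2 $(gcc_arch_flags a100) -o host_code host_code.c"
```

Binaries built with the "all" flags trade some performance on the AVX-512 nodes for the ability to run in every partition.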

GPU#

With nvcc you can target a specific GPU with the -gencode flag. The flag can be given multiple times to generate code for several targets in one binary. For more details, see the Nvidia CUDA compiler documentation.

card	compute capability	functional architecture (FA)	virtual architecture (VA)	NVCC flags
A100	8.0	`sm_80`	`compute_80`	`-gencode arch=compute_80,code=sm_80`
GeForce RTX 2080 Ti	7.5	`sm_75`	`compute_75`	`-gencode arch=compute_75,code=sm_75`
GeForce RTX 3080	8.6	`sm_86`	`compute_86`	`-gencode arch=compute_86,code=sm_86`
V100	7.0	`sm_70`	`compute_70`	`-gencode arch=compute_70,code=sm_70`
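
To illustrate how several -gencode options from the table can be combined, here is a small sketch that builds the flag list from compute capabilities written without the dot (e.g. 80 for 8.0). The function name gencode_flags and the file name app.cu are hypothetical.

```shell
# Hypothetical helper: build one -gencode option per compute capability
# (70 = V100, 75 = RTX 2080 Ti, 80 = A100, 86 = RTX 3080).
gencode_flags() {
    local flags="" cc
    for cc in "$@"; do
        flags="${flags:+$flags }-gencode arch=compute_${cc},code=sm_${cc}"
    done
    echo "$flags"
}

# Example: show the compile line for a binary that contains code for every TinyGPU card.
echo "nvcc $(gencode_flags 70 75 80 86) -o app app.cu"
```

Each additional target increases compile time and binary size, so it can be worth listing only the cards you actually plan to run on.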

Filesystems#

On all front ends and nodes the filesystems $HOME, $HPCVAULT, and $WORK are mounted. For details see the filesystems documentation.

Node-local SSD `$TMPDIR`#

Data stored on $TMPDIR will be deleted when the job ends.

Each cluster node has a local SSD that is reachable under $TMPDIR.

For more information on how to use $TMPDIR see:

general documentation of $TMPDIR,
staging data, e.g. to speed up training,
sharing data among jobs on a node.

The storage space of the SSD is at least 1.8 TB and is shared among all jobs on a node. Hence, you might not have access to the full capacity of the SSD.
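
A minimal sketch of staging data to the node-local SSD could look as follows. The function name stage_to_tmpdir, the dataset path, and the --data option of the training script are hypothetical placeholders; the sketch only assumes that $TMPDIR is set, as it is inside a job.

```shell
# Hypothetical helper: copy a dataset directory to the node-local SSD and
# print the location of the staged copy. Aborts if $TMPDIR is not set.
stage_to_tmpdir() {
    local src="$1"
    local dest="${TMPDIR:?TMPDIR is not set}/$(basename "$src")"
    cp -r "$src" "$dest" && echo "$dest"
}

# Inside a job script this could be used as (paths and options are placeholders):
#   DATA=$(stage_to_tmpdir "$WORK/my_dataset")
#   python3 train.py --data "$DATA"
```

Because $TMPDIR is wiped when the job ends, results written there have to be copied back to $WORK or $HPCVAULT before the job script finishes.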

Batch processing#

Resources are controlled through the batch system Slurm.

Slurm commands are suffixed with `.tinygpu`#

The front end node tinyx.nhr.fau.de serves both the TinyGPU and the TinyFat cluster. To distinguish which cluster is targeted when a Slurm command is used, Slurm commands for TinyGPU have the .tinygpu suffix.

This means:

instead of srun, use srun.tinygpu
instead of salloc, use salloc.tinygpu
instead of sbatch, use sbatch.tinygpu
instead of sinfo, use sinfo.tinygpu
instead of squeue, use squeue.tinygpu

Each suffixed command is equivalent to the un-suffixed Slurm command with the option --clusters=tinygpu.

When resubmitting jobs from TinyGPU's compute nodes themselves, only use sbatch, i.e. without the .tinygpu suffix.
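
The rule above can be sketched as a small helper that picks the submit command depending on where it runs. The function name submit_cmd and the script name next_job.sh are hypothetical, and the sketch assumes that SLURM_JOB_ID is set only inside a running job, i.e. on a compute node.

```shell
# Hypothetical helper: choose the submit command for the current location.
# Assumption: SLURM_JOB_ID is only set inside a running job on a compute node.
submit_cmd() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        echo "sbatch"            # on a TinyGPU compute node: no suffix
    else
        echo "sbatch.tinygpu"    # on the shared front end tinyx
    fi
}

# Example: show which command would be used to (re)submit a script.
echo "would submit with: $(submit_cmd) next_job.sh"
```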

Partitions#

For each job you have to specify the type and number of GPUs you want to use. Additionally, some partitions also require the partition name itself to be specified.

With each GPU you automatically get a corresponding share of the host's resources like CPU cores and memory.

Compute nodes are shared; however, requested GPUs and host resources are always granted exclusively.

Jobs that do not request at least one GPU will be rejected by the scheduler.

Available partitions and their properties:

Partition	min – max walltime	GPU type (GPU memory)	min – max GPUs	CPU cores per GPU (threads)	Host memory per GPU	Slurm options (1)
`work` (default)	0 – 24:00:00	Nvidia RTX 2080 Ti (11 GB) / Nvidia GeForce RTX 3080 (10 GB)	1 – 4 / 1 – 8	8 (16)	22 GB	`--gres=gpu:#` or `--gres=gpu:rtx3080:#` or `--gres=gpu:rtx2080ti:#`
`rtx3080`	0 – 24:00:00	Nvidia GeForce RTX 3080 (10 GB)	1 – 8	8 (16)	46 GB	`--gres=gpu:# -p rtx3080`
`a100`	0 – 24:00:00	Nvidia A100 SXM4/NVLink (40 GB)	1 – 4	32 (32)	117 GB	`--gres=gpu:a100:# -p a100`
`v100`	0 – 24:00:00	Nvidia Tesla V100 (32 GB)	1 – 4	8 (16)	22 GB	`--gres=gpu:v100:# -p v100`

(1) Replace # with the number of GPUs you want to request.
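
As an illustration of how the Slurm options in the table are put together, the following sketch prints the options for a GPU type and GPU count and rejects counts outside the listed range. The function name gpu_opts and the script name job.sh are hypothetical; the option strings and limits are taken from the table (RTX 2080 Ti and RTX 3080 are requested in the default work partition here).

```shell
# Hypothetical helper: print the Slurm options from the table above for a
# GPU type and GPU count, rejecting counts outside the allowed range.
gpu_opts() {
    local gpu="$1" n="$2" max part=""
    case "$gpu" in
        rtx2080ti) max=4 ;;                    # default partition "work"
        rtx3080)   max=8 ;;                    # default partition "work"
        v100)      max=4; part=" -p v100" ;;
        a100)      max=4; part=" -p a100" ;;
        *) echo "unknown GPU type: $gpu" >&2; return 1 ;;
    esac
    case "$n" in
        ''|*[!0-9]*) echo "GPU count must be a number" >&2; return 1 ;;
    esac
    if [ "$n" -lt 1 ] || [ "$n" -gt "$max" ]; then
        echo "GPU count must be between 1 and $max" >&2; return 1
    fi
    echo "--gres=gpu:${gpu}:${n}${part}"
}

# Example: show the submit line for a job with 2 A100 GPUs.
echo "sbatch.tinygpu $(gpu_opts a100 2) job.sh"
```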

Interactive jobs#

Interactive jobs can be requested by using salloc.tinygpu instead of sbatch.tinygpu and specifying the respective options on the command line.

The environment from the calling shell, like loaded modules, will be inherited by the interactive job.

Interactive job (single GPU)#

The following gives you an interactive shell on one node with one GPU and the corresponding share of CPU cores and host memory (see the partition table above), dedicated to you for one hour:

salloc.tinygpu --gres=gpu:1 --time=01:00:00

Settings from the calling shell (e.g. loaded module paths) will be inherited by the interactive job!

Batch job script examples#

The following examples show general batch scripts. For the following applications we have templates for TinyGPU:

Python (single GPU)#

In this example, we allocate 1 A100 GPU, together with the corresponding share of CPUs and host main memory, for 6 hours.

When the job is started, we load the Python module and activate the conda environment we use for our Python script. After that we can execute the Python script.

#!/bin/bash -l
#
#SBATCH --gres=gpu:a100:1
#SBATCH --partition=a100
#SBATCH --time=6:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV

module load python
conda activate environment-for-script

python3 train.py
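
To submit a script like the one above and look up the job afterwards, the job ID can be captured with sbatch's --parsable option, which prints the job ID followed by a semicolon and the cluster name when a cluster is selected. The following is a sketch only: the function name job_id_from_parsable and the file name job.sh are hypothetical, and it assumes that the .tinygpu wrappers pass additional options through to Slurm.

```shell
# Hypothetical helper: extract the numeric job ID from the output of
# "sbatch --parsable", which prints "<jobid>" or "<jobid>;<cluster>".
job_id_from_parsable() {
    echo "${1%%;*}"
}

# On the front end this could be used as (job.sh is a placeholder):
#   out=$(sbatch.tinygpu --parsable job.sh)
#   squeue.tinygpu --jobs="$(job_id_from_parsable "$out")"
job_id_from_parsable "1234567;tinygpu"   # prints 1234567
```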

MPI#

In this example, the executable will be run using 2 MPI processes for a total job walltime of 6 hours. The job allocates 1 RTX3080 GPU and the corresponding share of CPUs and main memory automatically.

#!/bin/bash -l
#
#SBATCH --ntasks=2
#SBATCH --gres=gpu:rtx3080:1
#SBATCH --partition=rtx3080
#SBATCH --time=6:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV

srun --mpi=pmi2 ./application

Hybrid MPI/OpenMP#

Warning

In recent Slurm versions, the value of --cpus-per-task is no longer automatically propagated to srun, which can lead to errors when the application starts. The value has to be set manually via the variable SRUN_CPUS_PER_TASK.

In this example, 1 A100 GPU is allocated. The executable is run using 2 MPI processes with 16 OpenMP threads each, for a total job walltime of 6 hours. 32 cores are allocated in total, and each OpenMP thread runs on its own physical core.

#!/bin/bash -l

#SBATCH --ntasks=2
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:a100:1
#SBATCH --partition=a100
#SBATCH --time=6:00:00
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV

# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# for Slurm version >22.05: cpus-per-task has to be set again for srun
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK

srun --mpi=pmi2 ./hybrid_application
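
The value for --cpus-per-task in the script above follows from the cores that come with the requested GPUs. A small sketch of that arithmetic, assuming the cores are split evenly among the MPI processes (the function name threads_per_task is hypothetical):

```shell
# Hypothetical helper: OpenMP threads per MPI process when all physical cores
# that come with the requested GPUs are split evenly among the processes.
threads_per_task() {
    local cores_per_gpu="$1" gpus="$2" ntasks="$3"
    echo $(( cores_per_gpu * gpus / ntasks ))
}

# a100 partition: 32 cores per GPU, 1 GPU, 2 MPI processes
threads_per_task 32 1 2   # prints 16
```

The result is an integer division, so choose a number of MPI processes that divides the available cores evenly.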

Attach to a running job#

See the general documentation on batch processing.