TinyGPU#
TinyGPU is a cluster with different types of consumer and data center Nvidia GPUs.
Hostnames | # nodes (# GPUs) | GPU Type (memory) | CPUs and # cores per node | main memory per node | node-local SSD | Slurm partition |
---|---|---|---|---|---|---|
tg06x | 8 (32) | 4 x Nvidia RTX 2080 Ti (11 GB) | 2 x Intel Xeon Gold 6134 ("Skylake"), 2 x 16 cores/2 x 32 threads @3.2 GHz | 96 GB | 1.8 TB | work |
tg07x | 4 (16) | 4 x Nvidia Tesla V100 (32 GB) | 2 x Intel Xeon Gold 6134 ("Skylake"), 2 x 16 cores/2 x 32 threads @3.2 GHz | 96 GB | 2.9 TB | v100 |
tg08x | 7 (56) | 8 x Nvidia Geforce RTX 3080 (10 GB) | 2 x Intel Xeon Gold 6226R ("Cascade Lake"), 2 x 32 cores/2 x 64 threads | 384 GB | 3.8 TB | work, rtx3080 |
tg09x | 8 (32) | 4 x Nvidia A100 (40 GB) | 2 x AMD EPYC 7662 ("Rome", "Zen2"), 128 cores @2.0 GHz | 512 GB | 5.8 TB | a100 |
All nodes have been purchased by specific groups or special projects. These users have priority access and nodes may be reserved exclusively for them.
Access to the machines#
TinyGPU is only available to accounts that are part of the "Tier3 Grundversorgung" (basic Tier 3 service), not to NHR project accounts.
See configuring connection settings or SSH in general for configuring your SSH connection.
Once your connection is configured, the shared front end node for TinyGPU and TinyFat can be accessed via SSH:
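For example, assuming your SSH settings from the linked documentation are in place (USERNAME is a placeholder for your HPC account name):
ssh USERNAME@tinyx.nhr.fau.de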
Software#
TinyGPU runs Ubuntu 20.04 LTS.
All software on NHR@FAU systems, e.g. (commercial) applications, compilers, and libraries, is provided using environment modules. These modules are used to set up a custom environment when working interactively or inside batch jobs.
For available software see:
Most software is centrally installed using Spack. By default, only a subset of the packages installed via Spack is shown. To see all installed packages, load the 000-all-spack-pkgs module.
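For example, the additional packages can be made visible with standard module commands (a minimal sketch):
module load 000-all-spack-pkgs
module avail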
You can install software yourself by using the user-spack functionality.
Containers, e.g. Docker, are supported via Apptainer.
Best practices, known issues#
Specific applications:
cuDNN is installed on all nodes and loading a module is not required.
Except for the a100 partition, all nodes have Intel processors supporting AVX-512. Host software compiled specifically for Intel processors might not run on the a100 partition, since these nodes have AMD processors.
Python, conda, conda environments#
Through the python module, a Conda installation is available. See our Python documentation for usage, initialization, and working with conda environments.
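A minimal interactive sketch, assuming a hypothetical environment name myenv (see the Python documentation for the recommended workflow):
module load python
conda create -n myenv python numpy   # myenv and the package list are placeholders
conda activate myenv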
Compiler#
For a general overview about compilers, optimizations flags, and targeting a certain CPU or GPU (micro)architecture see the compiler documentation.
CPU#
The CPU types on the frontend node and in the partitions are different. If you plan to compile your host code for a certain CPU architecture, see the table below for the corresponding flags.
Note
When using the Intel oneAPI or Intel Classic compilers with -march=native or -xHost on nodes of the a100 partition, the compiler might generate non-optimal code for the AMD CPUs.
Note
Software compiled specifically for Intel processors might not run on the a100 partition, since the nodes have AMD CPUs.
The following table shows the compiler flags for targeting TinyGPU's CPUs:
partition | microarchitecture | GCC/LLVM | Intel OneAPI/Classic |
---|---|---|---|
all | Zen2, Skylake, Cascade Lake | -mavx2 -mfma or -march=x86-64-v3 | -mavx2 -mfma |
work | Skylake, Cascade Lake | -march=skylake-avx512 | -march=skylake-avx512 |
rtx3080 | Cascade Lake | -march=cascadelake | -march=cascadelake |
v100 | Skylake | -march=skylake-avx512 | -march=skylake-avx512 |
a100 | Zen2 | -march=znver2 | -mavx2 -mfma |
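For example, host code meant to run only on the a100 partition could be compiled with GCC roughly as follows (source and binary names are placeholders):
gcc -O3 -march=znver2 -o app app.c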
GPU#
With nvcc you can target a specific GPU via the -gencode flag. It is possible to specify -gencode multiple times, generating code for multiple targets in one binary. For more details, see the Nvidia CUDA compiler documentation.
card | compute capability | functional capability (FA) | virtual architecture (VA) | NVCC flags |
---|---|---|---|---|
A100 | 8.0 | sm_80 | compute_80 | -gencode arch=compute_80,code=sm_80 |
Geforce RTX 2080 Ti | 7.5 | sm_75 | compute_75 | -gencode arch=compute_75,code=sm_75 |
Geforce RTX 3080 | 8.6 | sm_86 | compute_86 | -gencode arch=compute_86,code=sm_86 |
V100 | 7.0 | sm_70 | compute_70 | -gencode arch=compute_70,code=sm_70 |
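For example, to build one binary containing code for both the A100 and the Geforce RTX 3080 (source and binary names are placeholders):
nvcc -O3 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -o app app.cu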
Filesystems#
On all front ends and nodes the filesystems $HOME, $HPCVAULT, and $WORK are mounted.
For details see the filesystems documentation.
Node-local SSD $TMPDIR#
Data stored on $TMPDIR will be deleted when the job ends.
Each cluster node has a local SSD that is reachable under $TMPDIR.
For more information on how to use $TMPDIR see:
- general documentation of $TMPDIR,
- staging data, e.g. to speed up training,
- sharing data among jobs on a node.
The storage space of the SSD is at least 1.8 TB and is shared among all jobs on a node. Hence, you might not have access to the full capacity of the SSD.
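A minimal staging sketch for use inside a job script, assuming a training data set archived at the hypothetical path $WORK/dataset.tar:
cp "$WORK/dataset.tar" "$TMPDIR"
cd "$TMPDIR"
tar xf dataset.tar   # unpack on the node-local SSD; point your application at $TMPDIR afterwards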
Batch processing#
Resources are controlled through the batch system Slurm.
Slurm commands are suffixed with .tinygpu#
The front end node tinyx.nhr.fau.de serves both the TinyGPU and the TinyFat cluster. To distinguish which cluster is targeted when a Slurm command is used, Slurm commands for TinyGPU have the .tinygpu suffix.
This means:
- instead of srun, use srun.tinygpu
- instead of salloc, use salloc.tinygpu
- instead of sbatch, use sbatch.tinygpu
- instead of sinfo, use sinfo.tinygpu
- instead of squeue, use squeue.tinygpu
These commands are equivalent to the un-suffixed Slurm commands used with the option --clusters=tinygpu.
When resubmitting jobs from TinyGPU's compute nodes themselves, only use sbatch, i.e. without the .tinygpu suffix.
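For example, submitting a job script (job.sh is a placeholder) from the front end:
sbatch.tinygpu job.sh
# equivalent to
sbatch --clusters=tinygpu job.sh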
Partitions#
For each job you have to specify the type and number of GPUs you want to use. Additionally, some partitions also require the partition name itself to be specified.
With each GPU you automatically get a corresponding share of the host's resources like CPU cores and memory.
Compute nodes are shared; however, requested GPUs and host resources are always granted exclusively.
Jobs that do not request at least one GPU will be rejected by the scheduler.
Available partitions and their properties:
Partition | min – max walltime | GPU type (GPU memory) | min – max GPUs | CPU cores per GPU (threads) | Host memory per GPU | Slurm options (1) |
---|---|---|---|---|---|---|
work (default) | 0 – 24:00:00 | Nvidia RTX 2080 Ti (11 GB RAM) / Nvidia Geforce RTX 3080 (10 GB RAM) | 1 – 4 / 1 – 8 | 8 (16) | 22 GB | --gres=gpu:# or --gres=gpu:rtx3080:# or --gres=gpu:rtx2080ti:# |
rtx3080 | 0 – 24:00:00 | Nvidia Geforce RTX 3080 (10 GB RAM) | 1 – 8 | 8 (16) | 46 GB | --gres=gpu:# -p rtx3080 |
a100 | 0 – 24:00:00 | Nvidia A100 SXM4/NVLink (40 GB RAM) | 1 – 4 | 32 (32) | 117 GB | --gres=gpu:a100:# -p a100 |
v100 | 0 – 24:00:00 | Nvidia Tesla V100 (32 GB RAM) | 1 – 4 | 8 (16) | 22 GB | --gres=gpu:v100:# -p v100 |
(1) Replace # with the number of GPUs you want to request.
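For example, to request two GPUs in the rtx3080 partition for a batch script (job.sh is a placeholder):
sbatch.tinygpu --gres=gpu:2 -p rtx3080 job.sh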
Interactive jobs#
Interactive jobs can be requested by using salloc.tinygpu instead of sbatch.tinygpu and specifying the respective options on the command line.
The environment from the calling shell, like loaded modules, will be inherited by the interactive job.
Interactive job (single GPU)#
The following will give you an interactive shell on one node with one GPU and the corresponding share of CPU cores and host memory, dedicated to you for one hour:
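A minimal sketch using the Slurm options described above:
salloc.tinygpu --gres=gpu:1 --time=01:00:00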
Batch job script examples#
The following examples show general batch scripts. For the following applications we have templates for TinyGPU:
Python (single GPU)#
In this example, we allocate 1 A100 GPU for 6 hours and the corresponding share of CPUs and host main memory.
When the job is started, we load the Python module and activate the conda environment we use for our Python script. After that we can execute the Python script.
#!/bin/bash -l
#
#SBATCH --gres=gpu:a100:1
#SBATCH --partition=a100
#SBATCH --time=6:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
module load python
conda activate environment-for-script
python3 train.py
MPI#
In this example, the executable will be run using 2 MPI processes for a total job walltime of 6 hours. The job allocates 1 RTX3080 GPU and the corresponding share of CPUs and main memory automatically.
#!/bin/bash -l
#
#SBATCH --ntasks=2
#SBATCH --gres=gpu:rtx3080:1
#SBATCH --partition=rtx3080
#SBATCH --time=6:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
srun --mpi=pmi2 ./application
Hybrid MPI/OpenMP#
Warning
In recent Slurm versions, the value of --cpus-per-task is no longer automatically propagated to srun, leading to errors at application start. The value has to be set manually via the environment variable SRUN_CPUS_PER_TASK.
In this example, 1 A100 GPU is allocated. The executable will be run using 2 MPI processes with 16 OpenMP threads each for a total job walltime of 6 hours. 32 cores are allocated in total and each OpenMP thread is running on a physical core.
#!/bin/bash -l
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:a100:1
#SBATCH --partition=a100
#SBATCH --time=6:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV
# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
# for Slurm version >22.05: cpus-per-task has to be set again for srun
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
srun --mpi=pmi2 ./hybrid_application
Attach to a running job#
See the general documentation on batch processing.