Working with NVIDIA GPUs#
CUDA compilers and libraries#
The cuda module provides the CUDA compilers and the corresponding CUDA (runtime) libraries. Loading the appropriate module, e.g. module load cuda/12.1.0, adjusts the typical environment variables and also sets CUDA_HOME or CUDA_INSTALL_PATH, which can be used with make and cmake.
The NVIDIA HPC compilers (formerly PGI) are part of the nvhpc modules.
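For illustration, the toolkit path can be used as sketched below; the source file name, output name, and CMake variable are placeholders, not cluster-specific recommendations:
# compile directly with nvcc from the loaded toolkit (saxpy.cu is a placeholder)
module load cuda/12.1.0
nvcc -O3 -o saxpy saxpy.cu
# or point a CMake project that uses the CUDA language at the loaded toolkit
cmake -DCMAKE_CUDA_COMPILER=$CUDA_HOME/bin/nvcc ..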
GPU statistics in job output#
Statistics on GPU utilization are added to the end of the job output from Slurm. For each binary that used CUDA, the following are reported: GPU name, bus ID, process ID, GPU utilization, GPU memory utilization, maximum memory usage, and overall execution time.
The output will look like this:
...
=== GPU utilization ===
gpu_name, gpu_bus_id, pid, gpu_utilization [%], mem_utilization [%], max_memory_usage [MiB], time [ms]
NVIDIA GeForce RTX 3080, 00000000:1A:00.0, 134883, 92 %, 11 %, 395 MiB, 244633 ms
NVIDIA GeForce RTX 3080, 00000000:1A:00.0, 135412, 92 %, 11 %, 395 MiB, 243797 ms
In this example, two CUDA binaries were executed; both ran on the same GPU (00000000:1A:00.0). The average GPU utilization was 92%, 11% of the GPU memory (395 MiB) was used, and each binary ran for about 244 seconds.
NVIDIA System Management Interface#
nvidia-smi (NVIDIA System Management Interface) is a command line utility that shows the current GPU utilization and the processes currently using the GPU. It is available on all our GPU clusters.
Example output of nvidia-smi
user@a0522 $ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A40 On | 00000000:01:00.0 Off | Off |
| 0% 45C P8 150W / 300W | 1192MiB / 49140MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 48484 C .../environments/demo/bin/python 1092MiB |
+---------------------------------------------------------------------------------------+
The output lists current GPU properties, such as GPU utilization (top right) and memory usage (top center), as well as the processes currently running on the GPU.
nvidia-smi is most useful when you attach to a running job to check the job's GPU utilization.
nvidia-smi can also continuously report information about GPU usage:
- top-like reporting
- detailed reporting
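For example (the 5-second refresh interval is arbitrary):
# top-like reporting: refresh the standard overview every 5 seconds
nvidia-smi -l 5
# detailed reporting: repeatedly print selected sections of the full query output
nvidia-smi -q -d UTILIZATION,MEMORY -l 5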
nvtop GPU status viewer#
nvtop is a (h)top-like task monitor for AMD and NVIDIA GPUs and shows GPU details like memory, utilization, temperature, etc. as well as information about the processes executing on the GPUs.
nvtop is available as a module on Alex and TinyGPU.
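A typical invocation on a GPU node might look like this (the module name nvtop is assumed here):
# load the nvtop module and start the monitor
module load nvtop
nvtop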
NVIDIA Multi-Process Service#
The Multi-Process Service (MPS) is an alternative, binary-compatible implementation of the CUDA Application Programming Interface (API). The MPS runtime architecture is designed to transparently enable co-operative multi-process CUDA applications, typically MPI jobs. This can benefit performance when the GPU compute capacity is underutilized by a single application process.
Using MPS with single-GPU jobs#
# set necessary environment variables and start the MPS daemon
export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps.$SLURM_JOB_ID
export CUDA_MPS_LOG_DIRECTORY=$TMPDIR/nvidia-log.$SLURM_JOB_ID
nvidia-cuda-mps-control -d
# do your work (a.out is just a placeholder)
./a.out -param 1 &
./a.out -param 2 &
./a.out -param 3 &
./a.out -param 4 &
wait
# stop the MPS daemon
echo quit | nvidia-cuda-mps-control
Using MPS with multi-GPU jobs#
# set necessary environment variables and start the MPS daemon
for GPU in $(nvidia-smi --format=csv,noheader --query-gpu=uuid); do
echo "starting mps server for $GPU"
export CUDA_VISIBLE_DEVICES=$GPU
export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps-${GPU}.$SLURM_JOB_ID
export CUDA_MPS_LOG_DIRECTORY=$TMPDIR/nvidia-log-${GPU}.$SLURM_JOB_ID
nvidia-cuda-mps-control -d
done
# do your work - you may need to set CUDA_MPS_PIPE_DIRECTORY correctly per process; see the wrapper sketch below
...
# cleanup MPS
for GPU in $(nvidia-smi --format=csv,noheader --query-gpu=uuid); do
echo "stopping mps server for $GPU"
export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps-${GPU}.$SLURM_JOB_ID
echo 'quit' | nvidia-cuda-mps-control
done
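One possible way to set CUDA_MPS_PIPE_DIRECTORY per process is a small wrapper script around the actual application, sketched below; the script name gpu-wrapper.sh and the simple round-robin mapping of local ranks to GPUs are assumptions and may need to be adapted to your job layout:
# gpu-wrapper.sh (hypothetical): pick the MPS pipe directory that matches
# the GPU assigned to this process, assuming local ranks map round-robin to the node's GPUs
GPUS=( $(nvidia-smi --format=csv,noheader --query-gpu=uuid) )
GPU=${GPUS[$((SLURM_LOCALID % ${#GPUS[@]}))]}
export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps-${GPU}.$SLURM_JOB_ID
exec "$@"
It could then be used in the job script, e.g. as srun ./gpu-wrapper.sh ./a.out.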
See also http://cudamusing.blogspot.com/2013/07/enabling-cuda-multi-process-service-mps.html and https://stackoverflow.com/questions/36015005/cuda-mps-servers-fail-to-start-on-workstation-with-multiple-gpus.
GPU-Profiling with NVIDIA tools#
NVIDIA offers two prominent profiling tools:
- nsys (Nsight Systems) for profiling whole applications
- ncu (Nsight Compute) for zeroing in on specific performance characteristics of single kernels
nsys - Nsight Systems#
An overview of application behavior can be obtained by running
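# a.out is a placeholder for your CUDA application
nsys profile ./a.out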
Transfer the resulting report file to your local machine and open it with a local installation of Nsight Systems.
More command line options are available, as specified in the documentation. Some of the most relevant ones are:
- --stats=true: summarizes the obtained performance data after the application has finished and prints it
- -o: specifies the file name for the generated report, my-profile in this example
- --force-overwrite=true: overwrites the report if it already exists
A full example could be
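# combining the options listed above (a.out is again a placeholder)
nsys profile --stats=true -o my-profile --force-overwrite=true ./a.out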
The resulting report files can grow quite large, depending on the application examined. Please make sure to store them on an appropriate filesystem.
ncu - Nsight Compute#
After getting an execution time overview, more in-depth analysis can be carried out by using Nsight Compute via
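ncu ./a.out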
which by default profiles all kernels in the application. This can be fine-tuned by providing options such as
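--launch-skip 2 --launch-count 1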
to skip the first two kernel launches and limit the number of profiled kernels to 1. Profiling can also be limited to specific kernels using
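--kernel-name my_kernel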
with an assumed kernel name of my_kernel. In most cases, specifying metrics to be measured is recommended as well, e.g. with
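--metrics dram__bytes_read.sum,dram__bytes_write.sum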
for the data volumes read from and written to the GPU's main memory. Further information on available metrics can be found here and some key metrics are listed here. Other command line options can be reviewed in the documentation.
A full profiling call could be
ncu --kernel-name my_kernel --launch-skip 2 --launch-count 1 --metrics dram__bytes_read.sum,dram__bytes_write.sum ./a.out
LIKWID#
LIKWID 5.0 also supports NVIDIA GPUs. To simplify the transition from CPUs to GPUs, the LIKWID API for GPUs closely mirrors the LIKWID API for CPUs, with a few differences. The command line applications gained new options for GPU measurements. A tutorial on how to use LIKWID with NVIDIA GPUs can be found on the LIKWID GitHub page.
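A heavily hedged sketch of a GPU measurement with likwid-perfctr is shown below; the GPU performance group name DATA and the binary name are assumptions, so check the LIKWID documentation for the options and groups available in your installation:
# measure a GPU performance group on GPU 0 for the given application
# (-G selects the GPU list, -W the GPU performance group in LIKWID 5)
likwid-perfctr -G 0 -W DATA ./a.out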