Working with NVIDIA GPUs#
CUDA compilers and libraries#
The cuda module provides the CUDA compilers and the corresponding CUDA (runtime) libraries. Loading the appropriate module, e.g. module load cuda/12.1.0, adjusts the typical environment variables and also sets CUDA_HOME or CUDA_INSTALL_PATH, which can be used with make and cmake.
The NVIDIA HPC compilers (formerly PGI) are part of the nvhpc modules.
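For illustration, the toolkit path can be used as sketched below; the source file name, output name, and CMake variable are placeholders, not cluster-specific recommendations:
# compile directly with nvcc from the loaded toolkit (saxpy.cu is a placeholder)
module load cuda/12.1.0
nvcc -O3 -o saxpy saxpy.cu
# or point a CMake project that uses the CUDA language at the loaded toolkit
cmake -DCMAKE_CUDA_COMPILER=$CUDA_HOME/bin/nvcc ..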
GPU statistics in job output#
Statistics on GPU utilization are added to the end of the job output from Slurm. For each binary that used CUDA, the following are reported: GPU name, bus ID, process ID, GPU utilization, GPU memory utilization, maximum memory usage, and overall execution time.
The output will look like this:
...
=== GPU utilization ===
gpu_name, gpu_bus_id, pid, gpu_utilization [%], mem_utilization [%], max_memory_usage [MiB], time [ms]
NVIDIA GeForce RTX 3080, 00000000:1A:00.0, 134883, 92 %, 11 %, 395 MiB, 244633 ms
NVIDIA GeForce RTX 3080, 00000000:1A:00.0, 135412, 92 %, 11 %, 395 MiB, 243797 ms
In this example, two CUDA binaries were executed; both ran on the same GPU (00000000:1A:00.0). The average GPU utilization was 92%, 11% of the GPU memory (395 MiB) was used, and each binary ran for about 244 seconds.
NVIDIA System Management Interface#
nvidia-smi (NVIDIA System Management Interface) is a command line utility that shows the current GPU utilization and the processes currently using the GPU. It is available on all our GPU clusters.
Example output of nvidia-smi
user@a0522 $ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A40 On | 00000000:01:00.0 Off | Off |
| 0% 45C P8 150W / 300W | 1192MiB / 49140MiB | 100% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 48484 C .../environments/demo/bin/python 1092MiB |
+---------------------------------------------------------------------------------------+
The output lists current GPU properties, such as GPU utilization (top right) and memory usage (top center), as well as the processes currently running on the GPU.
nvidia-smi is most useful when you attach to a running job to check the job's GPU utilization.
nvidia-smi can also continuously report information about GPU usage:
- top-like reporting
- detailed reporting
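For example (the 5-second refresh interval is arbitrary):
# top-like reporting: refresh the standard overview every 5 seconds
nvidia-smi -l 5
# detailed reporting: repeatedly print selected sections of the full query output
nvidia-smi -q -d UTILIZATION,MEMORY -l 5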
nvtop GPU status viewer#
nvtop is a (h)top-like task monitor for AMD and NVIDIA GPUs and shows GPU details like memory, utilization, temperature, etc. as well as information about the processes executing on the GPUs.
nvtop is available as a module on Alex and TinyGPU.
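A typical invocation on a GPU node might look like this (the module name nvtop is assumed here):
# load the nvtop module and start the monitor
module load nvtop
nvtop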
NVIDIA Multi-Process Service#
The Multi-Process Service (MPS) is an alternative, binary-compatible implementation of the CUDA Application Programming Interface (API). The MPS runtime architecture is designed to transparently enable co-operative multi-process CUDA applications, typically MPI jobs. This can benefit performance when the GPU compute capacity is underutilized by a single application process.
Using MPS with single-GPU jobs#
# set necessary environment variables and start the MPS daemon
export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps.$SLURM_JOB_ID
export CUDA_MPS_LOG_DIRECTORY=$TMPDIR/nvidia-log.$SLURM_JOB_ID
nvidia-cuda-mps-control -d
# do your work (a.out is just a placeholder)
./a.out -param 1 &
./a.out -param 2 &
./a.out -param 3 &
./a.out -param 4 &
wait
# stop the MPS daemon
echo quit | nvidia-cuda-mps-control
Using MPS with multi-GPU jobs#
# set necessary environment variables and start the MPS daemon
for GPU in $(nvidia-smi --format=csv,noheader --query-gpu=uuid); do
echo "starting mps server for $GPU"
export CUDA_VISIBLE_DEVICES=$GPU
export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps-${GPU}.$SLURM_JOB_ID
export CUDA_MPS_LOG_DIRECTORY=$TMPDIR/nvidia-log-${GPU}.$SLURM_JOB_ID
nvidia-cuda-mps-control -d
done
# do your work - you may need to set CUDA_MPS_PIPE_DIRECTORY correctly per process; see the wrapper sketch below
...
# cleanup MPS
for GPU in $(nvidia-smi --format=csv,noheader --query-gpu=uuid); do
echo "stopping mps server for $GPU"
export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps-${GPU}.$SLURM_JOB_ID
echo 'quit' | nvidia-cuda-mps-control
done
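One possible way to set CUDA_MPS_PIPE_DIRECTORY per process is a small wrapper script around the actual application, sketched below; the script name gpu-wrapper.sh and the simple round-robin mapping of local ranks to GPUs are assumptions and may need to be adapted to your job layout:
# gpu-wrapper.sh (hypothetical): pick the MPS pipe directory that matches
# the GPU assigned to this process, assuming local ranks map round-robin to the node's GPUs
GPUS=( $(nvidia-smi --format=csv,noheader --query-gpu=uuid) )
GPU=${GPUS[$((SLURM_LOCALID % ${#GPUS[@]}))]}
export CUDA_MPS_PIPE_DIRECTORY=$TMPDIR/nvidia-mps-${GPU}.$SLURM_JOB_ID
exec "$@"
It could then be used in the job script, e.g. as srun ./gpu-wrapper.sh ./a.out.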
See also http://cudamusing.blogspot.com/2013/07/enabling-cuda-multi-process-service-mps.html and https://stackoverflow.com/questions/36015005/cuda-mps-servers-fail-to-start-on-workstation-with-multiple-gpus.
GPU-Profiling with NVIDIA tools#
NVIDIA offers two prominent profiling tools:
- nsys (Nsight Systems) for profiling whole applications
- ncu (Nsight Compute) for zeroing in on specific performance characteristics of single kernels
nsys - Nsight Systems#
An overview of application behavior can be obtained by running
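# a.out is a placeholder for your CUDA application
nsys profile ./a.out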
Transfer the resulting report file to your local machine and open it with a local installation of Nsight Systems.
More command line options are available, as specified in the documentation. Some of the most relevant ones are:
- --stats=true: summarizes the obtained performance data after the application has finished and prints it
- -o: specifies the file name for the generated report, my-profile in this example
- --force-overwrite=true: overwrites the report if it already exists
A full example could be
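# combining the options listed above (a.out is again a placeholder)
nsys profile --stats=true -o my-profile --force-overwrite=true ./a.out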
The resulting report files can grow quite large, depending on the application examined. Please make sure to store them on an appropriate filesystem.
ncu - Nsight Compute#
After getting an execution time overview, more in-depth analysis can be carried out by using Nsight Compute via
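ncu ./a.out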
which by default profiles all kernels in the application. This can be fine-tuned by providing options such as
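--launch-skip 2 --launch-count 1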
to skip the first two kernel launches and limit the number of profiled kernels to 1. Profiling can also be limited to specific kernels using
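--kernel-name my_kernel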
with an assumed kernel name of my_kernel. In most cases, specifying metrics to be measured is recommended as well, e.g. with
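--metrics dram__bytes_read.sum,dram__bytes_write.sum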
for the data volumes read from and written to the GPU's main memory. Further information on available metrics can be found here and some key metrics are listed here. Other command line options can be reviewed in the documentation.
A full profiling call could be
ncu --kernel-name my_kernel --launch-skip 2 --launch-count 1 --metrics dram__bytes_read.sum,dram__bytes_write.sum ./a.out
LIKWID#
LIKWID 5.0 also supports NVIDIA GPUs. To simplify the transition from CPUs to GPUs, the LIKWID API for GPUs closely mirrors the LIKWID API for CPUs, with a few differences. The command line applications gained new options for GPU measurements. A tutorial on how to use LIKWID with NVIDIA GPUs can be found on the LIKWID GitHub page.
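A heavily hedged sketch of a GPU measurement with likwid-perfctr is shown below; the GPU performance group name DATA and the binary name are assumptions, so check the LIKWID documentation for the options and groups available in your installation:
# measure a GPU performance group on GPU 0 for the given application
# (-G selects the GPU list, -W the GPU performance group in LIKWID 5)
likwid-perfctr -G 0 -W DATA ./a.out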