
PyTorch#

This guide describes how to install and use PyTorch, a machine learning framework.

Install or build packages inside an interactive job on the target cluster

Install or build packages using an interactive job on the target cluster (Alex, Fritz, TinyGPU, Woody) to make sure GPU support and hardware features can be used properly.

For internet access on the compute node, configure a proxy by executing the following lines in your shell inside the interactive job:

export http_proxy=http://proxy:80
export https_proxy=http://proxy:80

Installation via pip/conda#

Preparation#

  1. Load the Python module by executing:

    module add python
    
    Without loading the Python module, the rather old system-installed Python would be used.

  2. Optional: create and activate a virtual environment (e.g. a venv or conda environment) to install the packages into.

Installation#

Configure the installation command line.

The command line for installing PyTorch depends on several factors.
Go to: https://pytorch.org/get-started/locally/ and select:

  • PyTorch Build: Stable
  • Your OS: Linux
  • Package: Conda or Pip, depending on what package manager you want to use.
  • Language: Python
  • Compute Platform: choose a CUDA version

The command line to execute will be shown at the bottom of the table.

Test installation#

To check that your PyTorch installation is functional and detects the GPU(s), execute the following command on a compute node, after loading the Python module and activating the virtual environment you installed PyTorch into:

python3 -c 'import torch; print(torch.cuda.is_available())'
# output when GPUs are usable:
# True
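
If you need more detail than a plain True/False, a slightly longer check (a minimal sketch; the file name check_torch.py is just an example) also reports the PyTorch version and the detected GPUs:

import torch

print("PyTorch version:", torch.__version__)
print("CUDA available: ", torch.cuda.is_available())
print("GPU count:      ", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}:", torch.cuda.get_device_name(i))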

Using Docker images#

It is possible to use pre-built Docker images of PyTorch.

On our systems we use Apptainer (previously known as Singularity) instead of Docker. Apptainer can download a Docker container and convert it into its own SIF file format.

From DockerHub#

To download and convert the latest PyTorch container from DockerHub run:

cd $WORK
export APPTAINER_CACHEDIR=$(mktemp -d)
apptainer pull pytorch-latest.sif docker://pytorch/pytorch:latest
rm -r "$APPTAINER_CACHEDIR"

If you want to install a tag other than latest, valid tags can be found at https://hub.docker.com/r/pytorch/pytorch/tags.

From Nvidia NGC#

To download and convert the PyTorch container from Nvidia NGC run:

cd $WORK
export APPTAINER_CACHEDIR=$(mktemp -d)
apptainer pull pytorch-ngc-23.11-py3.sif docker://nvcr.io/nvidia/pytorch:23.11-py3
rm -r "$APPTAINER_CACHEDIR"

To get the latest container, replace tag 23.11-py3 with the newest found on the tag page.

Using the imported container#

Within your job script, you use the container as follows:

apptainer exec pytorch-latest.sif ./script.py

In the container /home and /apps are available as they are automatically bind-mounted.

For a simple test run:

apptainer exec pytorch-latest.sif python3 -c 'import torch; print(torch.cuda.is_available())'
# output when GPUs are usable:
# INFO:    underlay of /etc/localtime required more than 50 (70) bind mounts
# INFO:    underlay of /usr/bin/nvidia-smi required more than 50 (283) bind mounts
# True

Increasing performance#

See PyTorch's Performance Tuning Guide, which already provides dozens of optimization tips.

Increasing performance with torch.compile#

PyTorch version >= 2.0 includes torch.compile for speeding up PyTorch scripts.

Usage of torch.compile is trivial (adapted from Getting Started):

import torch

device = "cuda" # assuming we have a GPU available

model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True).to(device)
# see https://pytorch.org/docs/stable/dynamo/get-started.html#existing-backends for other backends
model = torch.compile(model, backend="inductor").to(device)
model(torch.randn(1,3,64,64).to(device))

torch.compile uses TorchDynamo, a Python-level JIT compiler for speeding up PyTorch scripts. TorchDynamo can use TorchInductor, which in turn uses Triton as the backend for execution on GPUs.

When you install a recent PyTorch version (>= 2.0), TorchDynamo, TorchInductor, and Triton are automatically included.
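
As a rough illustration of the effect (a minimal sketch, not a proper benchmark; run it on a GPU node, and note that the achievable speed-up depends heavily on the model and GPU), you can compare a few forward passes of an eager model against its compiled counterpart:

import time

import torch
import torch.nn as nn

device = "cuda"
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).to(device)
x = torch.randn(256, 1024, device=device)

def bench(m, steps=100):
    # warm-up; for the compiled model this triggers the actual compilation
    for _ in range(10):
        m(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(steps):
        m(x)
    torch.cuda.synchronize()
    return time.perf_counter() - start

print("eager:   ", bench(model))
print("compiled:", bench(torch.compile(model, backend="inductor")))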

More information can be found in the PyTorch documentation.

Parallel training by using multiple GPUs on a single or multiple nodes#

Distributed Data Parallel (DDP) is PyTorch's functionality for data-parallel training: the model is replicated across multiple GPUs on one or multiple nodes to speed up training.

DDP is different from PyTorch's Data Parallel (DP), which is deprecated. In contrast to DP, which is only multi-threaded, DDP uses multiple processes.

PyTorch already provides fundamental documentation for distributed training in general and for Distributed Data Parallel in particular.

DDP requirements are:

  • Each process must call torch.distributed.init_process_group() (implicitly or explicitly).
  • Each process should use only one GPU; to achieve this, either
    • call torch.cuda.set_device(<logical GPU ID>), or
    • set the env. var. CUDA_VISIBLE_DEVICES individually for each process.
  • The user must define how the data is distributed, e.g. via torch.utils.data.distributed.DistributedSampler (see the sketch below).

DDP uses the term process group, which is comparable to an MPI communicator.
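
The following sketch illustrates these requirements in one place (a minimal sketch, assuming the script is launched via torchrun, which sets the LOCAL_RANK environment variable; the random dataset and batch size are made up for the example):

import os

import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

dist.init_process_group(backend="nccl")      # one call per process
local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
torch.cuda.set_device(local_rank)            # one GPU per process

# DistributedSampler gives every rank a different shard of the dataset
dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 5))
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)                 # reshuffle differently each epoch
    for x, y in loader:
        x, y = x.cuda(), y.cuda()
        # ... forward/backward with a DDP-wrapped model ...

dist.destroy_process_group()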

Backends#

DDP has multiple backends:

  • gloo: CPUs and GPUs
  • nccl: GPUs only, based on NCCL
  • mpi: only available when PyTorch was manually compiled to include it; it does not come with the default installation

See also the common environment variables described in the torch.distributed documentation.

In general, nccl is preferred for GPUs as it is faster and supports InfiniBand (IB).

Restrictions of nccl:

  • with nccl, only one model/process per GPU is possible
  • with gloo, multiple processes per GPU are possible, but this yields less performance

The nccl backend uses the NCCL library.

ROCm currently only supports the nccl and gloo distributed backends; see the documentation.
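
Given these restrictions, a common pattern (a minimal sketch, assuming the usual rendezvous environment variables are set, e.g. by torchrun) is to select nccl when GPUs are available and to fall back to gloo otherwise:

import torch
import torch.distributed as dist

# nccl for GPU jobs, gloo as a CPU-only fallback
backend = "nccl" if torch.cuda.is_available() else "gloo"
dist.init_process_group(backend=backend)
print(f"rank {dist.get_rank()}/{dist.get_world_size()} uses backend {dist.get_backend()}")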

Launch via torchrun#

You can launch your application via torchrun from PyTorch.

In case torchrun is not in your PATH environment variable, i.e., you cannot execute it directly, you can alternatively use:

python -m torch.distributed.run ...

Launch Python scripts via torchrun on our systems:

  • torchrun has to be started on each compute node.
  • For more than one compute node, srun can be used.
  • torchrun requires some flags to be adjusted for each node; see the example scripts below.
    • e.g. --node-rank must be assigned the rank of the respective node

Notes on srun:

  • Add --cpu-bind=verbose to show affinity or set SLURM_CPU_BIND=verbose. Ideally you want to bind a process closest to the GPU it uses.
  • srun sets the following env. vars. that might be useful (see documentation for more):

  • SLURM_LAUNCH_NODE_IPADDR: IP address of the node where srun was launched from.

    • not sure if torchrun supports IPv6
  • SLURM_NODEID: Local node index, ranging from 0 to the number of nodes in the job - 1. It can be used as input for the --node-rank flag when srun calls a wrapper script (a sketch of such a wrapper follows after this list).
  • SLURM_PROCID: The MPI rank, ranging from 0 to no. of processes to be launched - 1.
  • SLURM_GPUS_ON_NODE: The no. of GPUs available on the current node.
  • SLURM_JOB_NUM_NODES: The total no. of nodes in this job.
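
As an illustration of the wrapper-script idea mentioned above, such a wrapper could be a small Python script that builds the torchrun command line from these variables (a hypothetical sketch; the bash job scripts below achieve the same without an extra file):

#!/usr/bin/env python3
# hypothetical wrapper, started once per node by srun; it derives the
# torchrun flags from the environment variables srun has set
import os

cmd = [
    "torchrun",
    f"--nnodes={os.environ['SLURM_JOB_NUM_NODES']}",
    f"--nproc-per-node={os.environ['SLURM_GPUS_ON_NODE']}",
    f"--master-addr={os.environ['SLURM_LAUNCH_NODE_IPADDR']}",
    "--master-port=29400",
    f"--node-rank={os.environ['SLURM_NODEID']}",
    "script.py",
]
os.execvp(cmd[0], cmd)  # replace the wrapper process with torchrun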

Launch torchrun on each node manually#

The following shows a prototypical example of a job script where torchrun is used to launch a Python script:

  • For each GPU on each node an instance of the script is launched.
  • It is assumed each node contains the same no. of GPUs.

It might be required to change the value of MASTER_PORT, if it is already in use.

#!/bin/bash -l
#SBATCH ... all your typical sbatch options ...

# load modules, 
#   module add cuda/...
# activate environments

MASTER_PORT=29400

# optional: if the torchrun script is not in your PATH, use
# TORCHRUN="python -m torch.distributed.run"
# else
TORCHRUN=torchrun

# optional: enable logging and increase log level
# export TORCH_CPP_LOG_LEVEL=INFO
# export TORCH_DISTRIBUTED_DEBUG=DETAIL

declare -i i=0
for Host in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
    if [ "$i" == 0 ]; then
        MASTER="$Host"
    fi

    srun -N 1 -w "$Host" --cpu-bind=verbose $TORCHRUN \
        --nnodes="$SLURM_JOB_NUM_NODES" \
        --nproc-per-node="$SLURM_GPUS_ON_NODE" \
        --master-addr="$MASTER" --master-port="$MASTER_PORT" --start-method=forkserver \
        --node-rank "$i" \
        script.py &

    ((++i))
done

wait

Launch torchrun on each node automatically#

In this variant, all torchrun processes are started at once. torchrun is configured via environment variables that srun sets upon invocation, hence the indirection through running /bin/bash first.

It might be required to change the value of MASTER_PORT, if it is already in use.

#!/bin/bash -l
#SBATCH ... all your typical sbatch options ...

# load modules, 
#   module add cuda/...
# activate environments

# optional: if the torchrun script is not in your PATH, use
# TORCHRUN="python -m torch.distributed.run"
# else
TORCHRUN=torchrun

MASTER_PORT=29400

# optional: enable logging and increase log level
# export TORCH_CPP_LOG_LEVEL=INFO
# export TORCH_DISTRIBUTED_DEBUG=DETAIL

srun -N "$SLURM_JOB_NUM_NODES" --cpu-bind=verbose \
  /bin/bash -c "$TORCHRUN --nnodes=\$SLURM_JOB_NUM_NODES --nproc-per-node=\$SLURM_GPUS_ON_NODE --master-addr=\$SLURM_LAUNCH_NODE_IPADDR --master-port=\"$MASTER_PORT\" --start-method=forkserver --node-rank=\$SLURM_NODEID script.py"

Example script#

This tutorial on DDP covers multiple scenarios, ranging from data parallelism to model parallelism, and includes several examples.

The PyTorch documentation contains further examples regarding distributed training.

The following script can be used with torchrun; it is taken from PyTorch's documentation and adapted to print some additional output.

#!/usr/bin/env python3
# based on https://pytorch.org/tutorials/intermediate/ddp_tutorial.html#initialize-ddp-with-torch-distributed-run-torchrun
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP
import socket

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def demo_basic():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    print(f"Start running basic DDP example on rank {rank}.", flush=True)

    # We distribute the GPUs on a node round-robin over the processes.  Here
    # we assume the ranks on one node are consecutive.
    device_id = rank % torch.cuda.device_count()
    print(f"host: {socket.getfqdn()}  rank: {rank}  device_id: {device_id}", flush=True)

    # Create the model and move it to the GPU with id device_id.
    model = ToyModel().to(device_id)
    ddp_model = DDP(model, device_ids=[device_id])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(device_id)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    print("main", flush=True)
    demo_basic()