PyTorch#
This guide describes how to use and install PyTorch, a machine learning framework.
Install or build packages using an interactive job on the target cluster (Alex, Fritz, TinyGPU, Woody) to make sure GPU support and hardware features are properly autodetected at installation time.
For internet access on the compute node, configure a proxy by executing the following lines in your shell inside the interactive job:
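The exact proxy host and port are site-specific; the following is only a sketch with a placeholder address, so substitute the values given in the cluster documentation:
# Placeholder proxy address -- replace with the proxy given in the cluster documentation.
export http_proxy=http://proxy:80
export https_proxy=http://proxy:80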
Installation via pip/conda#
Preparation#
- Start an interactive job on a cluster node. See the cluster documentation page for hints.
- Load the Python module (see the sketch after this list). Without loading the Python module, the system-installed Python is used, which is quite old.
- Optional: create and activate a virtual environment:
  - For conda, see conda environments.
  - For Python venv, see Virtual environments with venv.
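A minimal sketch of these preparation steps, assuming the module is simply called python and the virtual environment is placed under $WORK (module names, versions, and paths may differ on your cluster):
# Load a Python module (exact name/version is an assumption; check `module avail python`).
module load python

# Create and activate a venv under $WORK (the path is just an example).
python3 -m venv "$WORK/venvs/pytorch"
source "$WORK/venvs/pytorch/bin/activate"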
Installation#
The command line for installing PyTorch depends on several factors.
Go to: https://pytorch.org/get-started/locally/
and select:
- PyTorch Build: Stable
- Your OS: Linux
- Package: Conda or Pip, depending on what package manager you want to use.
- Language: Python
- Compute Platform: choose a CUDA version
The command line to execute will be shown at the bottom of the table.
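For illustration only, a typical pip command for a CUDA build looks like the following; the package set, CUDA version (here cu121), and index URL are assumptions that change over time, so copy the exact command from the selector:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121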
Test installation#
To check that your PyTorch installation works and detects the GPU(s), execute the following command on a compute node, after loading the Python module and activating the virtual environment you installed PyTorch into:
python3 -c 'import torch; print(torch.cuda.is_available())'
# output when GPUs are usable:
# True
Using Docker images#
It is possible to use pre-built Docker images of PyTorch.
On our systems we use Apptainer (previously known as Singularity)
instead of Docker.
Apptainer allows for downloading and converting a Docker container into
its own sif
file format.
From DockerHub#
To download and convert the latest PyTorch container from DockerHub run:
cd $WORK
export APPTAINER_CACHEDIR=$(mktemp -d)
apptainer pull pytorch-latest.sif docker://pytorch/pytorch:latest
rm -r "$APPTAINER_CACHEDIR"
If you want to install a tag other than latest, valid tags can be found at https://hub.docker.com/r/pytorch/pytorch/tags.
From Nvidia NGC#
To download and convert the PyTorch container from Nvidia NGC run:
cd $WORK
export APPTAINER_CACHEDIR=$(mktemp -d)
apptainer pull pytorch-ngc-23.11-py3.sif docker://nvcr.io/nvidia/pytorch:23.11-py3
rm -r "$APPTAINER_CACHEDIR"
To get the latest container, replace the tag 23.11-py3 with the newest one listed on the NGC tag page.
Using the imported container#
Within your job script, you use the container as follows:
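A minimal job-script sketch, assuming the image was pulled to $WORK as pytorch-latest.sif and that train.py is a placeholder for your own training script (adjust the GPU request and other sbatch options to your cluster):
#!/bin/bash -l
#SBATCH --gres=gpu:1
#SBATCH ... all your typical sbatch options ...

cd "$WORK"
# Run the training script inside the container; add --nv if GPUs are not detected.
apptainer exec pytorch-latest.sif python3 train.py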
In the container /home
and /apps
are available as they are automatically
bind-mounted.
For a simple test, run:
apptainer exec pytorch-latest.sif python3 -c 'import torch; print(torch.cuda.is_available())'
# output when GPUs are usable:
# INFO: underlay of /etc/localtime required more than 50 (70) bind mounts
# INFO: underlay of /usr/bin/nvidia-smi required more than 50 (283) bind mounts
# True
Increasing performance#
See PyTorch's Performance Tuning Guide, which provides dozens of optimization tips.
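To give a flavor of the kind of optimizations covered there, the following sketch enables a few common, generic settings (illustrative only, not cluster-specific; the dataset is a placeholder):
import torch
from torch.utils.data import DataLoader, TensorDataset

# Let cuDNN benchmark and pick the fastest convolution algorithms for fixed input shapes.
torch.backends.cudnn.benchmark = True

# Placeholder dataset; parallel workers and pinned memory speed up host-to-device copies.
dataset = TensorDataset(torch.randn(1024, 3, 64, 64), torch.randint(0, 10, (1024,)))
loader = DataLoader(dataset, batch_size=64, num_workers=4, pin_memory=True)

for inputs, labels in loader:
    # non_blocking=True only helps in combination with pin_memory=True.
    inputs = inputs.to("cuda", non_blocking=True)
    labels = labels.to("cuda", non_blocking=True)
    # ... forward/backward pass ...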
Increasing performance with torch.compile#
PyTorch version >= 2.0 includes torch.compile
for speeding up PyTorch scripts.
Usage of torch.compile
is trivial (adapted from Getting Started):
import torch
device = "cuda" # assuming we have a GPU available
model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True).to(device)
# see https://pytorch.org/docs/stable/dynamo/get-started.html#existing-backends for other backends
model = torch.compile(model, backend="inductor").to(device)
model(torch.randn(1,3,64,64).to(device))
torch.compile uses TorchDynamo, a Python-level JIT compiler for speeding up PyTorch scripts. TorchDynamo can use TorchInductor, which in turn uses Triton as a backend for execution on GPUs.
When you install a recent PyTorch version (>= 2.0), TorchDynamo, TorchInductor, and Triton are included automatically.
More information can be found in the PyTorch documentation.
Parallel training by using multiple GPUs on a single or multiple nodes#
Distributed Data Parallel (DDP) is PyTorch's functionality to easily provide data-parallel training. Here, the model is replicated across multiple GPUs on one or multiple nodes to speed up training.
DDP is different from PyTorch's Data Parallel (DP), which is deprecated. In contrast to DP, which is only multi-threaded, DDP uses multiple processes.
PyTorch already provides fundamental documentation on distributed training in general:
- Distributed training
- Overview of torch.distributed
Documentation related to Distributed Data Parallel:
- Launching and configuring distributed data parallel applications
- Single- and Multi-process Data Loading
- Examples from PyTorch, also including a data partitioner.
- Tutorial covering multiple scenarios, from data parallelism to model parallelism, with several examples.
- torch.nn.parallel.DistributedDataParallel
- torch.utils.data.distributed.DistributedSampler
DDP requirements (see the sketch after this list):
- Each process must call torch.distributed.init_process_group() (implicitly or explicitly).
- Each process should use only one GPU. To achieve this, either:
  - call torch.cuda.set_device(<logical GPU ID>), or
  - set the environment variable CUDA_VISIBLE_DEVICES for each process.
- The user must define how the data is distributed, e.g. via torch.utils.data.distributed.DistributedSampler.
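A minimal sketch of these requirements, assuming the script is launched by torchrun (which sets RANK, LOCAL_RANK, and WORLD_SIZE) and that train_dataset is a placeholder for your own dataset:
import os
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Placeholder dataset -- replace with your own.
train_dataset = TensorDataset(torch.randn(256, 10), torch.randn(256, 5))

# 1. Each process joins the process group (torchrun provides the rendezvous information).
dist.init_process_group(backend="nccl")

# 2. Bind each process to exactly one GPU via its node-local rank.
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# 3. Distribute the data: each rank sees a distinct shard of the dataset per epoch.
sampler = DistributedSampler(train_dataset)
loader = DataLoader(train_dataset, batch_size=32, sampler=sampler)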
DDP uses the term process group, which is comparable to an MPI communicator.
Backends#
DDP has multiple backends:
- gloo: CPUs and GPUs
- nccl: GPUs only, based on NCCL
- MPI: only available when PyTorch was manually compiled to include it; it does not come with the default installation.
- Common environment variables
In general, nccl is preferred for GPUs as it is faster and supports InfiniBand (IB).
Restrictions of nccl:
- With nccl, only one model/process per GPU is possible.
- With gloo, multiple processes per GPU are possible, but this yields less performance.
The nccl backend uses the NCCL library:
- Documentation at Nvidia
- Environment variables (see the example after this list)
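For example, the following NCCL environment variables can be set in a job script to debug communication issues (an illustration, not a required setting; the interface name is an assumption):
# Print NCCL initialization and communication details to the job log.
export NCCL_DEBUG=INFO
# Optionally restrict NCCL to a specific network interface.
# export NCCL_SOCKET_IFNAME=ib0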
ROCm currently only supports nccl and gloo as distributed backends. See the documentation.
Launch via torchrun#
You can launch your application via torchrun
from PyTorch.
In case torchrun
is not in your PATH
environment variable, i.e., you cannot execute it directly, you can alternatively use:
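torchrun is a thin wrapper around the torch.distributed.run module, so the following invocation is equivalent (script.py and the arguments are placeholders):
python3 -m torch.distributed.run <torchrun arguments> script.py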
Launch Python scripts via torchrun on our systems:
- torchrun has to be started on each compute node.
- For more than one compute node, srun can be used.
- torchrun requires some flags to be adjusted for each node, see the example scripts below; e.g. --node-rank must be assigned the rank of the node.
Notes on srun:
- Add --cpu-bind=verbose to show the affinity or set SLURM_CPU_BIND=verbose. Ideally you want to bind a process closest to the GPU it uses.
- srun sets the following environment variables that might be useful (see the documentation for more):
  - SLURM_LAUNCH_NODE_IPADDR: IP address of the node where srun was launched from (not sure if torchrun supports IPv6).
  - SLURM_NODEID: Local node index, ranging from 0 to the number of nodes in the job minus 1. Could be used as input for the --node-rank flag when srun calls another wrapper script.
  - SLURM_PROCID: The MPI rank, ranging from 0 to the number of processes to be launched minus 1.
  - SLURM_GPUS_ON_NODE: The number of GPUs available on the current node.
  - SLURM_JOB_NUM_NODES: The total number of nodes in this job.
Launch torchrun on each node manually#
The following shows a prototypical example of a job script where torchrun
is used to launch a Python script:
- For each GPU on each node, an instance of the script is launched.
- It is assumed that each node contains the same number of GPUs.
It might be required to change the value of MASTER_PORT if it is already in use.
#!/bin/bash -l
#SBATCH --ntasks-per-node=1
#SBATCH ... all your typical sbatch options ...
# load modules,
# module add cuda/...
# activate environments
MASTER_PORT=29400
# optional: if the torchrun script is not in your PATH, use the module form instead:
# TORCHRUN="python3 -m torch.distributed.run"
# else
TORCHRUN=torchrun
# optional: enable logging and increase log level
# export TORCH_CPP_LOG_LEVEL=INFO
# export TORCH_DISTRIBUTED_DEBUG=DETAIL
declare -i i=0
for Host in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
if [ "$i" == 0 ]; then
MASTER="$Host"
fi
# $TORCHRUN is intentionally unquoted so that a multi-word command is split into its words
srun -N 1 -w "$Host" --cpu-bind=verbose $TORCHRUN \
--nnodes="$SLURM_JOB_NUM_NODES" \
--nproc-per-node="$SLURM_GPUS_ON_NODE" \
--master-addr="$MASTER" --master-port="$MASTER_PORT" --start-method=forkserver \
--node-rank "$i" \
script.py &
((++i))
done
wait
Launch torchrun on each node automatically#
Start all torchrun processes at once. torchrun is configured via environment variables that srun sets upon invocation, hence the indirection through running /bin/bash first.
It might be required to change the value of MASTER_PORT if it is already in use.
#!/bin/bash -l
#SBATCH --ntasks-per-node=1
#SBATCH ... all your typical sbatch options ...
# load modules,
# module add cuda/...
# activate environments
# optional: if the torchrun script is not in your PATH, use the module form instead:
# TORCHRUN="python3 -m torch.distributed.run"
# else
TORCHRUN=torchrun
MASTER_PORT=29400
# optional: enable logging and increase log level
# export TORCH_CPP_LOG_LEVEL=INFO
# export TORCH_DISTRIBUTED_DEBUG=DETAIL
srun -N "$SLURM_JOB_NUM_NODES" --cpu-bind=verbose \
/bin/bash -c "$TORCHRUN --nnodes=\$SLURM_JOB_NUM_NODES --nproc-per-node=\$SLURM_GPUS_ON_NODE --master-addr=\$SLURM_LAUNCH_NODE_IPADDR --master-port=\"$MASTER_PORT\" --start-method=forkserver --node-rank=\$SLURM_NODEID script.py"
Example script#
This tutorial on DDP covers multiple scenarios, ranging from data parallelism to model parallelism, and includes several examples.
This documentation contains examples regarding PyTorch's distributed functionality.
The following script can be used with torchrun; it is taken from PyTorch's documentation and adapted with some additional output.
#!/usr/bin/env python3
# based on https://pytorch.org/tutorials/intermediate/ddp_tutorial.html#initialize-ddp-with-torch-distributed-run-torchrun
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP
import socket
class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def demo_basic():
    dist.init_process_group("nccl")
    rank = dist.get_rank()
    print(f"Start running basic DDP example on rank {rank}.", flush=True)

    # We distribute the GPUs on a node round robin over the processes. Here
    # we assume the ranks on one node are consecutive.
    device_id = rank % torch.cuda.device_count()
    print(f"host: {socket.getfqdn()} rank: {rank} device_id: {device_id}", flush=True)

    # Create the model and move it to the GPU with id device_id.
    model = ToyModel().to(device_id)
    ddp_model = DDP(model, device_ids=[device_id])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(device_id)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    # Clean up the process group when done.
    dist.destroy_process_group()


if __name__ == "__main__":
    print("main", flush=True)
    demo_basic()
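For a quick single-node test of the script above (assuming it is saved as script.py), it can also be launched directly inside an interactive GPU job; the process count below is an example and should match the number of GPUs in the job:
torchrun --standalone --nproc-per-node=4 script.py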