Batch System Slurm

All clusters at NHR@FAU use the batch system Slurm for resource management and job scheduling. The compute nodes cannot be accessed directly. The batch system handles the queuing and priority of jobs according to the specified computational resources.

When logging into an HPC system, you are placed on a login node. From there, you can manage your data, set up your workflow, and prepare and submit jobs. The login nodes are not suitable for computational work!

This documentation gives you a general overview of how to use the Slurm batch system and is applicable to all clusters. For more cluster-specific information, consult the respective cluster documentation!

Batch job submission with `sbatch`

On HPC systems, computational work and its resource requirements are encapsulated in so-called batch jobs. A batch job includes the following basic specifications:

resource requirements (number of nodes and cores, number of GPUs, ...)
job runtime (usually max. 24 hours)
setup of runtime environment (loading modules, activating environments, staging files, ...)
commands for application run

These specifications are normally written into so-called job scripts and submitted to the batch system by using

sbatch [options] <job_script>

After submission, sbatch prints the unique job ID, which you can use to manage and control the job.

For TinyFat and TinyGPU, use the respective command wrapper sbatch.tinyfat/sbatch.tinygpu.
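Because the job ID is needed for most follow-up commands, it can be convenient to capture it in a shell variable at submission time. The following is a minimal sketch, not a site-specific recipe: `extract_job_id` is a hypothetical helper that pulls the ID out of the usual `Submitted batch job <ID>` message, and `job.sh` is a placeholder script name. The `sbatch` call is only attempted when the command is available and the script exists.

```shell
# Hypothetical helper: extract the job ID from sbatch's standard
# "Submitted batch job <ID>" message (the ID is the fourth field).
extract_job_id() {
    printf '%s\n' "$1" | awk '/^Submitted batch job/ { print $4 }'
}

# Submit only if sbatch is available and the placeholder script exists.
if command -v sbatch >/dev/null 2>&1 && [ -f job.sh ]; then
    job_id="$(extract_job_id "$(sbatch job.sh)")"
    echo "Submitted job ${job_id}"
fi
```

Alternatively, `sbatch --parsable` prints only the job ID (followed by the cluster name in multi-cluster setups), which avoids parsing the message.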

Interactive jobs with `salloc`

For interactive work like debugging and testing, you can use interactive jobs to open an interactive shell on one of the compute nodes. On most clusters, some nodes are reserved for short jobs with less than one hour of runtime.

The command salloc is used to get an interactive shell on a compute node. salloc takes the same options as sbatch. After issuing salloc, do not close your terminal session but wait until the resources become available. You will then be logged in directly on the first granted compute node. When you close your terminal, the allocation is revoked automatically. There is currently no way to request X11 forwarding to an interactive Slurm job.

To run an interactive job with Slurm on Meggie, Alex and Fritz:

salloc [options for number of nodes, walltime, etc.]

For TinyFat and TinyGPU, use the respective command wrapper salloc.tinyfat/salloc.tinygpu.

Settings from the calling shell (e.g. loaded module paths) will automatically be inherited by the interactive job. To avoid issues in your interactive job, purge all loaded modules via module purge before issuing the salloc command.

Run parallel applications with `srun`

Use srun instead of mpirun to start MPI-parallel applications inside job allocations created with sbatch or salloc.

Options for `sbatch`/`salloc`/`srun`

The following parameters can be specified as options for sbatch, salloc, and srun or included in the job script by using the script directive #SBATCH:

Slurm option	Description
`-N`, `--nodes=<N>`	Number of requested nodes. Default: 1
`-n`, `--ntasks=<N>`	Total number of tasks (MPI processes). Can be omitted if `--nodes` and `--ntasks-per-node` are given. Default: 1
`--ntasks-per-node=<N>`	Number of tasks (MPI processes) per node.
`-c`, `--cpus-per-task=<N>`	Number of threads (logical cores) per task. Used for OpenMP or hybrid jobs. Typically should be equal to `OMP_NUM_THREADS`. (1)
`-t`, `--time=HH:MM:SS`	Specifies the required wall clock time (runtime). If you omit this option, a default time of 10 min will be used.
`-p`, `--partition=<name>`	Specifies the partition to which the job is submitted. If no partition is given, the default partition of the cluster is used.
`--job-name=<name>`	Specifies the name which is shown with `squeue`.
`--mail-user=<email>`	Email address that is notified on status changes of your job; which changes trigger a mail is set with `--mail-type`.
`--mail-type=<type>`	Status changes on which you are notified. Valid options: `BEGIN`, `END`, `FAIL`, `TIME_LIMIT` and `ALL`.
`--exclusive`	Exclusive usage of requested compute nodes; you will be charged for all CPUs/cores/GPUs on the node.
`-a`, `--array=<arg>`	Submit an array job (examples)
`--constraint=hwperf`	Access to hardware performance counters (e.g. using `likwid-perfctr`). This option is not required for e.g. `likwid-pin` or `likwid-mpirun`.
`--export=NONE`	Only for `sbatch`. Always use together with `unset SLURM_EXPORT_ENV`. For more details see Environment export.

(1) NOTE: Beginning with Slurm 22.05, srun will not inherit the --cpus-per-task value requested by salloc or sbatch. It must be requested again with the call to srun or set with the SRUN_CPUS_PER_TASK environment variable if desired for the task(s).
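For hybrid MPI/OpenMP jobs, the note above means that the thread count has to be passed on explicitly. A minimal sketch of the relevant job-script lines, assuming the job requested `--cpus-per-task` and falling back to 1 when it did not; `./application` is a placeholder binary and `cpus` is just an illustrative variable name:

```shell
# Use the per-task core count of the allocation; fall back to 1 if
# --cpus-per-task was not requested (SLURM_CPUS_PER_TASK is then unset).
cpus="${SLURM_CPUS_PER_TASK:-1}"

# One OpenMP thread per allocated core.
export OMP_NUM_THREADS="$cpus"

# Since Slurm 22.05, srun no longer inherits --cpus-per-task from
# sbatch/salloc, so hand the value over through the environment.
export SRUN_CPUS_PER_TASK="$cpus"

# Launch only where srun exists; ./application is a placeholder binary.
if command -v srun >/dev/null 2>&1 && [ -x ./application ]; then
    srun ./application
fi
```

Passing `--cpus-per-task="$cpus"` directly to `srun` has the same effect as setting `SRUN_CPUS_PER_TASK`.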

Many more options are available. For details, refer to the official Slurm documentation for sbatch, salloc or srun.

Job scripts - general structure

A batch or job file is generally a script holding information like resource allocations, environment specifications, and commands to execute an application during the runtime of the job. The following example shows the general structure of a job script. More detailed examples are available in Job Script Examples.

#!/bin/bash -l                     # Interpreter directive; -l is necessary to initialize modules correctly!
#
#SBATCH --nodes=X                  # Resource requirements, job runtime, other options 
#SBATCH --ntasks=X                 # All #SBATCH lines have to follow uninterrupted
#SBATCH --time=hh:mm:ss            
#SBATCH --job-name=job123 
#SBATCH --export=NONE              # do not export environment from submitting shell
                                    # first non-empty non-comment line ends SBATCH options
unset SLURM_EXPORT_ENV             # enable export of environment from this script to srun

module load <modules>              # Setup job environment (load modules, stage data, ...)

srun ./application [options]       # Execute parallel application
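As an illustration of the template above, the following sketch writes a filled-in job script to a file and checks its shell syntax before submission. All concrete values are assumptions chosen for illustration: the resource numbers, the job name, the file name `example_job.sh`, the module name `openmpi`, and the binary `./application` need to be adapted to the cluster and application.

```shell
# Write an illustrative job script; the quoted 'EOF' keeps the content literal.
job_script="example_job.sh"
cat > "$job_script" <<'EOF'
#!/bin/bash -l
#
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:30:00
#SBATCH --job-name=mpi_test
#SBATCH --export=NONE

unset SLURM_EXPORT_ENV

# Hypothetical module name; check "module avail" on the cluster.
module load openmpi

# Placeholder binary; srun starts the 2 x 4 = 8 MPI processes.
srun ./application
EOF

# Check the shell syntax without running anything.
bash -n "$job_script" && echo "Syntax OK, submit with: sbatch $job_script"
```

The syntax check only catches shell errors. It does not validate the `#SBATCH` options, which are comments from the shell's point of view.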

Manage and control jobs

Job and cluster status

Slurm command	Description
`squeue <options>`	Displays status information on your own jobs.
`scontrol show job <jobID>`	Displays detailed information on a specific job.
`sinfo`	Overview of cluster status. Shows partitions and availability of nodes.

In its last column, squeue will also state the reason why a job is not running. More information is available under Job priorities.
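To see at a glance why jobs are still waiting, the output of `squeue` can be filtered. The sketch below relies on the `squeue` format specifiers `%i` (job ID), `%T` (job state), and `%r` (reason); `pending_reasons` is a hypothetical helper name, and the query is only run where `squeue` is available.

```shell
# Hypothetical helper: read "<jobID> <state> <reason>" lines from stdin and
# print the reason for every pending job.
pending_reasons() {
    awk '$2 == "PENDING" { id = $1; $1 = ""; $2 = ""; sub(/^ +/, ""); print id ": " $0 }'
}

# Query your own jobs only where squeue is available.
if command -v squeue >/dev/null 2>&1; then
    squeue --user="$(id -un)" --noheader --format="%i %T %r" | pending_reasons
fi
```

Each output line has the form `<jobID>: <reason>`, for example a job waiting because of its priority or because the requested resources are not yet free.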

Editing jobs

If your job is not running yet, it is possible to change details of the resource allocation, e.g. the runtime with

scontrol update TimeLimit=4:00:00 JobId=<jobID>

For more details and available options, see the official documentation.

Canceling jobs

To cancel a job and remove it from the queue, use scancel <jobID>. This works for queued as well as running jobs. To cancel all your jobs at once, use scancel -u <your_username>.

Attach to a running job

Use the following command on the frontend node to attach to a specific running job:

srun --jobid=<jobID> --overlap --pty /bin/bash -l

Attaching to a running job can be used e.g. to check GPU utilization via nvidia-smi. For more information on nvidia-smi and GPU profiling, see Working with NVIDIA GPUs.

Slurm environment variables

The Slurm scheduler typically sets environment variables to tell the job about what resources were allocated to it. These can also be used in batch scripts. A complete list can be found in the official Slurm documentation. The most useful are given below:

Variable name	Description
`$SLURM_JOB_ID`	Job ID
`$SLURM_SUBMIT_DIR`	Directory from which the job was submitted
`$SLURM_JOB_NODELIST`	List of nodes on which job runs
`$SLURM_JOB_NUM_NODES`	Number of nodes allocated to job
`$SLURM_CPUS_PER_TASK`	Number of cores per task; set `$OMP_NUM_THREADS` to this value for OpenMP/hybrid applications
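These variables are handy for logging what a job actually received. A minimal sketch, with `job_summary` as a hypothetical helper name; outside of a job the variables are unset, so placeholder defaults are printed instead:

```shell
# Hypothetical helper: print a one-line summary of the allocation.
# The defaults apply when the script runs outside of a Slurm job.
job_summary() {
    echo "job=${SLURM_JOB_ID:-none} nodes=${SLURM_JOB_NUM_NODES:-0} nodelist=${SLURM_JOB_NODELIST:-none} submit_dir=${SLURM_SUBMIT_DIR:-$PWD}"
}

job_summary
```

Placing such a call near the top of a job script records the allocation in the job's output file, which can help when comparing runs later.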

Environment export

Slurm automatically propagates environment variables that are set in the shell at the time of submission into the Slurm job. This includes paths set by currently loaded modules. To start the batch job with a clean environment, it is recommended to add #SBATCH --export=NONE and unset SLURM_EXPORT_ENV to the job script. Unsetting SLURM_EXPORT_ENV inside the job script ensures that all Slurm-specific variables and the modules loaded in the script are propagated to the srun call. If this is omitted, srun might not work as expected. Specifying export SLURM_EXPORT_ENV=ALL is equivalent to unset SLURM_EXPORT_ENV; the two can be used interchangeably.