# Advanced topics Slurm
## Array jobs
Array jobs can be used to submit multiple jobs that share the same parameters, such as the executable and resource requirements. They can be controlled and monitored as a single unit.
The Slurm option is `-a` or `--array=<idxs>`. The following specifications are possible for the array index values `<idxs>`:
- comma-separated list, e.g. `--array=0,1,2,17`
- range-based, e.g. `--array=0-15`
- mix of comma-separated and range-based, e.g. `--array=0,1,10-12`
- step-based, e.g. `--array=0-15:4`
A maximum number of simultaneously running array tasks can be specified using the `%` separator, e.g. `--array=0-19%5` creates 20 tasks, of which at most 5 are allowed to run at the same time.
Within the job, two specific environment variables are available: `$SLURM_ARRAY_JOB_ID` is set to the job ID of the first array element and `$SLURM_ARRAY_TASK_ID` is set to the index of the individual array element. You can use these variables inside the job script to distinguish between array elements.
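As an illustration, a minimal sketch of an array job script; `my_program`, the per-task input files, and the resource requests are placeholders and have to be adapted to the actual cluster:

```bash
#!/bin/bash -l
#SBATCH --job-name=array-example
#SBATCH --time=01:00:00
#SBATCH --array=0-19%5          # 20 array tasks, at most 5 running at once

echo "Task ${SLURM_ARRAY_TASK_ID} of array job ${SLURM_ARRAY_JOB_ID}"

# Hypothetical use of the task index to select a per-task input file.
./my_program "input_${SLURM_ARRAY_TASK_ID}.dat"
```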
More information is available in the official Slurm documentation.
## Chain jobs
In some cases, you may want to automatically submit a subsequent job after the current run has finished. This can be achieved by calling `sbatch` in your job script. However, if something goes wrong during startup or runtime, the job will resubmit itself indefinitely until it is canceled manually. To prevent this from happening, the job should only be resubmitted if it has run for a sufficiently long time. The following snippet can be used:
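A minimal sketch of such a check, assuming the job script is saved as `my_jobscript.sh` and `./my_program` stands for the actual workload:

```bash
#!/bin/bash -l
#SBATCH --job-name=chain-example
#SBATCH --time=24:00:00

# actual workload of the job (placeholder)
./my_program

# Resubmit only if this run lasted at least one hour; otherwise stop the
# chain to avoid an endless resubmission loop after early failures.
if [ "$SECONDS" -gt 3600 ]; then
    cd "$SLURM_SUBMIT_DIR"
    sbatch my_jobscript.sh
fi
```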
The bash environment variable `$SECONDS` contains the run time of the shell in seconds. In this example, `sbatch` is only called if the previous job script has run for at least 1 hour.
On the TinyX clusters, `sbatch` has to be used instead of `sbatch.tinyx` within the job script.
## Chain jobs with dependencies
Dependencies can be used if your job relies on the results of one or more preceding jobs. Slurm provides the option `-d` or `--dependency=<dependency_list>` to specify that a job is only allowed to start if the specified conditions are satisfied.
`--dependency=afterany:job_id[:job_id]` will start the job only when all of the listed jobs have terminated.
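For example, the job ID of the first job can be captured with `sbatch --parsable` and passed to the dependent job; the script names are placeholders:

```bash
# Submit the first job and capture its job ID.
JOBID=$(sbatch --parsable first_job.sh)

# The second job will only start after the first one has terminated.
sbatch --dependency=afterany:${JOBID} second_job.sh
```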
There are a number of other possible specifications for `<dependency_list>`. For full details, please consult the official Slurm documentation.
## Node features
Some cluster nodes have specific properties that are used to distinguish them from others in the same partition. One example is the different CPU generations in Woody. Nodes with a specific feature are requested with the Slurm option `-C` or `--constraint` for `salloc` or `sbatch`.
Multiple features can be combined with AND, OR, etc. For more details, please refer to the Slurm documentation.
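For example (the feature names `featureA` and `featureB` are placeholders; the available features depend on the cluster):

```bash
# Request nodes that provide a specific feature.
sbatch --constraint=featureA job.sh

# Combine multiple features with a logical AND.
sbatch --constraint="featureA&featureB" job.sh
```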
## Job priorities
The batch system assigns a priority to each waiting job. This priority value depends on certain parameters like waiting time, partition, user group, and recently used CPU time (a.k.a. fairshare). The ordering of waiting jobs listed by `squeue` does not reflect the priority of the jobs.
If your job is not starting straight away, the reason is listed in the output of `squeue` in the column `NODELIST(REASON)`.
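For example, your own pending jobs and the corresponding reasons can be listed with:

```bash
# List your pending jobs; the last column shows the reason why
# each job has not started yet.
squeue -u "$USER" -t PENDING
```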
Some of the most common reasons:
| Reason | Description |
|---|---|
| Priority | One or more higher-priority jobs are queued. |
| Dependency | This job is waiting for a dependent job to complete. |
| Resources | The job is waiting for resources to become available. |
| AssociationGroup\<Resource\>Limit | All resources assigned to your association/group are currently in use. (1) |
| QOSGrp\<Resource\>Limit | All resources assigned to the specified QoS are currently in use. (1) |
| Partition\<Resource\>Limit | All resources assigned to the specified partition are currently in use. |
| ReqNodeNotAvail | A node specifically required by the job is not currently available. It may be necessary to cancel the job and rerun it with other node specifications. |
(1) `<Resource>` can be a limit for any generic resource (number of GRES (GPUs), nodes, CPUs, or concurrently running/queued jobs), e.g. `AssocGrpGRES` specifies that all GPUs assigned to your association or group are currently in use.
## Exclusive jobs for benchmarking
On some HPC clusters, compute nodes are shared among multiple users and jobs; resources like GPUs and compute cores, however, are never shared. In some cases, e.g. for benchmarking, exclusive access to the compute node may be desired. This can be achieved by using the Slurm parameter `--exclusive`.
Setting `--exclusive` only makes sure that there will be no other jobs running on your nodes. It will not automatically give you access to all resources of the node without explicitly requesting them. This means that if you need them, you still have to specify, e.g., the number of GPUs via the `--gres` parameter and all cores of a node via `--ntasks` or `--cpus-per-task`.
Independent of your resource allocation and usage, exclusive jobs will be billed with all available resources of the node.
## Specific clock frequency
By default, the compute nodes at NHR@FAU run with turbo mode enabled and the "OnDemand" governor. If you need a fixed CPU frequency, e.g. for benchmarking, it can be set by Slurm. Note that you cannot make the CPUs go any faster, only slower, as the default already is turbo mode. `likwid-setFrequencies` is not supported on the clusters.
The frequency (in kHz) can be specified using the `--cpu-freq` option of `srun`:
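For example, to run with (at most) 2.0 GHz; the executable is a placeholder:

```bash
# 2000000 kHz = 2.0 GHz
srun --cpu-freq=2000000 ./my_benchmark
```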
You can also use the `--cpu-freq` option with `salloc` or `sbatch`; however, the frequency will only be set once `srun` is called!
## Access to hardware performance counters
Access to hardware performance counters (e.g. for using `likwid-perfctr`) is not enabled by default. The Slurm option `--constraint=hwperf` has to be used to enable it. On clusters with non-exclusive nodes, the additional option `--exclusive` is necessary.
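For example, an interactive allocation with counter access might look like this (the resource requests are placeholders):

```bash
# Enable access to hardware performance counters; --exclusive is needed
# on clusters where nodes are normally shared between jobs.
salloc --constraint=hwperf --exclusive --ntasks=1 --time=01:00:00
```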
Enabling access to hardware performance counters will automatically disable these metrics in the Job monitoring system.