Advanced topics Slurm#

Array jobs#

Array jobs can be used to submit multiple jobs that share the same parameters, like executable and resource requirements. They can be controlled and monitored as a single unit.

The Slurm option is -a or --array=<idxs>. The following specifications are possible for the array index values idxs:

  • comma separated list, e.g., --array=0,1,2,17,
  • range based, e.g., --array=0-15,
  • mix of comma separated and range based, e.g., --array=0,1,10-12,
  • step based, e.g., --array=0-15:4.

A maximum number of simultaneously running array tasks can be specified using the % separator, e.g., --array=0-19%5 creates 20 tasks, of which at most 5 are allowed to run at the same time.

Within the job, two specific environment variables are available: $SLURM_ARRAY_JOB_ID is set to the first job ID of the array and $SLURM_ARRAY_TASK_ID is set for each array element. You can use these variables inside the job script to distinguish between array elements.
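
As a minimal sketch (the resource requests and the input file naming scheme are placeholders), a job script can use the task ID to select its input:

#!/bin/bash -l
#SBATCH --job-name=array_example
#SBATCH --time=01:00:00
#SBATCH --array=0-15%4

# Each array task receives its own value of $SLURM_ARRAY_TASK_ID (0, 1, ..., 15)
# and processes the corresponding input file; at most 4 tasks run at the same time.
./a.out input_${SLURM_ARRAY_TASK_ID}.dat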

More information is available in the official Slurm documentation.

Chain jobs#

In some cases, you want to automatically submit a subsequent job after the current run has finished. This can be achieved by calling sbatch in your job script. However, if something goes wrong during startup or runtime, the job will resubmit itself indefinitely until it is manually canceled. To prevent this from happening, the job should only be resubmitted if it has run for a sufficiently long time. The following snippet can be used:

#!/bin/bash -l

# $SECONDS holds the run time of the current shell in seconds.
# Resubmit only if this run lasted at least one hour.
if [ "$SECONDS" -gt "3600" ]; then
  cd "${SLURM_SUBMIT_DIR}"
  sbatch job_script
fi

The bash environment variable $SECONDS contains the run time of the shell in seconds. In this example, sbatch is only called when the previous job script has run for at least 1 hour.

On the TinyX clusters, sbatch has to be used instead of sbatch.tinyx within the job script.

Chain jobs with dependencies#

Dependencies can be used if your job relies on the results of one or more preceding jobs. Slurm has an option -d, --dependency=<dependency_list> to specify that a job is only allowed to start if the specified conditions are satisfied.

--dependency=afterany:job_id[:job_id] will start the job only after all of the listed jobs have terminated.
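
For example, a second job can be chained to a first one by capturing the job ID at submission time (job1_script and job2_script are placeholder names; --parsable makes sbatch print only the job ID):

$ jobid=$(sbatch --parsable job1_script)
$ sbatch --dependency=afterany:${jobid} job2_script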

There are a number of other possible specifications for <dependency_list>. For full details, please consult the official Slurm documentation.

Node features#

Some cluster nodes have specific properties that are used to distinguish them from others in the same partition. One example is the different CPU generations in Woody. Nodes with a specific feature are requested with the Slurm option -C or --constraint for salloc or sbatch.

Multiple features can be combined with AND, OR, etc. For more details, please refer to the Slurm documentation.
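
As a sketch (the feature names below are placeholders; the actual names are listed in the cluster documentation or in the output of sinfo), a single feature or a combination of features can be requested like this:

$ sbatch --constraint=icelake job_script
$ sbatch --constraint="icelake|cascadelake" job_script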

Job priorities#

The batch system assigns a priority to each waiting job. This priority value depends on certain parameters like waiting time, partition, user group, and recently used CPU time (a.k.a. fairshare). The ordering of waiting jobs listed by squeue does not reflect the priority of jobs.

If your job is not starting straight away, the reason is listed in the output of squeue in the column NODELIST(REASON).
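
A sketch for listing your own jobs together with the reason column (the format string is only an example and can be adjusted):

$ squeue -u $USER --format="%.12i %.12P %.20j %.8T %.12M %R"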

Some of the most common reasons:

  • Priority: One or more higher priority jobs are queued.
  • Dependency: This job is waiting for a dependent job to complete.
  • Resources: The job is waiting for resources to become available.
  • AssociationGroup<Resource>Limit: All resources assigned to your association/group are currently in use. (1)
  • QOSGrp<Resource>Limit: All resources assigned to the specified QoS are currently in use. (1)
  • Partition<Resource>Limit: All resources assigned to the specified partition are currently in use.
  • ReqNodeNotAvail: A node specifically required by the job is not currently available. It may be necessary to cancel the job and rerun with other node specifications.

(1) <Resource> can be a limit on any resource type (number of GRES (GPUs), nodes, CPUs, or concurrently running/queued jobs); e.g., AssocGrpGRES means that all GPUs assigned to your association/group are currently in use.

Exclusive jobs for benchmarking#

On some HPC clusters, compute nodes are shared among multiple users and jobs. Resources like GPUs and compute cores, however, are never shared between jobs. In some cases, e.g. for benchmarking, exclusive access to the compute node can be desired. This can be achieved by using the Slurm parameter --exclusive.

Setting --exclusive only makes sure that there will be no other jobs running on your nodes. It will not automatically give you access to all resources of the node without explicitly requesting them. This means that if you need them, you still have to request them explicitly, e.g., the number of GPUs via --gres and all cores of a node via --ntasks or --cpus-per-task.
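
A sketch of an exclusive benchmark submission (the core and GPU counts are placeholders and have to match the actual node configuration of the target cluster):

$ sbatch --exclusive --ntasks=72 --gres=gpu:4 job_script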

Independent of your actual resource allocation and usage, exclusive jobs are billed for all available resources of the node.

Specific clock frequency#

By default, the compute nodes at NHR@FAU run with turbo mode enabled and the "OnDemand" governor. If you need a fixed CPU frequency, e.g. for benchmarking, it can be set via Slurm. Note that you cannot make the CPUs run any faster, only slower, since turbo mode is already the default. likwid-setFrequencies is not supported on the clusters.

The frequency (in kHz) can be specified using the --cpu-freq option of srun:

$ srun --cpu-freq=1800000-1800000:performance <more-srun-options> ./a.out <arguments>

You can also use the --cpu-freq option with salloc or sbatch; however, the frequency will only be set once srun is called!
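
A minimal sketch of setting the frequency in a batch script (the frequency value is a placeholder; as noted above, it only takes effect when srun launches the application):

#!/bin/bash -l
#SBATCH --cpu-freq=1800000-1800000:performance

# The fixed frequency is applied to the CPUs when srun starts the application.
srun ./a.out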

Access to hardware performance counters#

Access to hardware performance counters (e.g. for using likwid-perfctr) is not enabled by default. The Slurm option --constraint=hwperf has to be used to enable it. On clusters with non-exclusive nodes, the additional option --exclusive is necessary.
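
For example, a measurement run with likwid-perfctr could be requested like this (a sketch; the performance group and core list are placeholders):

$ srun --constraint=hwperf --exclusive <more-srun-options> likwid-perfctr -g MEM -C 0-7 ./a.out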

Enabling access to hardware performance counters will automatically disable these metrics in the Job monitoring system.