# Quantum Espresso
Quantum Espresso is an integrated suite of open-source computer codes for electronic-structure calculations and materials modeling at the nanoscale. It is based on density-functional theory, plane waves, and pseudopotentials.
## Availability / Target HPC systems
- parallel computers: main target machines
- throughput cluster Woody: might be useful for small systems and for manually distributed phonon calculations
## Notes on parallelization in general
- please note that QE has five command-line arguments that can be provided to the binary at run time: `-nimage`, `-npools`, `-nband`, `-ntg`, `-ndiag` (the shorthands, respectively: `-ni`, `-nk`, `-nb`, `-nt`, `-nd`). They can influence the run time considerably; see the example command line after this list.
- try to stick to one k-point per node
- do not use Hyperthreading (disabled on most systems of NHR@FAU anyway)
- use image parallelization, e.g. for NEB / phonon calculations, via `-ni`
- ask for help with the parallelization of phonon calculations
- use the gamma-point version (`K_POINTS gamma`) instead of `K_POINTS automatic`
- k-point parallelization:
  - 1 k-point per node, e.g. `-nk #nnodes`
  - `-nk` must be a divisor of the number of MPI tasks
- `-nd` for #bands > 500
- `-nt 2, 5, 10` as a last resort only, and only if nr3 < #MPI tasks (nr3 is the third dimension of the FFT mesh)
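For illustration only (the node and task counts below are assumptions, not a recommendation for your system), the parallelization flags are appended directly to the `pw.x` command line, e.g. for a hypothetical 4-node run with 72 MPI tasks per node:

```bash
# hypothetical example: 4 nodes x 72 MPI tasks per node = 288 tasks in total
# -nk 4  : one k-point pool per node (4 is a divisor of 288)
# -nd 16 : 16 MPI tasks (a 4x4 grid) for the ScaLAPACK diagonalization
srun pw.x -nk 4 -nd 16 -i input.in > output_filename
```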
## Sample job scripts
### MPI job (single-node) on Fritz
```bash
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=72
#SBATCH --partition=singlenode
#SBATCH --time=01:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV

module load qe/7.1

# pure MPI run: use the requested cpus-per-task if set, otherwise 1 OpenMP thread per task
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"

srun pw.x -i input.in > output_filename
```
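Assuming the script above has been saved as `job_qe_single.sh` (the file name is just an example), it is submitted with `sbatch`:

```bash
sbatch job_qe_single.sh
```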
### Hybrid OpenMP/MPI job (multi-node) on Fritz
```bash
#!/bin/bash -l
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=18
#SBATCH --partition=multinode
#SBATCH --time=01:00:00
#SBATCH --export=NONE
unset SLURM_EXPORT_ENV

module load qe/7.1

# set number of threads to requested cpus-per-task
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
export SRUN_CPUS_PER_TASK=$SLURM_CPUS_PER_TASK
export OMP_PLACES=cores
export OMP_PROC_BIND=true

srun pw.x -i input.in > output_filename
```
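This layout runs 4 MPI tasks with 18 OpenMP threads each, i.e. 72 cores per Fritz node and 576 cores in total on 8 nodes. If you want to check the resulting pinning, one option (not part of the script above) is to let Slurm report the CPU binding of every task:

```bash
# optional: same srun call, but with Slurm printing the CPU mask of each task
srun --cpu-bind=verbose pw.x -i input.in > output_filename
```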
## Performance tests for Quantum Espresso 7.1 on Fritz
We performed the calculations using the binary from the module qe/7.1 for the ground-state structure of sodium chloride, namely rock salt, downloaded from The Materials Project. All wave-function optimizations of our single-point runs converged in 14 iterations without enforcing the number of SCF iterations. The calculations were performed with the PBE exchange-correlation functional and PAW datasets (downloaded from PseudoDojo) with nine valence electrons for sodium and seven for chlorine.
- System:
  - Single-point calculations
  - Supercell containing 512 atoms
  - Gamma-point k-points
  - `ecutwfc = 36.0`, `ecutrho = 144.0`, `conv_thr = 1.0d-11`, `mixing_beta = 0.7`
- None of the performance-related arguments mentioned at the top of this page was used, so the program makes its own choices, which may not be optimal. For example, QE's default for our system was the ScaLAPACK distributed-memory diagonalization with a sub-group of 8x8 processes, which is not an optimal setup. In our benchmark we compare the relative run time for different combinations of MPI processes and OpenMP threads; a perfect choice of the QE performance parameters would be system dependent and is a complicated task in itself, so we did not tune them. Nevertheless, we encourage users to tune the five parameters in a production run, in particular for a computationally demanding run or a large set of similar small-scale individual runs. Please note that the following graph should therefore be read as the qualitative behavior of the parallel performance of QE.
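For reference, the sketch below shows how the settings quoted above could be placed in a pw.x input file. This is not the benchmark input itself: the cell, the 512 atomic positions, the species/pseudopotential data and further `&CONTROL`/`&SYSTEM` entries are omitted, and only the parameters listed above are taken from this section.

```bash
# Sketch only: write a pw.x input fragment containing the quoted convergence settings.
# The structural data (cell, atomic positions, ATOMIC_SPECIES with the PseudoDojo PAW
# files) and required &SYSTEM entries such as ibrav/nat/ntyp still have to be added.
cat > input.in << 'EOF'
&CONTROL
  calculation = 'scf'
/
&SYSTEM
  ecutwfc = 36.0
  ecutrho = 144.0
/
&ELECTRONS
  conv_thr    = 1.0d-11
  mixing_beta = 0.7
/
K_POINTS gamma
EOF
```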
Per-node speedup is defined as the reference time divided by the product of the run time and the number of nodes used in each calculation, i.e. Tref / (T * nodes), where Tref is the time of the calculation on one node with MPI only.
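As a purely illustrative example (the numbers are made up): if the one-node MPI-only reference takes Tref = 1200 s and a run on 4 nodes takes T = 320 s, the per-node speedup of the 4-node run is 1200 / (320 * 4) ≈ 0.94.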