LIKWID
The LIKWID library and tool suite is developed in-house at NHR@FAU. It is available as a module on all NHR@FAU and HPC4FAU systems. This page is only an overview; for the full documentation, check the LIKWID wiki.
To use hardware performance counters on a compute node with any tool, specify --constraint=hwperf at job submission with SLURM. This enables the tool to access the counters.
All modules provide the Fortran90 interface for the MarkerAPI. Support for accelerators is not enabled. The full feature set is only enabled on the Testcluster.
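A minimal Slurm batch script that requests counter access and loads LIKWID could look like this (a sketch; the module name likwid and the resource settings are assumptions, adjust them to your job):
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --time=00:30:00
#SBATCH --constraint=hwperf   # allow tools to access the hardware performance counters

module load likwid            # assumed module name, check with: module avail likwid

likwid-perfctr -C N:0-3 -g FLOPS_DP ./a.out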
likwid-topology
Get an overview of the system topology of a node.
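The listing below comes from a plain call; the -g and -c switches add an ASCII topology graph and detailed cache information (a short sketch, see likwid-topology -h for all options):
$ likwid-topology        # textual overview (output shown below)
$ likwid-topology -g     # additionally print an ASCII graph of the node topology
$ likwid-topology -c     # additionally print detailed cache properties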
System topology of a Fritz node
--------------------------------------------------------------------------------
CPU name: Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz
CPU type: Intel Icelake SP processor
CPU stepping: 6
********************************************************************************
Hardware Thread Topology
********************************************************************************
Sockets: 2
Cores per socket: 36
Threads per core: 1
--------------------------------------------------------------------------------
HWThread Thread Core Die Socket Available
0 0 0 0 0 *
1 0 1 0 0 *
2 0 2 0 0 *
3 0 3 0 0 *
4 0 4 0 0 *
5 0 5 0 0 *
6 0 6 0 0 *
7 0 7 0 0 *
8 0 8 0 0 *
9 0 9 0 0 *
10 0 10 0 0 *
11 0 11 0 0 *
12 0 12 0 0 *
13 0 13 0 0 *
14 0 14 0 0 *
15 0 15 0 0 *
16 0 16 0 0 *
17 0 17 0 0 *
18 0 18 0 0 *
19 0 19 0 0 *
20 0 20 0 0 *
21 0 21 0 0 *
22 0 22 0 0 *
23 0 23 0 0 *
24 0 24 0 0 *
25 0 25 0 0 *
26 0 26 0 0 *
27 0 27 0 0 *
28 0 28 0 0 *
29 0 29 0 0 *
30 0 30 0 0 *
31 0 31 0 0 *
32 0 32 0 0 *
33 0 33 0 0 *
34 0 34 0 0 *
35 0 35 0 0 *
36 0 36 0 1 *
37 0 37 0 1 *
38 0 38 0 1 *
39 0 39 0 1 *
40 0 40 0 1 *
41 0 41 0 1 *
42 0 42 0 1 *
43 0 43 0 1 *
44 0 44 0 1 *
45 0 45 0 1 *
46 0 46 0 1 *
47 0 47 0 1 *
48 0 48 0 1 *
49 0 49 0 1 *
50 0 50 0 1 *
51 0 51 0 1 *
52 0 52 0 1 *
53 0 53 0 1 *
54 0 54 0 1 *
55 0 55 0 1 *
56 0 56 0 1 *
57 0 57 0 1 *
58 0 58 0 1 *
59 0 59 0 1 *
60 0 60 0 1 *
61 0 61 0 1 *
62 0 62 0 1 *
63 0 63 0 1 *
64 0 64 0 1 *
65 0 65 0 1 *
66 0 66 0 1 *
67 0 67 0 1 *
68 0 68 0 1 *
69 0 69 0 1 *
70 0 70 0 1 *
71 0 71 0 1 *
--------------------------------------------------------------------------------
Socket 0: ( 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 )
Socket 1: ( 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 )
--------------------------------------------------------------------------------
********************************************************************************
Cache Topology
********************************************************************************
Level: 1
Size: 48 kB
Cache groups: ( 0 ) ( 1 ) ( 2 ) ( 3 ) ( 4 ) ( 5 ) ( 6 ) ( 7 ) ( 8 ) ( 9 ) ( 10 ) ( 11 ) ( 12 ) ( 13 ) ( 14 ) ( 15 ) ( 16 ) ( 17 ) ( 18 ) ( 19 ) ( 20 ) ( 21 ) ( 22 ) ( 23 ) ( 24 ) ( 25 ) ( 26 ) ( 27 ) ( 28 ) ( 29 ) ( 30 ) ( 31 ) ( 32 ) ( 33 ) ( 34 ) ( 35 ) ( 36 ) ( 37 ) ( 38 ) ( 39 ) ( 40 ) ( 41 ) ( 42 ) ( 43 ) ( 44 ) ( 45 ) ( 46 ) ( 47 ) ( 48 ) ( 49 ) ( 50 ) ( 51 ) ( 52 ) ( 53 ) ( 54 ) ( 55 ) ( 56 ) ( 57 ) ( 58 ) ( 59 ) ( 60 ) ( 61 ) ( 62 ) ( 63 ) ( 64 ) ( 65 ) ( 66 ) ( 67 ) ( 68 ) ( 69 ) ( 70 ) ( 71 )
--------------------------------------------------------------------------------
Level: 2
Size: 1.25 MB
Cache groups: ( 0 ) ( 1 ) ( 2 ) ( 3 ) ( 4 ) ( 5 ) ( 6 ) ( 7 ) ( 8 ) ( 9 ) ( 10 ) ( 11 ) ( 12 ) ( 13 ) ( 14 ) ( 15 ) ( 16 ) ( 17 ) ( 18 ) ( 19 ) ( 20 ) ( 21 ) ( 22 ) ( 23 ) ( 24 ) ( 25 ) ( 26 ) ( 27 ) ( 28 ) ( 29 ) ( 30 ) ( 31 ) ( 32 ) ( 33 ) ( 34 ) ( 35 ) ( 36 ) ( 37 ) ( 38 ) ( 39 ) ( 40 ) ( 41 ) ( 42 ) ( 43 ) ( 44 ) ( 45 ) ( 46 ) ( 47 ) ( 48 ) ( 49 ) ( 50 ) ( 51 ) ( 52 ) ( 53 ) ( 54 ) ( 55 ) ( 56 ) ( 57 ) ( 58 ) ( 59 ) ( 60 ) ( 61 ) ( 62 ) ( 63 ) ( 64 ) ( 65 ) ( 66 ) ( 67 ) ( 68 ) ( 69 ) ( 70 ) ( 71 )
--------------------------------------------------------------------------------
Level: 3
Size: 54 MB
Cache groups: ( 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 ) ( 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 )
--------------------------------------------------------------------------------
********************************************************************************
NUMA Topology
********************************************************************************
NUMA domains: 4
--------------------------------------------------------------------------------
Domain: 0
Processors: ( 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 )
Distances: 10 11 20 20
Free memory: 762.941 MB
Total memory: 128645 MB
--------------------------------------------------------------------------------
Domain: 1
Processors: ( 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 )
Distances: 11 10 20 20
Free memory: 163.5 MB
Total memory: 128976 MB
--------------------------------------------------------------------------------
Domain: 2
Processors: ( 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 )
Distances: 20 20 10 11
Free memory: 482.926 MB
Total memory: 129019 MB
--------------------------------------------------------------------------------
Domain: 3
Processors: ( 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 )
Distances: 20 20 11 10
Free memory: 235.504 MB
Total memory: 129017 MB
--------------------------------------------------------------------------------
likwid-pin
To control the process/thread affinity of multi-threaded applications, you can use likwid-pin. It is recommended to use the affinity domains provided by LIKWID.
Affinity domain
An affinity domain is a group of hardware threads that share a topological entity like a CPU socket, the same last-level cache segment, or a NUMA domain. likwid-pin -p lists all available affinity domains together with their hardware threads.
List affinity domains with likwid-pin -p on a Fritz compute node
Domain N:
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71
Domain S0:
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35
Domain S1:
36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71
Domain D0:
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35
Domain D1:
36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71
Domain C0:
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35
Domain C1:
36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71
Domain M0:
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
Domain M1:
18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35
Domain M2:
36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53
Domain M3:
54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71
Example calls:
Running an application pinned with likwid-pin
$ likwid-pin -c N:0-3 ./a.out
[... application output ...]
[... as soon as threads are started by the application ...]
[pthread wrapper]
[pthread wrapper] MAIN -> 0
[pthread wrapper] PIN_MASK: 0->1 1->2 2->3
[pthread wrapper] SKIP MASK: 0x0
threadid 22444399257472 -> hwthread 1 - OK
threadid 22444395055104 -> hwthread 2 - OK
threadid 22444390852736 -> hwthread 3 - OK
[... remaining application output ...]
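A few more pinning variants using the affinity domains listed above (a sketch; ./a.out stands for your threaded binary):
$ likwid-pin -c S0:0-17 ./a.out          # first 18 hardware threads of socket 0
$ likwid-pin -c M0:0-3 ./a.out           # first 4 hardware threads of NUMA domain 0
$ likwid-pin -c M0:0-3@M1:0-3 ./a.out    # combine two domains with '@'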
likwid-perfctr
With likwid-perfctr, you can assess the performance and other interesting metrics of your whole code or of selected code regions.
Performance groups
To combine an eventset with derived metrics and documentation, LIKWID provides so-called performance groups. You can get a list of all performance groups with likwid-perfctr -a.
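For example, to list the available groups and to see which events and metrics a particular group contains (MEM_DP is just an example group name; availability depends on the architecture):
$ likwid-perfctr -a             # list all performance groups for this machine
$ likwid-perfctr -g MEM_DP -H   # print the help text of a group: eventset, derived metrics, notes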
Measure double-precision flop rate for an application with likwid-perfctr
$ likwid-perfctr -C S1:0-3 -g FLOPS_DP Work/STREAM/stream_c.exe
--------------------------------------------------------------------------------
CPU name: Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz
CPU type: Intel Icelake SP processor
CPU clock: 2.39 GHz
--------------------------------------------------------------------------------
[... application output ...]
--------------------------------------------------------------------------------
Group 1: FLOPS_DP
+------------------------------------------+---------+--------------+--------------+--------------+--------------+
| Event | Counter | HWThread 36 | HWThread 37 | HWThread 38 | HWThread 39 |
+------------------------------------------+---------+--------------+--------------+--------------+--------------+
| INSTR_RETIRED_ANY | FIXC0 | 7856902947 | 7580792165 | 7499456536 | 7396211499 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 22162325013 | 22085846433 | 21647941053 | 21854674624 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 16484726784 | 16419237120 | 16422280224 | 16427031936 |
| TOPDOWN_SLOTS | FIXC3 | 110811625065 | 110429232165 | 108239705265 | 109273373120 |
| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE | PMC0 | 1968750016 | 1758750000 | 1758750000 | 1758750000 |
| FP_ARITH_INST_RETIRED_SCALAR_DOUBLE | PMC1 | 4891 | 8 | 8 | 12 |
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE | PMC2 | 0 | 0 | 0 | 0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE | PMC3 | 0 | 0 | 0 | 0 |
+------------------------------------------+---------+--------------+--------------+--------------+--------------+
+-----------------------------------------------+---------+--------------+--------------+--------------+--------------+
| Event | Counter | Sum | Min | Max | Avg |
+-----------------------------------------------+---------+--------------+--------------+--------------+--------------+
| INSTR_RETIRED_ANY STAT | FIXC0 | 30333363147 | 7396211499 | 7856902947 | 7.583341e+09 |
| CPU_CLK_UNHALTED_CORE STAT | FIXC1 | 87750787123 | 21647941053 | 22162325013 | 2.193770e+10 |
| CPU_CLK_UNHALTED_REF STAT | FIXC2 | 65753276064 | 16419237120 | 16484726784 | 16438319016 |
| TOPDOWN_SLOTS STAT | FIXC3 | 438753935615 | 108239705265 | 110811625065 | 1.096885e+11 |
| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE STAT | PMC0 | 7245000016 | 1758750000 | 1968750016 | 1811250004 |
| FP_ARITH_INST_RETIRED_SCALAR_DOUBLE STAT | PMC1 | 4919 | 8 | 4891 | 1229.7500 |
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE STAT | PMC2 | 0 | 0 | 0 | 0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE STAT | PMC3 | 0 | 0 | 0 | 0 |
+-----------------------------------------------+---------+--------------+--------------+--------------+--------------+
+-------------------------+-------------+--------------+--------------+--------------+
| Metric | HWThread 36 | HWThread 37 | HWThread 38 | HWThread 39 |
+-------------------------+-------------+--------------+--------------+--------------+
| Runtime (RDTSC) [s] | 7.0475 | 7.0475 | 7.0475 | 7.0475 |
| Runtime unhalted [s] | 9.2562 | 9.2243 | 9.0414 | 9.1277 |
| Clock [MHz] | 3218.9646 | 3220.6514 | 3156.2092 | 3185.4287 |
| CPI | 2.8207 | 2.9134 | 2.8866 | 2.9548 |
| DP [MFLOP/s] | 558.7098 | 499.1134 | 499.1134 | 499.1134 |
| AVX DP [MFLOP/s] | 0 | 0 | 0 | 0 |
| AVX512 DP [MFLOP/s] | 0 | 0 | 0 | 0 |
| Packed [MUOPS/s] | 279.3545 | 249.5567 | 249.5567 | 249.5567 |
| Scalar [MUOPS/s] | 0.0007 | 1.135155e-06 | 1.135155e-06 | 1.702732e-06 |
| Vectorization ratio [%] | 99.9998 | 100.0000 | 100.0000 | 100.0000 |
+-------------------------+-------------+--------------+--------------+--------------+
+------------------------------+------------+--------------+-----------+-----------+
| Metric | Sum | Min | Max | Avg |
+------------------------------+------------+--------------+-----------+-----------+
| Runtime (RDTSC) [s] STAT | 28.1900 | 7.0475 | 7.0475 | 7.0475 |
| Runtime unhalted [s] STAT | 36.6496 | 9.0414 | 9.2562 | 9.1624 |
| Clock [MHz] STAT | 12781.2539 | 3156.2092 | 3220.6514 | 3195.3135 |
| CPI STAT | 11.5755 | 2.8207 | 2.9548 | 2.8939 |
| DP [MFLOP/s] STAT | 2056.0500 | 499.1134 | 558.7098 | 514.0125 |
| AVX DP [MFLOP/s] STAT | 0 | 0 | 0 | 0 |
| AVX512 DP [MFLOP/s] STAT | 0 | 0 | 0 | 0 |
| Packed [MUOPS/s] STAT | 1028.0246 | 249.5567 | 279.3545 | 257.0061 |
| Scalar [MUOPS/s] STAT | 0.0007 | 1.135155e-06 | 0.0007 | 0.0002 |
| Vectorization ratio [%] STAT | 399.9998 | 99.9998 | 100 | 99.9999 |
+------------------------------+------------+--------------+-----------+-----------+
Measure double-precision flop rate for a code region in an application with likwid-perfctr
This example measures a region named triad.
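Region measurements require the code to be instrumented with the Marker API (LIKWID_MARKER_INIT, LIKWID_MARKER_START("triad"), LIKWID_MARKER_STOP("triad"), LIKWID_MARKER_CLOSE, or the Fortran90 equivalents) and likwid-perfctr to be run with the additional -m switch. A hedged build sketch; the compiler, source file name, and the LIKWID_INC/LIKWID_LIB variables are assumptions, check the loaded likwid module for the actual paths:
$ module load likwid
$ icx -O3 -DLIKWID_PERFMON -I$LIKWID_INC stream_c.c -L$LIKWID_LIB -llikwid -o stream_c.exe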
$ likwid-perfctr -C S1:0-3 -g FLOPS_DP -m Work/STREAM/stream_c.exe
--------------------------------------------------------------------------------
CPU name: Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz
CPU type: Intel Icelake SP processor
CPU clock: 2.39 GHz
--------------------------------------------------------------------------------
[... application output ...]
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Region triad, Group 1: FLOPS_DP
+-------------------+-------------+-------------+-------------+-------------+
| Region Info | HWThread 36 | HWThread 37 | HWThread 38 | HWThread 39 |
+-------------------+-------------+-------------+-------------+-------------+
| RDTSC Runtime [s] | 1.913508 | 1.914513 | 1.914566 | 1.914511 |
| call count | 50 | 50 | 50 | 50 |
+-------------------+-------------+-------------+-------------+-------------+
+------------------------------------------+---------+-------------+-------------+-------------+-------------+
| Event | Counter | HWThread 36 | HWThread 37 | HWThread 38 | HWThread 39 |
+------------------------------------------+---------+-------------+-------------+-------------+-------------+
| INSTR_RETIRED_ANY | FIXC0 | 2195740000 | 2193079000 | 2233596000 | 2213943000 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 5427263000 | 5476660000 | 5519122000 | 5519362000 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 4561411000 | 4567944000 | 4567976000 | 4568180000 |
| TOPDOWN_SLOTS | FIXC3 | 27136320000 | 27383300000 | 27595610000 | 27596810000 |
| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE | PMC0 | 875000000 | 875000000 | 875000000 | 875000000 |
| FP_ARITH_INST_RETIRED_SCALAR_DOUBLE | PMC1 | 600 | 600 | 600 | 600 |
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE | PMC2 | 0 | 0 | 0 | 0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE | PMC3 | 0 | 0 | 0 | 0 |
+------------------------------------------+---------+-------------+-------------+-------------+-------------+
+-----------------------------------------------+---------+--------------+-------------+-------------+-------------+
| Event | Counter | Sum | Min | Max | Avg |
+-----------------------------------------------+---------+--------------+-------------+-------------+-------------+
| INSTR_RETIRED_ANY STAT | FIXC0 | 8836358000 | 2193079000 | 2233596000 | 2209089500 |
| CPU_CLK_UNHALTED_CORE STAT | FIXC1 | 21942407000 | 5427263000 | 5519362000 | 5485601750 |
| CPU_CLK_UNHALTED_REF STAT | FIXC2 | 18265511000 | 4561411000 | 4568180000 | 4566377750 |
| TOPDOWN_SLOTS STAT | FIXC3 | 109712040000 | 27136320000 | 27596810000 | 27428010000 |
| FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE STAT | PMC0 | 3500000000 | 875000000 | 875000000 | 875000000 |
| FP_ARITH_INST_RETIRED_SCALAR_DOUBLE STAT | PMC1 | 2400 | 600 | 600 | 600 |
| FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE STAT | PMC2 | 0 | 0 | 0 | 0 |
| FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE STAT | PMC3 | 0 | 0 | 0 | 0 |
+-----------------------------------------------+---------+--------------+-------------+-------------+-------------+
+-------------------------+-------------+-------------+-------------+-------------+
| Metric | HWThread 36 | HWThread 37 | HWThread 38 | HWThread 39 |
+-------------------------+-------------+-------------+-------------+-------------+
| Runtime (RDTSC) [s] | 1.9135 | 1.9145 | 1.9146 | 1.9145 |
| Runtime unhalted [s] | 2.2667 | 2.2873 | 2.3051 | 2.3052 |
| Clock [MHz] | 2848.8314 | 2870.6490 | 2892.8856 | 2892.8822 |
| CPI | 2.4717 | 2.4972 | 2.4710 | 2.4930 |
| DP [MFLOP/s] | 914.5510 | 914.0709 | 914.0456 | 914.0718 |
| AVX DP [MFLOP/s] | 0 | 0 | 0 | 0 |
| AVX512 DP [MFLOP/s] | 0 | 0 | 0 | 0 |
| Packed [MUOPS/s] | 457.2753 | 457.0353 | 457.0226 | 457.0358 |
| Scalar [MUOPS/s] | 0.0003 | 0.0003 | 0.0003 | 0.0003 |
| Vectorization ratio [%] | 99.9999 | 99.9999 | 99.9999 | 99.9999 |
+-------------------------+-------------+-------------+-------------+-------------+
+------------------------------+------------+-----------+-----------+-----------+
| Metric | Sum | Min | Max | Avg |
+------------------------------+------------+-----------+-----------+-----------+
| Runtime (RDTSC) [s] STAT | 7.6571 | 1.9135 | 1.9146 | 1.9143 |
| Runtime unhalted [s] STAT | 9.1643 | 2.2667 | 2.3052 | 2.2911 |
| Clock [MHz] STAT | 11505.2482 | 2848.8314 | 2892.8856 | 2876.3120 |
| CPI STAT | 9.9329 | 2.4710 | 2.4972 | 2.4832 |
| DP [MFLOP/s] STAT | 3656.7393 | 914.0456 | 914.5510 | 914.1848 |
| AVX DP [MFLOP/s] STAT | 0 | 0 | 0 | 0 |
| AVX512 DP [MFLOP/s] STAT | 0 | 0 | 0 | 0 |
| Packed [MUOPS/s] STAT | 1828.3690 | 457.0226 | 457.2753 | 457.0923 |
| Scalar [MUOPS/s] STAT | 0.0012 | 0.0003 | 0.0003 | 0.0003 |
| Vectorization ratio [%] STAT | 399.9996 | 99.9999 | 99.9999 | 99.9999 |
+------------------------------+------------+-----------+-----------+-----------+
likwid-mpirun
With likwid-mpirun, you can pin MPI+X applications and assess the performance and other interesting metrics of your whole code or of selected code regions.
It is recommended to use likwid-mpirun in interactive jobs and to disable all other affinity mechanisms.
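In an interactive job you might, for example, switch off the most common competing pinning mechanisms before launching (a sketch; which variables matter depends on the OpenMP runtime and MPI library in use):
$ unset OMP_PLACES OMP_PROC_BIND   # let likwid-mpirun handle the OpenMP pinning
$ export I_MPI_PIN=off             # disable Intel MPI's own process pinning
$ export SLURM_CPU_BIND=none       # avoid additional binding by srun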
Although likwid-mpirun tries to detect the current environment (inside a SLURM job, etc.), it is recommended to specify the MPI starter explicitly with --mpi:
- SLURM: --mpi=slurm (uses srun, recommended on NHR@FAU systems)
- OpenMPI: --mpi=openmpi (uses Open MPI's mpirun)
- Intel MPI: --mpi=intelmpi (uses mpiexec.hydra or mpiexec)
Pinning of an MPI-only application with likwid-mpirun
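A pure MPI launch looks like the MPI+X example below but without the -t option; a minimal sketch (output omitted, the -nperdomain option is optional and ./a.out is a placeholder):
$ salloc -N 2 --exclusive -t 01:00:00
$ likwid-mpirun --mpi slurm -np 4 -nperdomain S:2 ./a.out   # 2 ranks per socket domain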
Pinning of an MPI+X application with likwid-mpirun
$ salloc -N 2 --exclusive -t 01:00:00
$ likwid-mpirun --mpi slurm -np 2 -t 2 ./a.out
Process with rank 0 running on Node f1255.nhr.fau.de core 0/0 with pid 366207
Process with rank 1 running on Node f1256.nhr.fau.de core 0/0 with pid 316748
Rank 1 Thread 0 running on Node f1256.nhr.fau.de core 0/0 with pid 316748 and tid 316748
Rank 1 Thread 1 running on Node f1256.nhr.fau.de core 1/0 with pid 316748 and tid 316767
Rank 0 Thread 0 running on Node f1255.nhr.fau.de core 0/0 with pid 366207 and tid 366207
Rank 0 Thread 1 running on Node f1255.nhr.fau.de core 1/0 with pid 366207 and tid 366229
Pin & measure an MPI+X application with likwid-mpirun
$ salloc -N 2 --exclusive -t 01:00:00
$ likwid-mpirun --mpi slurm -np 2 -t 2 -g L2 ./a.out
+--------------------------------+-----------+-----------+-----------+-----------+
| Metric | f1255:0:0 | f1255:0:1 | f1256:1:0 | f1256:1:1 |
+--------------------------------+-----------+-----------+-----------+-----------+
| Runtime (RDTSC) [s] | 3.8216 | 3.8216 | 3.6860 | 3.6860 |
| Runtime unhalted [s] | 1.0284 | 0.0163 | 0.0436 | 0.0066 |
| Clock [MHz] | 2327.8493 | 831.3929 | 1297.9806 | 867.2679 |
| CPI | 0.2681 | 2.1271 | 0.6677 | 4.5594 |
| L2D load bandwidth [MBytes/s] | 54.1332 | 3.5695 | 55.7258 | 0.3442 |
| L2D load data volume [GBytes] | 0.2069 | 0.0136 | 0.2054 | 0.0013 |
| L2D evict bandwidth [MBytes/s] | 9.0930 | 1.8510 | 8.9326 | 0.2003 |
| L2D evict data volume [GBytes] | 0.0347 | 0.0071 | 0.0329 | 0.0007 |
| L2 bandwidth [MBytes/s] | 88.0416 | 8.3671 | 91.8875 | 1.7409 |
| L2 data volume [GBytes] | 0.3365 | 0.0320 | 0.3387 | 0.0064 |
+--------------------------------+-----------+-----------+-----------+-----------+
If you encounter problems with likwid-mpirun, please contact hpc-support@fau.de and attach the output of your command with the additional -d command-line switch.
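For example (capturing the output in a file for the support request is just a suggestion):
$ likwid-mpirun --mpi slurm -np 2 -t 2 -g L2 -d ./a.out 2>&1 | tee likwid-mpirun-debug.log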