Continuous Integration / GitLab Cx#
NHR@FAU provides continuous integration for HPC-related software projects developed on one of the GitLab instances at RRZE (gitlab.rrze.fau.de or gitos.rrze.fau.de).
Access to the GitLab Runner is restricted. Moreover, every job on the HPC systems has to be associated with an HPC user account.
The Cx jobs run on the Testcluster provided by HPC4FAU and NHR@FAU.
A Hands-On talk on Cx was given at the HPC-Cafe (October 19, 2021): Slides & Additional Slides
Note
With the license downgrade of gitlab.rrze.fau.de on June 22, 2022, the pull mirroring feature is disabled. If you need synchronization with external repositories (e.g. GitHub) for CI, see here.
Prerequisites:#
- Valid HPC account at NHR@FAU
- Create an SSH key pair for authentication of the GitLab Runner. We recommend creating a separate SSH key pair without passphrase for GitLab CI only (see the example command after this list).
- Request Cx usage by mail at the HPC user support, providing:
  - your HPC account name
  - the URL to the repository
  - the public key
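A key pair without passphrase can be created, for example, like this (the file name and comment are only suggestions):
$ ssh-keygen -t ed25519 -N "" -C "gitlab-ci" -f ~/.ssh/gitlab_ci_key
The content of ~/.ssh/gitlab_ci_key.pub is the public key to include in the mail; the private key is used later for the AUTH_KEY variable.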
 
Preparing GitLab repositories:#
- Configure SSH authentication for the HPC Cx service. In the repository go to Settings -> CI/CD -> Variables and add two variables:
  - AUTH_USER: The name of your HPC account. No newline.
  - AUTH_KEY: The whole content of the private SSH key file, including the header (-----BEGIN ... -----) and trailer (-----END ... -----) lines. The key is not shown in the logs but is visible to all maintainers of the project!
 
- Enable the HPC runner for the repository at Settings -> CI/CD -> Runners and flip the switch at Enable shared runners for this project. The HPC Runner has the testcluster tag.
Define jobs using the HPC Cx service#
Jobs for CI/CD in GitLab are defined in the file .gitlab-ci.yml in the
top level of the repository. In order to run on the HPC system, the jobs
need the tag testcluster. The tag tells the system on which runner the
job can be executed.
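A minimal .gitlab-ci.yml with a single job that runs on the Testcluster could look like this (the build command is only a placeholder):
build:
    script:
        - make
    tags:
        - testcluster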
To define where and how the job is run, the following variables are available:
| Variable | Value | Changeable | Description |
|---|---|---|---|
| SLURM_PARTITION | work | No | Specify the set of nodes which should be used for the job. We currently allow Cx jobs only in the work partition. |
| SLURM_NODES | 1 | No | Only single-node jobs are allowed at the moment. |
| SLURM_TIMELIMIT | 120 | Yes (values 1-120 allowed) | Specify the maximal runtime of a job in minutes. |
| SLURM_NODELIST | broadep2 | Yes, to any hostname in the system, see here | Specify the host for the job. |
You only need to specify a host in SLURM_NODELIST if you want to test
different architecture-specific build options or optimizations.
To change one of the settings for all jobs, SLURM options can be overwritten globally in the variables section:
variables:
    SLURM_TIMELIMIT: 60
    SLURM_NODELIST: rome1
job1:
    [...]
    tags:
        - testcluster
job2:
    [...]
    tags:
        - testcluster
The options can also be specified for each job individually. This will overwrite the global settings.
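For example, a single job can override the global time limit with its own variables section (the value 30 is only an illustration):
job1:
    [...]
    variables:
        SLURM_TIMELIMIT: 30
    tags:
        - testcluster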
The Cx system uses the salloc command to submit the jobs to
the batch system. All available environment variables for salloc can be applied here.
An example would be SLURM_MAIL_USER to get notified by the system.
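For example (the mail address is a placeholder):
variables:
    SLURM_MAIL_USER: your.name@fau.de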
Note
If you want to run on the frontend node testfront instead of a compute
node, you can specify the variable NO_SLURM_SUBMIT: 1. This is
commonly not what you want!
It may happen that your CI job fails if the node is occupied with other jobs for more than 24 hours. In that case, simply restart the CI job.
Examples:#
Build on default node with default time limit (120 min.)
stages:
    - build
    - test
build:
    stage: build
    script:
        - export NUM_CORES=$(nproc --all)
        - mkdir $CI_PROJECT_DIR/build
        - cd $CI_PROJECT_DIR/build
        - cmake ..
        - make -j $NUM_CORES
    tags:
        - testcluster
    artifacts:
        paths:
            - build
test:
    stage: test
    variables: 
        SLURM_TIMELIMIT: 30
    script:
        - cd $CI_PROJECT_DIR/build
        - ./test
    tags:
        - testcluster
Build on default node with default time limit, enable LIKWID (hwperf) and run one job on frontend
variables:
    SLURM_CONSTRAINT: "hwperf"
stages:
    - prepare
    - build
    - test
prepare:
    stage: prepare
    script:
        - echo "Preparing on frontend node..."
    variables:
        NO_SLURM_SUBMIT: 1
    tags:
        - testcluster
build:
    stage: build
    script:
        - export NUM_CORES=$(nproc --all)
        - mkdir $CI_PROJECT_DIR/build
        - cd $CI_PROJECT_DIR/build
        - cmake ..
        - make -j $NUM_CORES
    tags:
        - testcluster
    artifacts:
        paths:
            - build
test:
    stage: test
    variables: 
        SLURM_TIMELIMIT: 30
    script:
        - cd $CI_PROJECT_DIR/build
        - ./test
    tags:
        - testcluster
Build and test stage on a specific node and use a custom default time limit
variables:
    SLURM_NODELIST: broadep2
    SLURM_TIMELIMIT: 10
stages:
    - build
    - test
build:
    stage: build
    script:
        - export NUM_CORES=$(nproc --all)
        - mkdir $CI_PROJECT_DIR/build
        - cd $CI_PROJECT_DIR/build
        - cmake ..
        - make -j $NUM_CORES
    tags:
        - testcluster
    artifacts:
        paths:
            - build
test:
    stage: test
    variables: 
        SLURM_TIMELIMIT: 30
    script:
        - cd $CI_PROJECT_DIR/build
        - ./test
    tags:
        - testcluster
Build and benchmark on multiple nodes
stages:
    - build
    - benchmark
.build:
    stage: build
    script:
        - export NUM_CORES=$(nproc --all)
        - mkdir $CI_PROJECT_DIR/build
        - cd $CI_PROJECT_DIR/build
        - cmake ..
        - make -j $NUM_CORES
    tags:
        - testcluster
    variables: 
        SLURM_TIMELIMIT: 10
    artifacts:
        paths:
            - build
.benchmark:
    stage: benchmark
    variables: 
        SLURM_TIMELIMIT: 20
    script:
        - cd $CI_PROJECT_DIR/build
        - ./benchmark
    tags:
        - testcluster
# broadep2
build-broadep2:
    extends: .build
    variables:
        SLURM_NODELIST: broadep2
benchmark-broadep2:
    extends: .benchmark
    dependencies:
        - build-broadep2
    variables:
        SLURM_NODELIST: broadep2
# naples1
build-naples1:
    extends: .build
    variables:
        SLURM_NODELIST: naples1
benchmark-naples1:
    extends: .benchmark
    dependencies:
        - build-naples1
    variables:
        SLURM_NODELIST: naples1
Parent-child pipelines for dynamically creating jobs
In order to create a child pipeline, we have to dynamically create a YAML file that is compatible with the GitLab CI system. The dynamically created file is only valid for the current Cx execution. The YAML file can be created, for example, by a script that is part of the repository, like the .ci/generate_jobs.sh script in the example below. There are different methods to create the YAML file for the child pipeline (multi-line script entry, templated job with variable overrides, ...).
$ cat .ci/generate_jobs.sh
#!/bin/bash -l
# Get list of modules
MODLIST=$(module avail -t intel 2>&1 | grep -E "^intel" | awk '{print $1}')
# Alternative: Get list of idle hosts in the testcluster (requires NO_SLURM_SUBMIT=1)
#HOSTLIST=$(sinfo -t idle -h --partition=work -o "%n %t" | grep "idle" | cut -d ' ' -f 1)
for MOD in ${MODLIST}; do
  MODVER=${MOD/\//-}   # replace '/' in module name with '-' for job name
  cat << EOF
build-$MODVER:
  stage: build
  variables:
    MODULE_NAME: $MOD
  script:
    - module load "\$MODULE_NAME"
    - make
    - ./run_tests
  tags:
    - testcluster
EOF
done
The generator job runs with NO_SLURM_SUBMIT=1 to generate the pipeline definition on the frontend node. In some cases you have to use a specific system (e.g. CUDA modules are only usable on the host medusa); then just use the SLURM_NODELIST variable instead. We store the generated YAML file as an artifact in the generator job and include it as a trigger in the executor job.
If you want to use artifacts in the child pipeline that are created in
the parent pipeline (like differently configured builds), you have to
specify the variable PARENT_PIPELINE_ID=$CI_PIPELINE_ID and specify
the pipeline in the child job
(job -> needs -> pipeline: $PARENT_PIPELINE_ID).
generate_child_pipeline:
    stage: build
    tags:
        - testcluster
    variables:
        NO_SLURM_SUBMIT: 1
    script: 
        - .ci/generate_jobs.sh > child-pipeline.yml
    artifacts:
        paths:
            - child-pipeline.yml
execute_child_pipeline:
    stage: test
    trigger:
        include:
            - artifact: child-pipeline.yml
              job: generate_child_pipeline
        strategy: depend
    variables:
        PARENT_PIPELINE_ID: $CI_PIPELINE_ID
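A job inside the generated child pipeline can then download artifacts from the parent pipeline like this (a sketch; build stands for a job that is assumed to exist in the parent pipeline):
test-child:
    stage: test
    needs:
        - pipeline: $PARENT_PIPELINE_ID
          job: build
    script:
        - cd $CI_PROJECT_DIR/build
        - ./test
    tags:
        - testcluster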
Disclaimer#
Be aware that
- the private SSH key is visible to all maintainers of your project. It is best to have only a single maintainer and make all other members developers.
- the CI jobs can access data ($HOME, $WORK, ...) of the CI user.
- BIOS and OS settings of Testcluster nodes can change without notification.