
CI & Automation

This page explains how the C8 monorepo CI and related tools like Renovate work. It serves as a knowledge base, including an FAQ, for Camunda and external contributors.

Git Branches

  • main: permanent branch for feature development of next C8 minor version (GitHub default branch)
  • stable/*: long-lived branches for maintenance of past C8 minor versions (deleted on support end)
  • release*: short-lived branches for release activities (helps to achieve code freeze) created from main or stable/*
  • any other branch: (short-lived) branches for feature development to be merged using Pull Requests, via merge queues

Available SNAPSHOT Artifacts

Maven artifacts are available on Artifactory and Docker images are available on DockerHub:

  • Pushed commits to main branch produce:
    • Maven artifacts with version 8.10.0-SNAPSHOT for all C8 components
    • Docker images with tags X.Y-SNAPSHOT and SNAPSHOT for Camunda, Operate, Tasklist, and Zeebe
    • Docker images with tags X.Y-SNAPSHOT and 8-SNAPSHOT for Optimize
  • Pushed commits to stable/8.9 branch produce:
    • Maven artifacts with version 8.9.0-SNAPSHOT for Optimize, Operate, Tasklist, Zeebe
    • Docker images with tag 8.9-SNAPSHOT for Optimize, Operate, Tasklist, Zeebe
  • Pushed commits to stable/8.8 branch produce:
    • Maven artifacts with version 8.8.0-SNAPSHOT for Optimize, Operate, Tasklist, Zeebe
    • Docker images with tag 8.8-SNAPSHOT for Optimize, Operate, Tasklist, Zeebe
  • Pushed commits to stable/8.7 branch produce:
    • Maven artifacts with version 8.7.0-SNAPSHOT for Operate, Tasklist, Zeebe
    • Docker images with tag 8.7-SNAPSHOT for Operate, Tasklist, Zeebe
  • Pushed commits to stable/optimize-8.7 branch produce:
    • Maven artifacts with version 8.7.0-SNAPSHOT for Optimize
    • Docker images with tag 8.7-SNAPSHOT for Optimize
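For example, a development build could be pulled in a workflow step like this sketch (camunda/zeebe is the assumed DockerHub repository name, and 8.10 the assumed current minor, following the Maven version above):

```yaml
steps:
  - name: Pull latest Zeebe development image
    run: docker pull camunda/zeebe:8.10-SNAPSHOT # or the floating SNAPSHOT tag
```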

Issue Tracking

All problems, bugs, and feature requests regarding the C8 monorepo CI are tracked using GitHub Issues.

For visibility and prioritization there is the Monorepo CI project board that tracks high-level issues.

New reports of issues need to be checked against the existing GitHub Issues to avoid duplication: new occurrences of existing issues should be reported in comments; otherwise, raise a new issue labelled area/build or reach out via Slack to the Monorepo CI DRI.

Prioritization

Prioritization of issues is done by the Monorepo CI DRI according to severity which follows from these criteria:

  1. Impacted functionality:
  2. Amount of users impacted:
    • Generally, severity scales with the number of affected people who interact with the monorepo (Camundi/external contributors)
    • Can be assessed via CI health metrics or on an anecdotal level
  3. Available workarounds:
    • Severity is lower if a workaround is available, especially if that workaround is easy to use/low effort

Dealing with reported issues that are identified as urgent/high severity:

  • Communicate the degraded functionality/impact and that there is an ongoing investigation to affected people.
  • Debug problems at the GitHub Actions level yourself, or involve the stakeholder teams (via their medic) or subject matter experts for advice on technical details.
  • Try to identify a (limited) workaround to unblock users.
  • Communicate any workarounds and resolution of the problem.

FAQ

Q: What do I do when I see the CI failing with a seemingly unrelated error?

A: Search the open GitHub Issues with the failure message to see if the problem is known: if you find an issue for the same problem, leave a comment with the new occurrence. Otherwise, raise a new issue labelled area/build to start tracking that CI failure, or reach out via Slack to the Monorepo CI DRI.

Q: How to deal with flaky tests that block CI?

A: Disable the flaky test(s) and comment on the existing ticket, or create a new one, noting that the flaky test needs to be re-enabled after it is fixed. No single test can be more important than the stability of the remaining CI system, which impacts dozens of developers.

GitHub Merge Queue

GitHub Merge Queue helps automate the Pull Request (PR) merging process by creating a temporary branch for each batch of PRs, running checks against the latest target branch, and merging changes only if the checks pass, ensuring a more streamlined and error-free workflow.

Merge queues exist per branch (one for main, one for stable/8.7, etc.) in the C8 monorepo and are configured independently via rulesets. Thus, different branches can have different required status checks to control which CI workflows must be green to allow merging.

FAQ

Q: Why do we use merge queues instead of manually merging PRs?

A: In repositories like the C8 monorepo, with a high number of contributing engineers and high development velocity, dozens of Pull Requests can be created and merged each day. Avoiding downtimes, like waiting for a window to merge PRs, boosts productivity and allows us to scale.

Q: Why do we have required status checks for PRs and merge queues?

A: Automated software tests increase our confidence in delivering a working software product. Required status checks are a way to technically ensure that engineers get early feedback about potential problems. This way we only merge Pull Requests to the main branch that do not fail those automated tests, which would impact quality or other engineers. This also helps with automerging dependency updates using Renovate.

Q: What are the current required status checks for PRs and merge queue to main?

A: You can find the up-to-date list here:

Q: Do those required status checks of main guarantee that all commits are green?

A: Yes, for the scope of the Unified CI except for an admin bypass of the merge queue in case of incidents.

Q: My PR had only green checks when I queued it, why was it removed from the merge queue?

A: The merge queue creates a temporary branch from the latest target branch (e.g. main) with your PR merged and then runs CI again. Your changes could be incompatible with the target branch or CI failed e.g. due to flakiness. Look up the check results for details on the CI failure.

Unified CI

"Unified CI" is the name of an approach to establish one central CI pipeline that runs checks for code changes of the whole monorepo instead of multiple unrelated, side-by-side pipelines for each component in the monorepo.

Goals

This central pipeline will use change detection to run checks only when needed, thus improving runtime and lowering cost. After migrating, the central CI pipeline will be the only GitHub required status check for PRs and the merge queue to main, thus improving UX and preventing edge cases with multiple checks and path filters.
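Change detection of this kind is commonly implemented with a path-filter step. A sketch using dorny/paths-filter (which is on the repository's allowed-actions list; the filter name and paths here are illustrative, not the actual configuration):

```yaml
# sketch: a path-filter step for change detection (names/paths are examples)
- uses: dorny/paths-filter@v3
  id: filter
  with:
    filters: |
      descriptive-job-name:
        - 'zeebe/**'
        - 'pom.xml'
# downstream jobs can then check: steps.filter.outputs.descriptive-job-name == 'true'
```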

This central pipeline will run on all Pull Requests, the merge queue to main and on push for main (and in the future other stable branches). Out of scope are scheduled and release workflows.

This topic is work-in-progress as part of #17721: remaining workflows are migrated to the central CI once they meet certain criteria (short runtimes under 10 minutes and low flakiness).

A full run of the central CI pipeline should take ideally around 15 minutes with individual jobs only taking 10 minutes of runtime at most.

GitHub Actions pipeline code should be de-duplicated for the same task and moved out of ci.yml into reusable workflows named ci-<subtask>.yml or composite actions to keep ci.yml short and lean.

Non-goal: Workflows that don't trigger on code changes will not be part of the Unified CI, like scheduled workflows.

Workflow Inclusion Criteria

Workflows that seek inclusion to the Unified CI (and thus GitHub required status checks) need to fulfill the following criteria and best practices:

If the required short runtime cannot be achieved, consider moving long-running tests into nightly jobs or standalone workflows that are not required status checks and do not run in the merge queue (to preserve merge velocity).

Implementation

This section explains how to include a CI check in the Unified CI as a required status check so that it is executed only when relevant files changed in a PR.

To include a workflow fitting the criteria into the Unified CI all of the following steps have to be taken for each job of that workflow:

  1. Change Detection: Define path filters for all file changes that should trigger the new job in this composite action. This information is relevant for the next step; make sure to:
    • Add a new output to the composite action representing the condition under which the new job should be triggered.
    • The output must have the same name as the new job it triggers.
    • The output condition should reuse existing filters and combine them as needed.
    • If no existing filter matches, add a new one in a step.
    • Adjust the detect-changes job to re-expose the new output under the same name.
  2. CI Check: Relies on the previous step to run the new job only if relevant files changed. Add the new job definition to the ci.yml file, by:
    • Following this pattern:

      descriptive-job-name:
        # reuse information from change detection on whether to run this job
        if: needs.detect-changes.outputs.descriptive-job-name == 'true'
        needs: [detect-changes]
        runs-on: ubuntu-latest # or other
        timeout-minutes: 30 # or less
        permissions: {} # unless GITHUB_TOKEN is needed
        steps:
          - uses: actions/checkout@v4
          #
          # ...ACTUAL CI CHECK STEPS HERE...
          #
          - name: Observe build status
            if: always()
            continue-on-error: true
            uses: ./.github/actions/observe-build-status
            with:
              build_status: ${{ job.status }}
              secret_vault_address: ${{ secrets.VAULT_ADDR }}
              secret_vault_roleId: ${{ secrets.VAULT_ROLE_ID }}
              secret_vault_secretId: ${{ secrets.VAULT_SECRET_ID }}
    • It is important to depend on the detect-changes job and use the newly defined output as a condition.

    • If the new job has many steps, you need to refactor them into a reusable workflow or composite action to keep ci.yml lean.

    • Adding observability for CI health is required.

  3. Results Check: Include the new job as a needs dependency of the check-results job (required status check). This is needed so that the Unified CI is marked as a failure if one of its jobs fails.
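The aggregation described in step 3 could look roughly like the following sketch (job names are illustrative; the real list of dependencies lives in ci.yml):

```yaml
# sketch: an aggregate job serving as the single required status check
check-results:
  if: always() # run even when upstream jobs were skipped by change detection
  needs: [detect-changes, descriptive-job-name]
  runs-on: ubuntu-latest
  steps:
    - name: Fail if any needed job failed or was cancelled
      if: contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled')
      run: exit 1
```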

CI Test Files

Ownership

Each CI test file has an owning team. The owning team can be found either through the CODEOWNERS file or in the metadata in the file itself. The CODEOWNERS file is organized and broken down by team; any additions to the file should follow that convention. The metadata on a GHA workflow file is used by a scraping tool so that it is easy to gather information about the current state of CI. You can look at the metadata for a quick overview of the owning team, where the tests live, how the test is called, and a description of what the file is actually testing.

Metadata follows this structure and is placed at the beginning of a GHA workflow file:

# <Description of what the GHA is running and what is being tested>
# test location: <The filepath of the tests being run>
# owner: <The name of the owning team>

Legacy CI

"Legacy CI" is a name for CI tests that have not been migrated to the Unified CI. Legacy tests do not meet the inclusion criteria for the Unified CI, such as running under 10 minutes.

Tests that are marked as Legacy are to be migrated to the Unified CI by the owning team in the future. Once migrated, the test should live inside the ci.yml file or be part of a workflow file that is called by it. The "Legacy" label should be removed as well.

Consolidated Unit Tests

The Consolidated Unit Test job in the Unified CI runs unit tests by team and component (for example, Operate tests owned by the Data Layer team). These tests are run via JUnit 5 Suites. Each suite selects which tests to run by package. This enables the CI job to run a subset of all tests in a module, so that the tests being run are relevant to the owning team. Any new test package should be added to the relevant suite.

Suite names must follow the naming convention {componentName}{team}TestSuite. The composite of the component and the team is used by the CI job to select which component and team to run the tests for. For example, OperateCoreFeaturesTestSuite is used to run Core Features tests on Operate.

Naming Conventions

Names for CI tests are composed by GitHub Actions as a combination of CI job names. The composed name is shown on PRs and when viewing an individual test run in the GitHub UI. The composed name should follow the naming convention below to ensure consistency and clarity across the CI system, and to make it easy to identify the owning team and which component is being tested.

For tests in the Unified CI, the name should be structured as follows:

CI / <componentName> / [<testType>] <testName> / <ownerName> / ...

testType can be things like: UT for Unit Tests, IT for Integration Tests, Smoke for smoke tests, etc.

For example, Core Features Unit Tests for Tasklist would appear as

CI / Tasklist / [UT] Core Features / Run Unit Tests

Importer Integration Tests for Operate would appear as

CI / Operate / [IT] Importer Tests / Data Layer / run-test

Names for Legacy tests should be prefixed with [Legacy] <componentName> so that Legacy tests are organized and appear together when run on a PR. The rest of the name should describe what the test is doing.

Renovate

Renovate is a bot and GitHub app that automates dependency updates in software projects by scanning the source code for outdated libraries and applications, then creating Pull Requests to upgrade them to the latest versions, which helps keep the project secure and up-to-date.

Renovate supports many package ecosystems, of which we use e.g. Maven, NPM, Docker, and Helm. It can scan multiple branches (e.g. main, stable/8.5) inside one repository and raise PRs independently for each.

Renovate is configured via a JSON configuration file on the main branch. In general we allow Renovate to run and create PRs at any time to avoid lagging behind with updates.

We also want Renovate to automatically merge dependency updates when CI is green and automated tests are passing. Assuming nearly complete test coverage, the efficiency gains outweigh the risks. This is achieved by Renovate requesting to put every Pull Request into the GitHub Merge Queue; GitHub will then ensure that required status checks pass before merging the PR.

We additionally use the renovate-approve bot to circumvent the PR reviewer requirements.

Pull Request labels that have a special meaning for Renovate:

  • dependencies: added by our Renovate configuration to designate dependency PRs
  • automerge: added by our Renovate configuration to designate dependency PRs that should get automatically merged
  • area/security: added by our Renovate configuration to designate dependency PRs that fix a security vulnerability
  • stop-updating: can be added by humans to tell Renovate to not rebase an open PR anymore (if the change is breaking anyway)
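For orientation, a minimal Renovate configuration applying such labels might look like this sketch (values are illustrative, not the actual monorepo configuration; the options shown are standard Renovate settings):

```json
{
  "extends": ["config:recommended"],
  "labels": ["dependencies"],
  "packageRules": [
    {
      "matchUpdateTypes": ["minor", "patch"],
      "automerge": true,
      "addLabels": ["automerge"]
    }
  ]
}
```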

FAQ

Q: Why do we use Renovate instead of manually looking for dependency updates?

A: We automate repetitive and error-prone tasks as much as possible to save valuable Engineering time for solving problems requiring more creativity, e.g. complex major version upgrades of dependencies.

Q: Why do we use Renovate instead of Dependabot etc.?

A: Renovate is more flexible, supports more package ecosystems, has more detailed configuration options, and is already used successfully in other places at Camunda, so we can reuse existing experience.

Q: Why does Renovate attempt to merge a PR with failing status checks?

A: Renovate will always try to automerge dependency update PRs since it does not know about CI failures. It is GitHub's task to enforce required status checks and reject the merge attempt - as long as no PR with failing status checks actually gets merged, everything is working as intended.

Q: Why does Renovate not detect dependency XYZ?

A: Renovate parses and analyzes most well known dependency management files (e.g. pom.xml) automatically. Not detecting a dependency can be due to an unrecognized file format, a typo in the name, a bug in Renovate or the dependency being missing from the package ecosystem. This will usually be reported in the Renovate logs.

Q: How to access the Renovate logs?

A: Click on the most recent run in the Renovate Dashboard and make sure to show debug information.

Q: Why are updates for dependency XYZ ignored in the Renovate configuration file?

A: The reasons for manually ignoring certain updates should be described in the comments. Using git annotate to figure out who added the ignore rule can also be a way to get more details.

CI Health Metrics

There are hundreds of CI jobs running each day in the C8 monorepo CI due to high development activity. This scale makes it challenging to assess whether there are any structural problems related to the "CI health" (e.g. reliability issues) that would impact developer productivity.

To assess this, we collect metrics for CI jobs, like build times, build failures, and information about the hardware/runner, via the CI Analytics framework. See how to instrument GHA workflows for metrics collection. We use the collected data for visualizations to get an overview of the CI health.

This topic is work-in-progress as part of #18210 to achieve better coverage, collect more diverse metrics for additional insights, and establish a process for dealing with the results.

Metrics Collection

Any job in any GitHub Actions workflow can be instrumented to collect information about the build status by adding one step at the end, like the following snippet shows:

jobs:
  my-solo-job-name:
    steps:
      # initial checkout is required!
      - uses: actions/checkout@v4
      # keep all other steps here, then insert final step:
      - name: Observe build status
        if: always()
        continue-on-error: true
        uses: ./.github/actions/observe-build-status
        with:
          build_status: ${{ job.status }}
          secret_vault_address: ${{ secrets.VAULT_ADDR }}
          secret_vault_roleId: ${{ secrets.VAULT_ROLE_ID }}
          secret_vault_secretId: ${{ secrets.VAULT_SECRET_ID }}

Special handling is needed for matrix jobs, since the job name is not unique among the different matrix builds; see below:

jobs:
  my-matrix-job-name:
    strategy:
      matrix:
        identifier: [configurationA, configurationB]
    steps:
      # initial checkout is required!
      - uses: actions/checkout@v4
      # keep all other steps here, then insert final step:
      - name: Observe build status
        if: always()
        continue-on-error: true
        uses: ./.github/actions/observe-build-status
        with:
          job_name: "${{ env.GITHUB_JOB }}/${{ matrix.identifier }}"
          build_status: ${{ job.status }}
          secret_vault_address: ${{ secrets.VAULT_ADDR }}
          secret_vault_roleId: ${{ secrets.VAULT_ROLE_ID }}
          secret_vault_secretId: ${{ secrets.VAULT_SECRET_ID }}

Visualization

We visualize the collected data using an internal Grafana dashboard to analyze for high build failure rates in general and breakdowns per CI job.

CI Secret Management

All GitHub Actions workflows of the C8 monorepo CI must use Vault to retrieve secrets, e.g. with the HashiCorp Vault action, as a best practice. Other approaches like GitHub Actions Secrets will be sunset (outside of bootstrapping the connection to Vault).

Historically, different paths have been used in Vault to store secrets depending on the managing team, e.g. products/zeebe/ci or products/operate/ci. This scheme can lead to redundancies in a monorepo and should be aligned for more synergy.

Secrets for the C8 monorepo CI should be stored in Vault under the path products/camunda/ci/*. Manually managed secrets should go into products/camunda/ci/github-actions.

CI Self-Hosted Runners

GitHub allows customers to use their own machines to execute GitHub Actions workflows via self-hosted runners. We use this feature when more resources are needed than GitHub can provide, or at a cheaper price. See the internal documentation for what is available.

Usage Guidelines

How to choose which runner to use for a GHA workflow:

  1. Use GitHub-hosted runners by default (free for public repositories)
  2. Use self-hosted runners (with -default name suffix) when a workflow needs:
    1. more resources (memory, CPU) than available on GitHub-hosted runners
    2. ARM CPU architecture

The -default self-hosted runners have no durability guarantees, which makes them very cheap and the default choice if GitHub-hosted runners are not sufficient. Exception: in case of reliability problems, one can use the -longrunning suffix after approval by the Monorepo CI DRI.
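In workflow code, this choice boils down to the runs-on value. A sketch (the self-hosted runner label is a placeholder; see the internal documentation for the real runner names):

```yaml
jobs:
  light-job:
    runs-on: ubuntu-latest # GitHub-hosted, the default choice
  heavy-job:
    runs-on: some-pool-default # placeholder for a self-hosted "-default" runner label
```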

GitHub Actions Cache

Workflows run by GitHub Actions can avoid repeated downloads of tools and dependencies by using the GitHub Actions Cache. This can shorten or avoid download times, making workflow executions faster, more robust, and cheaper.

There is a web UI and a CLI available, which can be used to view, analyze, and delete corrupted cache entries.

Important facts about the GHA cache:

  • size: 10 GiB
    • this is small for a monorepo use case with Java, NodeJS, and many open PRs
  • access restrictions: workflow runs can restore caches created on either the current branch or the default branch
    • caches created for main are very useful
    • caches created on other branches/Pull Requests are of very limited use and can only be used by subsequent builds of the same PR
  • cleaning policy: GitHub will immediately delete old cache entries when we exceed the 10 GiB total size
    • counter-intuitive: caches from main (more useful than those from PRs) are deleted first if they are the oldest

Metrics on cache usage are available in the CI Health Grafana dashboard (internal).

Caching Strategy

To make the most efficient use of the limited GHA cache resources available in the monorepo and ensure consistency across many GHA workflows, we follow these guidelines:

  1. Docker/BuildKit layers: don't write to the GHA cache
  2. Java/Maven dependencies: do write to the GHA cache only from main and stable* branch builds
  3. NPM/Yarn dependencies: do write to the GHA cache only from main and stable* branch builds
  4. Golang dependencies: do write to the GHA cache only from main and stable* branch builds
  5. CodeQL automation by GitHub: writes to the GHA cache from Pull Requests (speeds up analysis)

Implementation:

  1. Do not use the cache-from: type=gha and cache-to: type=gha parameters of docker/build-push-action.
  2. Use the setup-maven-cache action.
  3. Use the setup-yarn-cache action; see the usage example in #21607.
  4. No implementation yet, since Golang usage is low.
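Applied in a workflow, guidelines 1 and 2 might look like this sketch (the setup-maven-cache action path is an assumption; check the repository for the actual location):

```yaml
steps:
  - uses: actions/checkout@v4
  # Maven dependencies: restored from cache, written only on main/stable* builds
  - uses: ./.github/actions/setup-maven-cache # path assumed
  # Docker build: note the absence of cache-from/cache-to "type=gha" parameters
  - uses: docker/build-push-action@v6
    with:
      push: false
```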

Disable cache restoration for a Pull Request

You can temporarily turn off the cache restore functionality in a PR by using the /ci-disable-cache command as described under ChatOps. This can be useful to test GHA workflows without the caching mechanism. To restore standard functionality, issue the /ci-enable-cache command or drop the empty commit.

Note: Disabling the cache restore mechanism is only possible on PRs.

CI Security

Permissions of GITHUB_TOKEN

Every GHA workflow job is given a GITHUB_TOKEN environment variable with a valid GitHub API token by default. This token can have wide permissions which are unnecessary and open up attack surface, reducing security.

Best Practice: All GHA workflow jobs must request only the actually required permissions on the GITHUB_TOKEN. Set permissions: {} by default and add what is needed.
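A minimal sketch of that pattern (the job name and the pull-requests permission are illustrative):

```yaml
permissions: {} # workflow-level default: no permissions for any job

jobs:
  comment-on-pr:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write # only what this job actually needs
    steps:
      - run: echo "This job's GITHUB_TOKEN can only write PR comments"
```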

Usage of Third Party GitHub Actions

GitHub Actions has a large ecosystem of existing useful actions from GitHub and third parties such as other companies and individuals. While reusing existing actions avoids code duplication and maintenance effort for Camunda, it increases the attack surface should any of those actions be hacked to perform malicious tasks.

Best Practice: To balance utility with risk, all GHA workflows must follow this policy:

  • Use the same action for the same (or similar) automation task, see recipes.
  • Use actions only from trusted sources (GitHub or small set of select 3rd parties, settings).
  • Move actions from Camundi personal accounts to camunda for long-term maintenance, or find a replacement.

For camunda/camunda GHA workflows we use a GitHub feature to technically limit which actions can be used to:

  • Allow actions created by GitHub in the actions and github organizations.
  • Allow actions in any Camunda GitHub Enterprise organization like camunda, bpmn-io, etc.
  • Allow specific actions from 3rd parties that we need (see the full list below).

If you need to use a 3rd party action not on the list, ask the Monorepo DevOps team via Slack and explain the motivation.

Details

List of allowed 3rd party actions and reusable workflows: EnricoMi/publish-unit-test-result-action@, YunaBraska/java-info-action@, asdf-vm/actions/install@, atomicjar/testcontainers-cloud-setup-action@, aws-actions/configure-aws-credentials@, blombard/move-to-next-iteration@, bobheadxi/deployments@, browser-actions/setup-firefox@, bufbuild/buf-action@, cloudposse/github-action-matrix-outputs-read@, cloudposse/github-action-matrix-outputs-write@, clowdhaus/argo-cd-action@, codex-/return-dispatch@, dcarbone/install-jq-action@, deadsnakes/action@, dlavrenuek/conventional-changelog-action@, docker/build-push-action@, docker/login-action@, docker/metadata-action@, docker/setup-buildx-action@, docker/setup-qemu-action@, dorny/paths-filter@, fjogeleit/http-request-action@, geekyeggo/delete-artifact@, golangci/golangci-lint-action@, google-github-actions/auth@, google-github-actions/get-gke-credentials@, google-github-actions/setup-gcloud@, hadolint/hadolint-action@, hashicorp/setup-terraform@, hashicorp/vault-action@, hoverkraft-tech/compose-action@, jamesives/github-pages-deploy-action@, joelanford/go-apidiff@, jwalton/gh-docker-logs@, korthout/backport-action@, lewagon/wait-on-check-action@, marocchino/sticky-pull-request-comment@, mavrosxristoforos/get-xml-info@, misiekhardcore/infra-report-action@, mshick/add-pr-comment@, mxschmitt/action-tmate@, ncipollo/release-action@, nick-fields/retry@, octokit/@, peaceiris/actions-gh-pages@, peter-evans/create-or-update-comment@, peter-evans/find-comment@, peter-evans/slash-command-dispatch@, redhat-actions/oc-login@, requarks/changelog-action@, rodrigo-lourenco-lopes/move-to-current-iteration@, rossjrw/pr-preview-action@, s4u/maven-settings-action@, s4u/setup-maven-action@, slackapi/slack-github-action@, snyk/actions/setup@, stCarolas/setup-maven@, stefanzweifel/git-auto-commit-action@, teleport-actions/auth-k8s@, teleport-actions/auth@, teleport-actions/setup@, test-summary/action@, tibdex/github-app-token@, wagoid/commitlint-github-action@

Preview Environments

Engineers can request Preview Environments for specific Pull Requests of the C8 monorepo to be available via a designated URL, to allow more thorough testing and demonstration of the product features before the feature branches are merged into the base branch. For the C8 monorepo the components Identity, Operate, Optimize, Tasklist and Zeebe will get provisioned based on the camunda-platform Helm chart.

Assign the deploy-preview label to any PR to request creation of a Preview Environment. The following base branches are supported, with their corresponding Helm chart versions:

  • main (uses Helm chart version camunda-platform-8.8-13.x)
  • stable/8.8 (uses Helm chart version camunda-platform-8.8-13.x)
  • stable/8.7 (uses Helm chart version camunda-platform-8.7-12.x)
  • stable/optimize-8.7 (uses Helm chart version camunda-platform-8.7-12.x)

Creation may take a while, and a PR comment including a URL and additional info will be sent as notification. The creation/update of a Preview Environment may fail for various reasons, including:

  • compilation errors on any code in the C8 monorepo
  • Docker image build errors
  • backwards incompatible changes in the upstream camunda-platform Helm chart
  • bugs preventing successful startup of any included C8 component

Preview Environments are provisioned on cheap but sometimes less reliable hardware to be cost efficient, and can get automatically stopped after inactivity.

Backporting Guidelines

We want crucial security, stability, cost, and other CI improvements applied to all long-living Git branches in the C8 monorepo.

Why we need CI backports

Changes affecting the CI such as introducing new jobs, new observability features or stability fixes are usually developed first on the main branch. We also have several stable/* branches living for multiple years to release maintenance updates.

Due to how Git branches work, every stable/* branch has its own copy of all GHA workflows from the time of forking. Those GHA workflows receive automated Renovate updates for actions, but every human-made CI change needs to be at least considered for manual backporting to ensure that crucial improvements land on all relevant branches.

How to backport CI changes

Follow these instructions to backport PRs with CI changes.

It may be required to resolve Git conflicts when backporting CI changes.

When to backport CI changes

If the CI change matches one of the following:

  • is security-related (incl. dependency updates, permissions): MUST backport to all stable/* branches
  • is related to cost reduction, increased reliability or observability: SHOULD backport to all stable/* branches
  • is an in-repository documentation change: SHOULD backport if it:
    • updates procedures, guidelines, or troubleshooting steps that AI agents need for accurate assistance on stable branches
    • fixes incorrect information that would mislead AI agents or developers working on stable versions
    • adds new knowledge about practices, tools, or procedures that apply to stable branch development
    • corrects build, test, or development instructions that affect stable branch workflows

Documentation-specific backporting (monorepo-docs/* folders)

When touching monorepo-docs/* folders, use these guidelines:

DO backport:

  • Critical error corrections in docs
  • Security-related documentation updates
  • Fixes preventing user confusion about that specific version

DON'T backport:

  • New feature documentation
  • Reorganization/restructuring changes
  • Style/formatting improvements

Rationale: Documentation in stable branches serves as AI context for developers working on that specific version. Only backport changes that fix critical errors or security issues, while avoiding feature additions that don't exist in that release.

  • is a new CI job for new product feature or test cases: backport only if the product feature is backported
  • is a new CI feature: backport only if required in the ticket
  • is related to an on: schedule GHA workflow: no need to backport, only works on main
  • is related to Preview Environments: no need to backport, only supported on main

Slack Notifications

All CI workflows in the camunda/camunda monorepo must use the "C8 Monorepo Notifications" Slack app. Messages to Slack should be sent via webhooks. The webhook URLs are secrets and are stored in Vault for each Slack channel.

If you need to send Slack messages to a channel for which no webhook URL exists yet, reach out via Slack to the Monorepo CI DRI to request one. They will then generate a new webhook URL for the "C8 Monorepo Notifications" Slack app and store it in Vault.

Webhook URL secrets can be retrieved from Vault in GitHub Actions workflows like this:

job-with-notification:
  steps:
    - uses: actions/checkout@v4

    - name: Import Secrets
      id: secrets
      uses: hashicorp/vault-action@4c06c5ccf5c0761b6029f56cfb1dcf5565918a3b # v3.4.0
      with:
        url: ${{ secrets.VAULT_ADDR }}
        method: approle
        roleId: ${{ secrets.VAULT_ROLE_ID }}
        secretId: ${{ secrets.VAULT_SECRET_ID }}
        exportEnv: false # we rely on step outputs, no need for environment variables
        secrets: |
          secret/data/products/camunda/ci/github-actions SLACK_MYCHANNELNAME_WEBHOOK_URL;

    - name: Send notification
      uses: slackapi/slack-github-action@v2
      with:
        webhook: ${{ steps.secrets.outputs.SLACK_MYCHANNELNAME_WEBHOOK_URL }}
        webhook-type: webhook-trigger
        # For posting a rich message using Block Kit
        payload: |
          blocks:
            - type: "section"
              text:
                type: "mrkdwn"
                text: "Hello World"
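Under the hood, the Slack action simply POSTs a JSON Block Kit payload to the webhook URL. A minimal shell sketch of the equivalent raw call (WEBHOOK_URL is a placeholder, not a real secret):

```shell
# Build the same Block Kit payload as in the workflow above and print it.
payload='{"blocks":[{"type":"section","text":{"type":"mrkdwn","text":"Hello World"}}]}'
echo "$payload"
# Uncomment to actually send (requires a valid webhook URL retrieved from Vault):
# curl -sf -X POST -H 'Content-Type: application/json' -d "$payload" "$WEBHOOK_URL"
```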

ChatOps

In the camunda/camunda monorepo certain automated workflows can be triggered by posting comments with commands on GitHub Issues and/or Pull Requests. Those commands are then processed by a GitHub Actions workflow.

Available commands:

  • /ci-problems comment on a Pull Request:
    • Synopsis: Triggers a script that analyzes all CI runs related to that PR for CI failures and posts a summary as a new PR comment.
    • Use case: Can be used by any engineer to get actionable hints on how to address CI problems in a PR.
    • Capabilities:
      • detects problems with self-hosted runners (incl. links to dashboards and Kubernetes logs)
      • detects pipeline timeouts
      • detects DockerHub connection problems
      • provides deep links to GHA logs for generic job failures
  • /ci-disable-cache comment on a Pull Request:
    • Synopsis: Adds the label ci:no-cache to the Pull Request and creates a new empty commit to trigger a new CI run without cache restoration.
    • Use case: Can be used by any engineer to test how workflows run from scratch, without cache restoration.
  • /ci-enable-cache comment on a Pull Request:
    • Synopsis: Removes the ci:no-cache label from the Pull Request and creates a new empty commit to trigger a new CI run.
    • Use case: Complements the /ci-disable-cache command and restores the regular CI cache restoration step.

Flaky tests

Tests are called "flaky" when they do not pass consistently even though the circumstances (source code, test code, execution environment) remain the same. Flaky tests are caused by some inherent unreliability in the system. This needs to be avoided by improving the source code or test code, to improve developer experience and allow smooth automated dependency updates.

GitHub Action workflows with Maven testing Java code should use the flaky-test-extractor-maven-plugin and report the resulting detailed flaky test statistics to our CI health database.

Please use the CI stress testing functionality to avoid introducing new flaky tests.

Related resources:

flaky-test-extractor-maven-plugin

Some Maven modules in the monorepo rerun failing Java tests multiple times (e.g. 3 times, configurable) and use the flaky-test-extractor-maven-plugin:

  • if a test succeeds at least once during the retries, it is classified as "flaky" by this plugin
  • if a test fails on all retries, it is classified as "failed"
    • this will cause the whole build to fail
    • see the FAQ on how to deal with such cases
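The classification rule can be sketched as a small shell function (an illustration of the rule, not the plugin's actual code):

```shell
# Classify a retried test from the outcomes of its retries:
# passes at least once -> "flaky"; never passes -> "failed".
classify() {
  for outcome in "$@"; do
    [ "$outcome" = pass ] && { echo flaky; return; }
  done
  echo failed
}
classify fail fail pass   # -> flaky
classify fail fail fail   # -> failed
```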

To reduce friction, we rerun failing tests up to 4 times on Pull Requests and up to 7 times in the merge queue and on main and stable/* branches.

Stress Testing

There are a few jobs in the Unified CI which frequently contain flaky tests. If you modify Java source or test code used by these CI jobs, you need to run a CI stress test on your PR to avoid introducing new flaky tests:

  • Zeebe Engine QA
  • Zeebe Cluster QA
  • RDBMS ITs

To run a CI stress test, put the ci:stress-test label on any Pull Request and then push a code change (e.g. an empty commit), or re-run the Unified CI ci.yml workflow manually.

This will lead to 10 concurrent instances of each of the above CI jobs being started. During a CI stress test, failed tests are not retried, so that flakiness becomes visible as failures.

The results are visible via the usual Pull Request comment featuring the "CI failure summary": as soon as there is any failed CI job like RDBMS ITs (x) showing up, the stress test found an unreliability that must be fixed before merging.
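The steps to start a stress test can be sketched as follows, printed as a dry run (the PR number 1234 and commit message are placeholders; drop the `run` wrapper to actually execute the commands):

```shell
# Dry run: print each command instead of executing it.
run() { echo "+ $*"; }

run gh pr edit 1234 --add-label "ci:stress-test"
run git commit --allow-empty -m "chore: trigger CI stress test"
run git push
```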

License Checks

We use FOSSA to check dependencies for license compliance with Camunda's policies, in order to detect risks early and stay aware of the licenses in our bill of materials (BOM).

The scan is performed by a GitHub Actions workflow for:

  • each tag
  • each commit pushed to main and stable/* (8.6+) branches
  • each Pull Request opened against the above branches

AI Developer Tooling (MCP Servers)

Model Context Protocol (MCP) servers extend Copilot agent mode with additional tools — for example, fetching up-to-date library docs, interacting with GitHub, or querying the Camunda API.

Setup

Run the following once after cloning (requires jq):

make vscode-sync-all

This merges the repository-recommended MCP servers and terminal auto-approve rules into your local .vscode/mcp.json and .vscode/settings.json. Your existing config is preserved. See scripts/vscode-config-sync/README.md for details.

Current baseline is intentionally minimal: GitHub MCP + Context7 MCP.

  • GitHub MCP (✅ Active): Issues, PRs, code search, via https://api.githubcopilot.com/mcp/ (no token needed with Copilot)
  • Context7 MCP (✅ Active): Up-to-date library and framework documentation for Copilot
  • Camunda MCP (🔜 Pending): Camunda Orchestration Cluster API tools, tracked in #43560, not yet available

Optional MCP Servers (self-managed)

The servers above are pre-configured. Additional tools are available but not installed by default — opt in as needed.

Browser automation (Playwright MCP)

Gives Copilot the ability to navigate pages, take snapshots, click and fill forms. Two options:

  • Remote HTTP (simplest, no install): add { "type": "http", "url": "https://playwright.microsoft.com/mcp" } to .vscode/mcp.json
  • Docker stdio (isolated, custom config): docker pull mcp/playwright, then add { "type": "stdio", "command": "docker", "args": ["run", "-i", "--rm", "mcp/playwright"] }. Works with Docker Engine only — no Docker Desktop needed.
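As an illustration, the two options above could look like this in .vscode/mcp.json (the server names playwright-remote and playwright-docker are arbitrary; pick one option, not both):

```json
{
  "servers": {
    "playwright-remote": {
      "type": "http",
      "url": "https://playwright.microsoft.com/mcp"
    },
    "playwright-docker": {
      "type": "stdio",
      "command": "docker",
      "args": ["run", "-i", "--rm", "mcp/playwright"]
    }
  }
}
```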

To verify, ask Copilot in agent mode: "Navigate to https://github.com/camunda/camunda and take a snapshot of the page".

Docker MCP Toolkit

The Docker MCP Toolkit is worth exploring when you need to run a containerised MCP server with custom environment variables, volume mounts, or secrets injection. For plain HTTP servers (GitHub, Context7) it adds no value. The CLI plugin works standalone on Docker Engine without Docker Desktop.

Copilot Agent Tool Restrictions

The chat.tools.terminal.autoApprove setting in .vscode/settings.json controls which terminal commands Copilot agent mode can run without prompting. The repository template (.github/settings.json.template) pre-configures safe read-only commands (e.g. git status, ./mvnw spotless:apply, actionlint) and explicitly denies destructive ones (e.g. rm, curl to external hosts).
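An illustrative fragment of such a configuration (not the repository template verbatim; entries map a command pattern to whether it is auto-approved):

```json
{
  "chat.tools.terminal.autoApprove": {
    "git status": true,
    "actionlint": true,
    "rm": false
  }
}
```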

Troubleshooting

How to deal with CI alerts that fire?

Follow the Monorepo CI medic routines and check out the available CI runbooks for each alert.

Why is my CI check failing?

There can be many factors influencing that, and it is sometimes hard to find the root cause. The list below should provide guidance:

  1. Rerun the failing CI check(s) at least once to rule out transient problems.
  2. If on a Pull Request, consider whether the code changes on that Pull Request might cause the CI check failure.
  3. Check if there is an open issue about that failing CI check, e.g. by searching for the error message.
  4. Ask Copilot about the failure using the Explain error button
    • If this doesn't help, search the web for the error message.
  5. If the failing CI checks are not part of the Unified CI, contact their owner and see if those CI checks are known to be unstable or flaky.
    • Technically, failing CI checks outside the Unified CI do not prevent merging a PR. If the owners of those failing CI checks agree, you can still merge.
  6. If your PR is removed from the merge queue, check if concurrently there was another PR merged that changes code which your code depends on (e.g. leading to compilation errors in the merge queue).
  7. If a check from the Unified CI is failing on main or a stable branch, try to find the first build with that failing check and investigate the recently merged code changes.
    • Experience shows that most CI check failures are (indirectly) caused by camunda/camunda code changes, and not by external factors like 3rd party services or infrastructure.
    • CI Health metrics can also be used to narrow down the time range, though less precisely.
  8. Reach out on Slack for help!

How to verify that a CI check is robust and stable, not flaky?

First, create a dedicated branch YOURBRANCHNAME which can be used as a reference for running the CI check later.

If you are working on fixing a flaky test, push the code or build pipeline change(s) that you believe remove the flakiness onto that branch.

Is your CI check part of the Unified CI's ci.yml?

  1. No, but it runs on Pull Requests. Then you have to create a draft PR for YOURBRANCHNAME and manually trigger reruns of the check in question.

  2. Yes! Then you can use the GitHub CLI tool to start repeated runs.

    Optional: remove CI checks you are not interested in from the ci.yml on YOURBRANCHNAME to speed up the execution and save resources.

    Open a new terminal, go to your checkout of camunda/camunda and execute in bash shell:

    for i in {1..10}; do gh workflow run ci.yml --ref YOURBRANCHNAME; done

    This loop will take a while (1 hour or more depending on the CI check), so let it run in the background. After it has finished, visit https://github.com/camunda/camunda/actions/workflows/ci.yml?query=branch%3AYOURBRANCHNAME and see if there are any failures (which indicate a lack of robustness).