CI Runbooks
This page collects information on CI incident runbooks and alerts that affect the C8 monorepo CI, including runbooks on how to respond to them. For general information see the CI & Automation page.
Incident Runbooks
This section collects useful information and links for people dealing with incidents affecting the monorepo CI.
Checking Important Status Pages
When: If you suspect a problem comes from another (external) service.
What: Check the following:
- Camunda CI Platform status (Infra team)
- GitHub status (Actions, PRs, API, Git, etc.)
- DockerHub status (Docker image push/pull)
- Maven Central status (Maven artifact up-/downloads)
Temporarily Disable Tests To Lessen Impact
When: There is an incident caused by a flaky or failing test, that causes blockers for many other contributors by e.g. always failing in the merge queue. Make sure that test is safe to disable temporarily.
What: Remember these goals during an incident:
- Stop the bleeding.
- Find a proper cure/fix.
In context of CI, many people are waiting for their work to be merged. If there is a workaround for instability in tests or flakiness, apply the workaround first. Afterwards, you can invest more time into finding a well-designed solution for the problem. This will lower stress and unblock others.
One workaround can be temporarily skipping/disabling tests:
- For Java one option is the
@DisabledJUnit annotation. Locate the source code of the test, add the annotation and raise a Pull Request with your change. You always have to specify the reason and link to an incident/ticket for the proper solution.
Bypassing GitHub Merge Queue
When: If the merge queue to main or other stable/* branches is
not working (reliably) and you need to merge a change to fix an incident. This could e.g. be the
case for fixes to the release process.
What:
Bypassing the merge queue (for one PR) does not mean disabling it entirely, as that would allow other PRs e.g. from Renovate to be automatically merged even though they fail CI.
- Create a Pull Request with your fix, explain why it is needed and why the merge queue needs to be bypassed.
- Ask a repository admin to temporarily change the
unified-ci-merges-*-branch ruleset for the desired branch
to add your group (
Monorepo DevOps Team) to theBypass listand save. - Merge your Pull Request with admin override.
- Ask a repository admin to temporarily change the
unified-ci-merges-*-branch ruleset for the desired branch
to remove you from the
Bypass listand save.
Alert Runbooks
Note: Don't change the heading names below — they are required as stable links!
Merge Queue High Failure Rate
A high rate of unsuccessful runs for a GHA job in the merge queue to
main or stable/8.x branches means that PRs
cannot get merged, and engineers are blocked.
We use L2 incidents to coordinate work in mitigating and resolving the blocker. They start in triage. Please initially verify via the Job Trends dashboard whether the job has a consistently high failure rate. If so, confirm the incident.
If the failure rate is above 50%, please increase severity to L1.
The initial Incident Commander attribution is never 100% accurate so it is fine to re-assign the Incident Commander role as new information becomes available.
Troubleshooting
First analyze the symptoms of the merge queue failures of this particular job, e.g. whether the job runs into timeouts or a certain step/command fails.
When multiple alerts are firing simultaneously, check if they are all affected by a similar issue — some examples could be DockerHub availability, Nexus issues, or other.
Always consider mitigation options first in order to "stop the bleeding" and unblock other engineers quickly with temporary fixes, before starting a deep root cause analysis.
E.g. increase the timeout, use a bigger runner, disable certain unreliable steps/commands or test cases in order to ensure CI stability again.
Solutions
Depends on the kind of failure.
Push main High Failure Rate
Unsuccessful Unified CI GHA jobs on push to main branch mean that artifacts
might not get built nor uploaded. Those same jobs being green is a precondition for the
merge queue — thus a high failure rate over the last hours can indicate
a general CI instability e.g. with infrastructure or remote network services.
This can prevent artifact uploads and needs to be investigated.
Troubleshooting
Verify the high rate of unsuccessful GHA jobs (for push to main) over the last hours in the CI
Health dashboard. Drill down into the list of recent unsuccessful jobs and check their GHA logs for
common symptoms and correlate known issues.
Check for GitHub Actions outages.
Solutions
Depends on the kind of failure.
Selfhosted Runner High Disconnect Rate
Disconnected self-hosted runners mean that GHA jobs could not run until success, producing failed builds. A high disconnect rate over the last hours can indicate a general CI instability e.g. with infrastructure or a specific job.
This can block developers and needs to be investigated quickly.
Troubleshooting
Verify the high rate of disconnected self-hosted runners over the last hours in the CI Health dashboard.
Drill down into the list of recent jobs aborted due to self-hosted runner disconnects. For those jobs check the following:
- The jobs' GHA logs for errors/problems/last command before the disconnect.
- Do the jobs have anything in common? Always same branch or same job etc?
- If always same job, likely related to the job and not the infrastructure.
- Correlate the job with potential Kubernetes pod or node (
Out of memory: killed ...) problems, e.g. via GKE logs or K8s events.- If a node is getting shut down, likely related to infrastructure.
- If a node has Out Of Memory (OOM) kills, likely related to the job needing more resources, e.g. INC-2230.
Useful GKE log explorer links (time frame/scale set/runner name should be adjusted):
- Last 3h of all workload logs for the camunda-gcp-perf-core-8-default scale set (runner name can be added in quotes as well)
- Last 3h of all control plane logs affecting the camunda-gcp-perf-core-8-default-t6mr7-runner-c5qwf runner
The /ci-problems ChatOps command on PRs can be helpful to get links to
resources.
Solutions
Depends on the kind of failure. One option is to use
self-hosted runners with -longrunning suffix that are more stable
but also more expensive.
Merge Queue High Length
A high number of PRs in the merge queue for the main branch means that developers must wait more time to have their PR merged. This increases the chance of the PR becoming outdated, leading to a poor developer experience. It may also indicate congestion and/or CI problems. Such a situation can block progress in the monorepo and should be investigated.
Troubleshooting
Verify the high number of queued PRs over the last hours in the CI Health dashboard. Drill down into the list of PRs and check their GHA logs for common symptoms and correlate known issues.
Check for GitHub Actions outages.
Solutions
Depends on the kind of failure. In general, quick mitigations/workarounds are preferred to unblock developers before going into root cause detection.
Merge Queue Eviction Rate
A high number of PR evictions from the merge queue (while waiting to be merged into the main branch) typically occurs due to CI failures during the merge queue process. This results in longer wait times for developers, increases the risk of PRs becoming outdated, and contributes to developer frustration. It can slow down development in the monorepo and should be addressed promptly.
Troubleshooting
Verify the CI Health dashboard for a high eviction rate over the last few hours. Identify affected PRs and check their GHA logs for common symptoms and correlate known issues.
Check for GitHub Actions outages.
Solutions
Depends on the kind of failure. In general, quick mitigations/workarounds are preferred to unblock developers before going into root cause detection.
Snapshot Artifacts Stale
Snapshot artifacts in DockerHub and Artifactory have not been updated for more than 3 days, indicating potential CI pipeline issues that could block development teams from accessing the latest development versions.
Troubleshooting
Check which specific artifacts are stale and investigate any recent CI failures.
Solutions
Fix and rerun failing artifact publishing workflows.
Snapshot Artifacts Missing
Expected snapshot artifacts are completely missing from registries, or the artifact-metadata-exporter cannot retrieve metrics, indicating CI infrastructure failures or monitoring issues.
Troubleshooting
Check if this is a publishing issue by reviewing recent CI workflow failures, or a monitoring issue by verifying artifact-metadata-exporter service status.
Solutions
Fix any CI pipeline publishing issues. You may need to contact the Infra team for monitoring issues.
Camunda Helm Chart Integration Test Failure
You may observe one or more of the following:
- Helm install timeout.
- GitHub Actions step failure at:
Helm - Install - Install Camunda chart(exit code 201 = installation failed). - Warnings about ignored or overridden values in Helm output.
- Pods in
CrashLoopBackOff, failing readiness or liveness probes. - Pods "Running" but the service is non-functional.
- Repeated restarts.
- Supplied values not compliant with required defaults (
values.yaml).
In essence: the deployment did not complete successfully, at least one core component failed to start, or the CI job timed out before readiness.
Troubleshooting
- Retry the workflow (fast feedback) to confirm if the issue is transient.
- Inspect the
Helm - Install - Install Camunda chartstep logs:-
Look for hard errors (e.g. validation failures, timeouts).
-
Warnings like the following one should not be considered blocking unless they directly relate to a failed service startup. Example:
warning: destination for camunda-platform.connectors.security.authentication.oidc.existingSecret is a table. Ignoring non-table value ()
-
- Check the
Get failed Pods infostep:- Identify pods in
CrashLoopBackOff. - Note restart counts.
- Search for readiness / liveness / startup probe failures.
- Identify pods in
- If unclear, look at container logs for the failing pod(s).
Solution
Possible remediation actions (select based on observed symptom):
- If Helm install failed (exit code 201): Re-run the job to confirm reproducibility and capture full failure logs.
- If some component is in CrashLoopBackOff: Inspect logs and redirect the issue to the medic responsible for that component.
- If components are running but non-functional: Validate operational behavior (not just status);
look for configuration alignment with
values.yaml.
Check Licenses Workflow High Failure Rate
A high failure rate in check-licenses.yml may indicate instability in license check jobs
(analyze/single-app, analyze/optimize, or other analyze/* jobs). This does not block merges,
but it may prevent license issues from being detected/addressed and should be investigated promptly.
While some failures can be expected for pull requests introducing new licensing issues,
mainandstablebranches should stay free of errors.
Troubleshooting
- Verify failures in the CI Health dashboard.
- Review logs of failed jobs in GitHub Actions:
- License validation or dependency scan errors
- Network issues
- Runner timeouts or OOM errors
- Check for recent merges modifying configs (e.g.
check-licenses.ymlandfossa.yml). - Confirm GitHub Actions or FOSSA services are operational.
Solutions
- Retry failed jobs to rule out transient issues.
- Fix any broken dependencies or config errors.
- Escalate to Infra if failures persist.
Preview Environment Smoke Test Failure
The Preview Environment Smoke Test
runs weekly on Mondays to verify that preview environment deployments are working correctly. A
failure indicates a potential issue with the preview environment infrastructure that could affect
developers using the deploy-preview label on their PRs.
Troubleshooting
- Check the failed workflow run linked in the Slack notification:
- Identify which job failed:
Build Camunda,Build Optimize, orDeploy Preview Environment
- Identify which job failed:
- For deployment failures:
- Check ArgoCD for the smoke test app (named
camunda-smoke8x-<run_number>).- Direct URL can be found in the workflow run summary.
- Look for Kubernetes resource issues (pods not starting, image pull errors, etc.)
- Verify the preview environment Helm chart in
.ci/preview-environments/charts/c8sm/
- Check ArgoCD for the smoke test app (named
- Check for transient issues:
- Re-run the workflow manually via
workflow_dispatchto confirm whether the issue persists.
- Re-run the workflow manually via
Solutions
- Component build/startup failure: Investigate recent changes and eventually delegate to the corresponding Medic DRI.
- Infrastructure issue (ArgoCD / K8s cluster): Contact the Infra Team (
@infra-medic). - Helm chart issue: Review recent changes to
.ci/preview-environments/and the camunda-platform Helm chart.- It may be resolved by upgrading the Helm chart or by fixing
values.ymland/or other configuration files. - For issues with the camunda-platform Helm chart, help can be sought from the Distribution team
(
@distro-medic).
- It may be resolved by upgrading the Helm chart or by fixing
- Playwright Smoke Tests issue: Contact the QA team via the
#ask-qaSlack channel and tag the@test-automation-team.
Cleanup
If the smoke test environment was not torn down (due to a failure or skip-teardown), it will be
automatically cleaned up by the
preview-env-clean
workflow after 12 hours.
To manually clean up:
- Go to ArgoCD.
- Delete the corresponding application with labels
preview=smoke-testandrepo=camunda_camunda(app name:camunda-smoke8x-<run-number>), which can be found in the workflow run logs.
Unified CI High Job Runtime
A successful but slow Unified CI GHA job on push, PRs and merge queue to main or stable/8.x
branches makes all engineers wait longer and indicates a
Runtime SLO breach.
We use L3 incidents to coordinate timely improvement work to speed up the CI. They start in triage. Please initially verify via the Job Trends dashboard whether the job is consistently breaching the SLO and/or trending upwards. If so, confirm the incident.
The initial Incident Commander attribution is never 100% accurate so it is fine to re-assign the Incident Commander role as new information becomes available.
Solutions
As Incident Commander please analyze the slow Unified CI job and figure out how to make it faster, e.g. by eliminating bottlenecks, caching, or splitting it into multiple parallel jobs.