Flaky Test Gate (sticky alerts)

The flaky test gate is a CI quality gate in the Unified CI that blocks PRs from introducing new flaky tests. Once a test is flagged on a PR, the alert is sticky — it remains until either the test is actually fixed or the bypass label is applied. A re-run that happens to pass does not silence the alert.

This page is the canonical reference for how the gate behaves: it covers the rules, the comment UX, the workflow integration, and the operational concerns. The implementation lives in .github/actions/detect-new-flaky-tests/ and is wired into .github/workflows/ci.yml as the detect-new-flaky-tests job.

Why sticky?

Before sticky alerts, the bot resolved the comment whenever a subsequent CI run found no new flakes. Any lucky retry therefore silenced the alert: authors could hit "Re-run jobs" until CI went green and merge with the latent flake still present. An empirical audit of the 6-day window after PR #53683 found that 4 out of 4 "resolved" alerts in that window had no code fix; all were pure retry luck.

The sticky design closes that loophole. The trade-off is friction: real flaky-test fixes now need to demonstrate that they hold across multiple runs before the gate releases the PR.

Clearance rules

An alert entry stays active until one of these conditions is met:

| Path | Trigger | Resulting status |
| --- | --- | --- |
| Method fix | The flagged test's method body is modified, AND the originally affected job runs clean (no Maven retries) for at least 3 subsequent gate runs. | `cleared_via_fix` |
| Bypass label | The `ci:flaky-test-bypass` label is applied to the PR. | `cleared_via_bypass` |
| Force-push drop | The PR is rewritten so the original flagging commit is no longer reachable, AND the test does not flake in the new history. | `dropped_force_push` |

"Clean run" has a strict definition (Def-2):

- The test was actually observed in this run (the job wasn't skipped/cancelled), AND
- The test did not appear in FLAKY.xml (passed first try, zero Maven retries).

A Maven retry-pass (test failed once and passed on retry) still ends up in FLAKY.xml, which resets the counter to 0. The gate considers any retry as evidence the fix did not hold.

The counter is per-test and scoped to the same parent job as the original flake. Matrix entries roll up to the parent name — e.g. identity-tests/identity-tests - elasticsearch9 rolls up to identity-tests. A clean run in a different job does not advance the counter.
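The roll-up rule can be sketched as follows. This is a minimal illustration, not the action's code: the function names are hypothetical, and it assumes the parent name is everything before the first `/` in the job name, which matches the example above but is not confirmed for every job.

```python
def parent_job(job_name: str) -> str:
    """Roll a matrix entry up to its parent job name.

    Assumption: the parent is the part before the first '/'.
    """
    return job_name.split("/", 1)[0].strip()


def counts_toward_counter(flagged_jobs: list[str], clean_job: str) -> bool:
    """A clean run only counts if it ran in the same parent job as the flake."""
    flagged_parents = {parent_job(job) for job in flagged_jobs}
    return parent_job(clean_job) in flagged_parents


flagged = ["identity-tests/identity-tests - elasticsearch9"]
print(parent_job(flagged[0]))                               # identity-tests
print(counts_toward_counter(flagged, "general-unit-tests"))  # False
```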

Re-runs of the same SHA count just like runs on later commits, so authors don't have to push throwaway commits to bank evidence. The counter only starts once the method has actually been modified, so re-run spam alone does not advance it.
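Taken together, the counter rules in this section might look like the sketch below. The function and parameter names are hypothetical. Leaving the counter unchanged when the test was not observed is an assumption: the rules above only say that such a run is not a clean run.

```python
from typing import Optional

REQUIRED_CLEAN_RUNS = 3  # "at least 3 subsequent gate runs"


def next_counter(
    counter: int,
    method_last_modified_sha: Optional[str],
    observed: bool,
    in_flaky_xml: bool,
) -> int:
    """Return the clean-run counter after one gate run."""
    if method_last_modified_sha is None:
        return 0  # no fix attempt yet: re-runs alone never advance the counter
    if not observed:
        return counter  # assumption: skipped/cancelled job leaves the counter as-is
    if in_flaky_xml:
        return 0  # any Maven retry means the fix did not hold
    return counter + 1  # clean run (Def-2)


def is_cleared_via_fix(counter: int) -> bool:
    return counter >= REQUIRED_CLEAN_RUNS


counter = 0
for _ in range(3):
    counter = next_counter(counter, "def5678", observed=True, in_flaky_xml=False)
print(counter, is_cleared_via_fix(counter))  # 3 True
```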

What developers see

The gate maintains a single PR comment, identified by a hidden marker. Its content reflects the current state.

Active alert

# ⚠️ New Flaky Tests Detected

This PR introduces **1 new flaky test(s)** that are not currently
flaky on `main`, `stable/*`, or in any other open PR.

- **shouldFlushBatchWhenFull**
  - Jobs: `general-unit-tests`
  - Package: `io.camunda.exporter.appint.subscription`
  - Class: `SubscriptionTest`
  - State:
    - First flagged at: `abc1234`
    - Method last modified at: — (no fix detected)
    - Clean re-runs since fix: 0 / 3
    - Last observed: 2026-06-03 09:42:00 UTC

---

**What to do:**
1. Fix the flaky test method, push the commit, then let CI run 3 times.
2. If unrelated to your changes: add the `ci:flaky-test-bypass` label.

Active alert after a fix attempt

  - State:
    - First flagged at: `abc1234`
    - Method last modified at: `def5678`
    - Clean re-runs since fix: 2 / 3
    - Last observed: 2026-06-03 10:21:00 UTC

The counter advances each time CI runs and the test passes cleanly in the originally affected job.

Mixed: some active, some cleared

# ⚠️ New Flaky Tests Detected
...
- **stillBroken** ...

<details>
<summary>1 cleared test(s) (history)</summary>

- ~~**alreadyFixed**~~
  - ~~Class: `SomeIT`~~
  - ~~cleared on `def5678` after 3 clean re-runs~~

</details>

All clear

# ✅ Cleared — No outstanding new flakes

All previously flagged tests cleared via fix + 3 clean re-runs, or via
`ci:flaky-test-bypass` label.

<details>
<summary>Previous warning</summary>
... strikethrough history ...
</details>

The "Cleared" headline is intentionally distinct from the legacy "Resolved" word — the old "Resolved" template fired on any retry-luck pass, and reusing it would conflate the two behaviors.

When the gate runs

The job runs on pull_request events from non-fork repos, with these exclusions:

- PRs from monorepo-devops-automation[bot] (backport bot).
- PRs from renovate[bot]: dependency bumps don't introduce flakes, and any flake they hit is pre-existing infra noise. See #53683 for the rationale.
- PRs whose head_ref starts with backport (manually-named backports).

PRs that humans open directly against stable/* branches still run the gate.
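The trigger conditions above can be expressed as a single predicate. This is a sketch only: in the real workflow these checks presumably live in the job's `if:` condition in ci.yml, and the function and parameter names here are hypothetical.

```python
# Bot accounts whose PRs skip the gate, as listed above.
EXCLUDED_AUTHORS = {"monorepo-devops-automation[bot]", "renovate[bot]"}


def gate_should_run(event_name: str, is_fork: bool, author: str, head_ref: str) -> bool:
    """Return True if the flaky test gate applies to this CI run."""
    if event_name != "pull_request" or is_fork:
        return False
    if author in EXCLUDED_AUTHORS:
        return False
    if head_ref.startswith("backport"):
        return False  # manually-named backports
    return True


print(gate_should_run("pull_request", False, "some-dev", "fix-flaky-it"))   # True
print(gate_should_run("pull_request", False, "renovate[bot]", "renovate/x"))  # False
```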

Common scenarios

A. PR introduces a genuine new flake

1. CI runs; test flakes; bot posts active alert. clean_runs_since_modified: 0 / 3.
2. Author edits the test method (or the production code it exercises), pushes commit.
3. Gate detects the modification via git log -L; Method last modified at updates.
4. Next CI runs; test passes cleanly; counter advances 0 → 1 → 2 → 3.
5. At counter 3, status flips to cleared_via_fix; comment becomes "✅ Cleared".

B. PR triggers an unrelated flake

1. CI runs; test flakes; bot posts active alert.
2. Author concludes the flake is pre-existing infra noise.
3. Author opens a kind/flake issue documenting the test, job, and run.
4. Author applies the ci:flaky-test-bypass label.
5. Next CI runs; all active entries → cleared_via_bypass; comment becomes "✅ Cleared".

C. Author tries to merge by re-running CI repeatedly

1. CI runs; test flakes; bot posts active alert. Method last modified at: —.
2. Author hits "Re-run jobs" three times without touching the code.
3. Whether a re-run passes on retry or passes cleanly, the counter stays at 0: without a method modification, it does not increment no matter how many clean runs accumulate.
4. Alert remains active; merge stays blocked.
5. Author has to either actually fix the test or apply the bypass label.

D. Author force-pushes to rewrite history

1. CI runs on commit X; test flakes; entry stored with first_flagged_sha: X.
2. Author rebases the branch; new HEAD is Y; commit X is no longer reachable.
3. Next CI run: gate re-evaluates against the new history.
   - If the offending code is still present and the test still flakes → entry stays active; counter logic continues with a new anchor SHA via git log -L against merge-base(base_ref, HEAD)..HEAD.
   - If the offending code was dropped from the new history AND the test isn't flaking → entry becomes dropped_force_push; alert clears.

There's no way to force-push to silence an active flake: the gate always re-evaluates and re-detects the modification (or its absence).
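The force-push decision can be summarised as a small pure function. This is a hedged sketch: the three boolean inputs are hypothetical names, and how the detector actually determines each one (for example, whether the offending code is still present) is not specified here.

```python
def force_push_outcome(
    flagged_sha_reachable: bool,
    code_still_present: bool,
    flaked_this_run: bool,
) -> str:
    """Return the entry status after the gate re-evaluates rewritten history."""
    if flagged_sha_reachable:
        return "active"  # history was not rewritten; normal rules keep applying
    if not code_still_present and not flaked_this_run:
        return "dropped_force_push"  # offending code gone and test is quiet
    return "active"  # still flaking or still present: alert stays


print(force_push_outcome(False, False, False))  # dropped_force_push
print(force_push_outcome(False, True, True))    # active
```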

Workflow integration

The detect-new-flaky-tests job is defined in .github/workflows/ci.yml. The job stays thin: it computes the run-specific context (bypass label, ran-jobs JSON from needs.*.result, BigQuery baseline) and hands it to the detect-new-flaky-tests composite action, which owns all sticky-state plumbing. Key implementation details:

- Checkout uses fetch-depth: 0 because git log -L requires full history.
- The state artifact is named flaky-gate-state-pr-<PR_NUMBER> and has 30-day retention. The action downloads the latest non-expired artifact via gh api, mutates the JSON, and re-uploads it with overwrite: true, so the workflow no longer wires state files in or out.
- The no-op short-circuit lives in the detector: when there is no prior state, no new flakes this run, and no bypass label, it returns without posting a comment or writing state, so clean PRs are left untouched.
- The hidden comment marker points to the artifact for cross-reference.
- The BigQuery baseline is queried only when this run produced new flakes. The query lives in the workflow step "Query known flaky tests from BigQuery."
- blocking is 'true' by default, so sticky entries fail check-results.
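Two of these details, the no-op short-circuit and the blocking flag, reduce to simple predicates. The sketch below uses hypothetical function names and assumes the state is a dict shaped like the state artifact's JSON; it is not the detector's actual code.

```python
from typing import Optional


def is_noop(prior_state: Optional[dict], new_flakes: list, bypass_label: bool) -> bool:
    """True when the detector can return without commenting or writing state."""
    return prior_state is None and not new_flakes and not bypass_label


def gate_fails(state: Optional[dict], blocking: bool = True) -> bool:
    """True when check-results should fail because a sticky entry is still active."""
    if not blocking or state is None:
        return False
    return any(test["status"] == "active" for test in state["tests"])


state = {"tests": [{"status": "active"}, {"status": "cleared_via_fix"}]}
print(is_noop(None, [], False))            # True
print(gate_fails(state))                   # True
print(gate_fails(state, blocking=False))   # False
```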

State schema

{
  "schema_version": 1,
  "pr_number": 54375,
  "last_known_head_sha": "ad7705f...",
  "last_updated_at": "2026-06-03T11:42:00Z",
  "tests": [
    {
      "key": "io.camunda.it.auth.SomeIT.shouldDoX",
      "package": "io.camunda.it.auth",
      "class_name": "SomeIT",
      "method_name": "shouldDoX",
      "file_path": "qa/integration/.../SomeIT.java",
      "first_flagged_sha": "ad7705f...",
      "flagged_jobs": ["identity-tests/identity-tests - elasticsearch9"],
      "method_last_modified_sha": null,
      "clean_runs_since_modified": 0,
      "last_observed_sha": "ad7705f...",
      "last_observed_at": "2026-06-03T11:42:00Z",
      "status": "active",
      "cleared_at": null
    }
  ]
}

Statuses: active, cleared_via_fix, cleared_via_bypass, dropped_force_push.
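As an illustration of how the schema can be consumed, the following sketch reads a state document and produces a one-line summary per test. The `summarize` helper is hypothetical, and the sample is a trimmed copy of the schema above with only the fields the helper needs.

```python
import json

REQUIRED_CLEAN_RUNS = 3


def summarize(state: dict) -> list[str]:
    """Return one human-readable line per tracked test."""
    lines = []
    for test in state["tests"]:
        label = f'{test["class_name"]}.{test["method_name"]}'
        if test["status"] == "active":
            runs = test["clean_runs_since_modified"]
            lines.append(f"{label}: active, {runs} / {REQUIRED_CLEAN_RUNS} clean re-runs")
        else:
            lines.append(f'{label}: {test["status"]}')
    return lines


state = json.loads("""
{
  "schema_version": 1,
  "tests": [
    {
      "class_name": "SomeIT",
      "method_name": "shouldDoX",
      "clean_runs_since_modified": 0,
      "status": "active"
    }
  ]
}
""")
print(summarize(state))  # ['SomeIT.shouldDoX: active, 0 / 3 clean re-runs']
```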

BigQuery baseline query

The baseline returns every test that produced an unreliable result (flaky, failure, or error) on main, stable/*, in scheduled nightly runs, or in any other open PR over the last 20 days. PR runs are included; the current PR's own observations are excluded.

SELECT DISTINCT test_class_name, test_name
FROM   `ci-30-162810.prod_ci_analytics.test_status_v1` ts
LEFT OUTER JOIN `ci-30-162810.prod_ci_analytics.build_status_v2` bs
  ON  ts.ci_url = bs.ci_url
  AND ts.build_id = bs.build_id
  AND ts.job_name = bs.job_name
  AND (
    (bs.build_trigger = "merge_group" AND (bs.build_base_ref = "refs/heads/main" OR bs.build_base_ref LIKE "refs/heads/stable/%"))
    OR (bs.build_trigger = "push"       AND (bs.build_ref      = "refs/heads/main" OR bs.build_ref      LIKE "refs/heads/stable/%"))
    OR (bs.build_trigger = "schedule"   AND (bs.build_ref      = "refs/heads/main" OR bs.build_ref      LIKE "refs/heads/stable/%"))
    OR (bs.build_trigger = "pull_request" AND bs.build_ref != @pr_ref)
  )
WHERE  ts.report_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 20 DAY)
  AND  ts.ci_url = "https://github.com/camunda/camunda"
  AND  ts.test_status IN ("flaky","failure","error")
  AND  bs.ci_url IS NOT NULL

The trigger/ref filters live in the ON clause, not WHERE. Putting them in WHERE (the prior shape) silently turned the LEFT JOIN into an INNER JOIN whenever build_status_v2 ingestion lagged test_status_v1, dropping known flakes from the baseline and producing false-positive alerts. The current shape preserves rows even when the matching build-status row hasn't ingested yet; the bs.ci_url IS NOT NULL clause then keeps only matched rows for the comparison.
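The general ON-versus-WHERE behaviour can be reproduced with a toy example in SQLite. This illustrates only the join semantics and is not the production query: the table and column names are simplified stand-ins, and build 2 deliberately has no build-status row to simulate ingestion lag.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE test_status (build_id INTEGER, test_name TEXT);
CREATE TABLE build_status (build_id INTEGER, build_trigger TEXT);
INSERT INTO test_status VALUES (1, 'shouldDoX'), (2, 'shouldDoY');
-- build 2 has no build_status row yet (simulated ingestion lag)
INSERT INTO build_status VALUES (1, 'push');
""")

# Filter in ON: unmatched left rows survive, with NULLs on the right side.
on_rows = conn.execute("""
SELECT ts.test_name, bs.build_trigger
FROM test_status ts
LEFT JOIN build_status bs
  ON ts.build_id = bs.build_id AND bs.build_trigger = 'push'
ORDER BY ts.test_name
""").fetchall()

# Filter in WHERE: the NULL right side fails the predicate, so the row is dropped.
where_rows = conn.execute("""
SELECT ts.test_name, bs.build_trigger
FROM test_status ts
LEFT JOIN build_status bs ON ts.build_id = bs.build_id
WHERE bs.build_trigger = 'push'
ORDER BY ts.test_name
""").fetchall()

print(on_rows)     # [('shouldDoX', 'push'), ('shouldDoY', None)]
print(where_rows)  # [('shouldDoX', 'push')]
```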

Method modification detection

For each active entry, the detector runs:

git log --format=%H -L:<method_name>:<file_path> <merge-base(base, HEAD)>..HEAD

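Git's -L :funcname:file matcher locates the method using git's function-boundary detection, which is Java-aware when the java diff driver applies to the file (for example through a *.java diff=java entry in .gitattributes). The first SHA in the output is the most recent commit that modified the method body within the range. None (no SHA in the output) means no modification; the counter stays at 0.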
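A minimal sketch of running this command from Python and extracting the first SHA is shown below. The helper names are hypothetical, and treating a non-zero git exit code as "no modification" is an assumption rather than documented detector behaviour. Because git log -L may print diff hunks alongside each commit, the parser only accepts lines that are exactly a 40-character hex SHA.

```python
import re
import subprocess
from typing import Optional

SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def first_sha(git_output: str) -> Optional[str]:
    """Return the first full SHA line in the output, or None if there is none."""
    for line in git_output.splitlines():
        candidate = line.rstrip()  # diff context lines start with a space, so they never match
        if SHA_RE.match(candidate):
            return candidate
    return None


def method_last_modified_sha(
    method_name: str, file_path: str, merge_base: str, repo_dir: str = "."
) -> Optional[str]:
    """Most recent commit in merge_base..HEAD that modified the method, if any."""
    cmd = [
        "git", "log", "--format=%H",
        f"-L:{method_name}:{file_path}",
        f"{merge_base}..HEAD",
    ]
    result = subprocess.run(cmd, cwd=repo_dir, capture_output=True, text=True)
    if result.returncode != 0:
        return None  # assumption: git errors are treated as "no modification"
    return first_sha(result.stdout)
```

Calling `method_last_modified_sha("shouldDoX", "path/to/SomeIT.java", "<merge-base sha>")` inside a full-history checkout would return the newest matching commit, or None if the method was not modified in the range.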

Known limitations:

- Java overloads sharing a method name are treated as one unit. Modifying any overload counts as "fix attempted."
- Renamed / moved tests (changed class or file) are not followed. Use the bypass label.

Operating the gate

Monitoring

- The gate's per-PR state is in the artifact flaky-gate-state-pr-<PR_NUMBER> (run summary → Artifacts).
- The PR comment with the active status and counter is the user-facing summary.
- Job logs include [new-flaky] debug lines: baseline size, per-test decision, counter movement, drop reasons.

Target false-positive rate

The design target is ≤ 5% false positives over a 14-day rolling window. A "false positive" here means: the gate flagged a test that was actually known-flaky on main/stable/* or in another open PR within the prior 20 days, but didn't appear in the baseline. The most likely cause is BigQuery ingestion lag for very recent flake observations on sibling PRs.

To replay a single PR's baseline query at a chosen cutoff (to classify an alert post-hoc), see the "Running locally" section in the action's README.

When the gate fires on your PR

1. Look at the comment first. It tells you which test, which job, and how many clean re-runs you have.
2. If the test name is from your diff, you introduced it: fix the method and push. The next 3 clean CI runs will clear it.
3. If you didn't touch anything related, it's a pre-existing flake that just hit you. Apply ci:flaky-test-bypass (after opening a kind/flake issue) and re-run CI.
4. If you're convinced it's a false positive from the gate itself (e.g. the test has clearly flaked on main recently), confirm with the BigQuery replay query and reach out to #ask-monorepo-devops. The bypass label always works as an escape hatch.

Disabling the gate temporarily

Single-line revert in .github/workflows/ci.yml:

-          blocking: 'true'
+          blocking: 'false'

This keeps the sticky logic and comments but stops the gate from failing check-results. Useful if the gate misbehaves at scale and you need to unblock everyone while the root cause is investigated. The state artifact and comment continue to update normally.

History and related work

- Initial gate introduced in #50691: "feat: Add logic to filter out new flaky tests."
- 20-day baseline + PR-runs inclusion added in #53121.
- Renovate exclusion added in #53683. A 6-day audit showed 0 Renovate alerts post-merge, so the exclusion is fully effective.
- Sticky alerts + BQ LEFT-JOIN fix added in #54579. This closes the retry loophole and the ingestion-lag false-positive source.

Related documentation

- Action README: implementation deep-dive, all inputs/outputs, local-run instructions.
- CI & Automation › Flaky tests: broader context on how Maven test reruns and flaky-test-extractor classify results.
- CI Runbooks: what to do when CI alerts fire.
- CI Health metrics: the BigQuery analytics layer the gate queries.
