Processes

This page collects processes we follow in the camunda/camunda monorepo.

Renovate PR Handling

We use Renovate to automate dependency updates in the camunda/camunda monorepo. However, not all updates can be automatically merged as-is due to breaking changes or adjustments needed to the code base.

We identify cases where manual intervention is needed by looking at open Renovate Pull Requests which are older than 7 days.
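The age check above can be sketched as a small filter. This is a minimal illustration, not the actual assignment script: the `PullRequest` type, its field names, and the `renovate[bot]` author login are assumptions made for the example, and "older than 7 days" is assumed to be measured from the PR creation time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Threshold from the text: open Renovate PRs older than 7 days.
STALE_AFTER = timedelta(days=7)


@dataclass
class PullRequest:
    """Hypothetical, simplified view of a pull request."""
    number: int
    author: str           # assumed login of the Renovate bot, e.g. "renovate[bot]"
    created_at: datetime  # assumed to be timezone-aware
    is_open: bool


def needs_manual_intervention(pr: PullRequest, now: datetime,
                              bot_login: str = "renovate[bot]") -> bool:
    """Return True for open Renovate PRs created more than 7 days before `now`."""
    return (
        pr.is_open
        and pr.author == bot_login
        and (now - pr.created_at) > STALE_AFTER
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    example = PullRequest(1, "renovate[bot]", now - timedelta(days=10), True)
    print(needs_manual_intervention(example, now))
```

Passing `now` in explicitly keeps the check deterministic, which makes it easy to test without touching the GitHub API.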

We assign an engineer as DRI (directly responsible individual) to each open Renovate PR that needs manual intervention. The engineer is chosen based on expertise, e.g. backend engineers for backend dependencies.

The DRI is also mentioned on the Renovate PR with a link to the responsibilities below. They are reminded 3 weeks after their assignment, at which point the PR counts as delayed.
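The 3-week reminder rule can be sketched as a simple time comparison. The function name and the idea of storing an `assigned_at` timestamp are assumptions for illustration; the actual script may track the assignment differently.

```python
from datetime import datetime, timedelta, timezone

# From the text: the DRI is reminded 3 weeks after assignment ("delayed").
REMINDER_AFTER = timedelta(weeks=3)


def is_delayed(assigned_at: datetime, now: datetime) -> bool:
    """Return True once at least 3 weeks have passed since the DRI assignment."""
    return (now - assigned_at) >= REMINDER_AFTER


if __name__ == "__main__":
    assigned = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(is_delayed(assigned, datetime(2024, 1, 25, tzinfo=timezone.utc)))
```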

We have GitHub teams of engineers grouped by expertise:

If you spot errors in the grouping, reach out to the Monorepo DevOps team!

Overview

This GH project board gives an overview of the Renovate PRs that need manual intervention, including those that are assigned or already delayed.

To help spot a potential overload of one group, the board offers views that break the PRs down by expertise.

Automation

We use a daily GitHub Actions workflow to execute the assignment logic script automatically. Actions and PR comments are done via a GitHub App.

DRI Responsibilities

The DRI for an open Renovate PR is responsible for addressing the dependency upgrade and getting it merged in a timely manner. They are expected to:

  • Gather the necessary information from the changelog and from the usages in our code base.
    • Reach out to your peers via #team-orchestration-cluster on Slack if you have further questions.
  • Adjust the code base to accommodate dependencies with breaking changes, ensuring green CI and passing tests.
    • Minor and patch updates should almost always be resolved in this way, without an extra ticket.
  • Explore AI-assistance to get breaking changes fixed.
  • If needed, create follow-up tickets for major upgrades or big refactoring tasks, and ensure that those tickets are scheduled within their team's planning.
  • Improve our Renovate configuration to make updates as smooth as possible, e.g. by:
    • Applying appropriate grouping of similar dependencies (all React components, similar Maven plugins) to reduce the number of PRs.
    • Allowing auto-merging where possible.
  • Complete the above steps within 3 weeks of the DRI assignment.
    • Reach out to your manager to adjust priorities or find a replacement if the above timeline is not possible.

Monorepo CI Medic

Purpose: Be the central point-of-contact for problems and alerts with the CI & automation around the C8 monorepo and drive resolution of incidents based on their severity.

Contact: Mention @monorepo-ci-medic in #top-monorepo-ci on Slack.

Response times: as quickly as possible

Handovers: weekly sync meeting with handover notes

Responsibilities & Process

React to alerts in #monorepo-ci-alerts

Triage problems reported in #top-monorepo-ci

  • React with 👀 on Slack threads that ping you to show that you are aware of them and checking, and also on resolved threads.
  • Verify/reproduce the reported problem to make sure it is not a fluke or misunderstanding:

Unhandled FOSSA Licensing Issues reported in #top-monorepo-ci

  • React with 👀 on Slack threads that ping you to show that you are aware of them and checking, and also on resolved threads.
  • Make sure to read the internal documentation: FOSS in Camunda 8: FOSSA (and the Handle License Issues part specifically)
  • Follow the Playbook for the first actions to take, including:
    • Curate issues and eliminate false positives.
    • Multi-licensed components
    • Other low-complexity cases with known outcomes
  • Escalate to Legal when license terms or obligations require clarification
    • Open a Jira ticket directly from the corresponding FOSSA issue
    • e.g. issue & Jira ticket
  • Delegate to the responsible team medic when Engineering actions are needed. Use one of the four medics:
    • @core-features-medic (Operate, Optimize, Tasklist)
    • @zeebe-medic (Zeebe Engine)
    • @data-layer-medic (Elasticsearch, OpenSearch, RDBMS layer)
    • @identity-medic (Identity)
    • If you're not sure about which medic to choose, pick the one that makes most sense to you. They will reroute you to the right one.
  • When delegating:
    • State the action needed: clarify the distribution scope, remove or replace a dependency, etc. (example: OJDBC dependency)

Drive resolution of incidents

  • Follow the CI incident management process.
  • Debug problems at the GitHub Actions level yourself, and involve the stakeholder teams (via their medic) or subject-matter experts for advice on technical details in certain sub-areas:
    • CI Knowledge Base: https://camunda.github.io/camunda/ci
    • Core Features: @core-features-medic on Slack (e.g. issues with Operate, Optimize, Tasklist)
    • Core Foundations:
      • Data Layer: @data-layer-medic on Slack (e.g. issues with OpenSearch/Elasticsearch tests)
      • Identity: @identity-medic on Slack (e.g. issues with Identity and Identity Management)
      • Zeebe: @zeebe-medic on Slack
    • Self-Managed:
      • Distro: @distro-medic on Slack (e.g. Helm chart integration tests)
    • Infra: @infra-medic on Slack (e.g. self-hosted runner problems, Vault issues)
  • Try to identify a (limited) workaround to unblock users.

CI Incident Management

This section is for the camunda/camunda monorepo CI. See all available incident types.

Definitions

Incident

By incidents we refer to problems affecting the C8 Monorepo CI that need to be managed in a structured way.

An event is an incident if the answer to any of the following questions is yes:

  • Is a C8 release blocked or delayed?
  • Are Camundi blocked from creating and merging PRs (unrelated to their code)?
  • Is it a large-scale problem affecting many engineers/branches/CI job executions?
  • Is the issue unresolved even after an hour of concentrated analysis?
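The "any of the following" rule can be expressed as a simple predicate. This is only a sketch: the `Event` type and its field names are hypothetical, and each flag is assumed to be judged by a person rather than derived automatically.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical checklist mirroring the four questions above."""
    release_blocked_or_delayed: bool = False
    prs_blocked_unrelated_to_code: bool = False
    large_scale_problem: bool = False
    unresolved_after_one_hour: bool = False


def is_incident(event: Event) -> bool:
    """An event is an incident as soon as one criterion is met."""
    return any((
        event.release_blocked_or_delayed,
        event.prs_blocked_unrelated_to_code,
        event.large_scale_problem,
        event.unresolved_after_one_hour,
    ))


if __name__ == "__main__":
    print(is_incident(Event(large_scale_problem=True)))
```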

Severity

We re-use the generic severity levels from IM - Service Levels and Criteria with the following meanings:

  • L1: Problem blocks the Unified CI workflow on main or stable/* or in the merge queue to main or stable/* branches or blocks release workflows
    • E.g. no PR mergeable (e.g. 1, 2, 3), 3+ consecutive executions of the same job failing incl. retries, external service unreachable
  • L2: Problem degrades the Unified CI workflow on main or stable/* or in the merge queue to main or stable/* branches
    • E.g. 3+ consecutive executions of same job taking 50%+ longer, sporadic timeouts, sporadic build failures, external service degraded (e.g. 1)
  • L3: Used for tracking a prominent flaky test or slow Unified CI job until resolution
    • E.g. same test has been flaky in dozens of builds over 1 or multiple days, Unified CI job is slower than runtime SLO and needs speedup
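One way to read the three levels is as an ordered check that picks the most severe matching level. The sketch below assumes the three boolean inputs have already been assessed by a person; the function and parameter names are illustrative and not part of any existing tooling.

```python
from enum import Enum
from typing import Optional


class Severity(Enum):
    L1 = "L1"  # blocks Unified CI, the merge queue, or release workflows
    L2 = "L2"  # degrades Unified CI or the merge queue
    L3 = "L3"  # tracks a prominent flaky test or slow Unified CI job


def classify(blocks_ci_or_release: bool,
             degrades_ci: bool,
             tracks_flaky_or_slow_job: bool) -> Optional[Severity]:
    """Return the most severe matching level, or None if nothing applies."""
    if blocks_ci_or_release:
        return Severity.L1
    if degrades_ci:
        return Severity.L2
    if tracks_flaky_or_slow_job:
        return Severity.L3
    return None


if __name__ == "__main__":
    print(classify(False, True, True))
```

Checking L1 first matches the advice to be pessimistic when estimating severity: if several descriptions fit, the higher level wins.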

Roles

From IM - Roles and Responsibilities:

  • Incident Commander (IC)
  • Communications Lead (CL)
  • Operations Lead (OL)

Steps

Effective incident management is key to limiting the disruption caused by an incident and restoring normal business operations as quickly as possible.

1. Identification

An incident can be identified by any engineer in the C8 or Infrastructure domain via:

Once an incident is identified, the Monorepo CI Medic becomes the incident commander (IC). The IC holds all roles until they delegate to someone else.

Incident Command
  1. Create an incident as described in IM - Lifecycle:
    • Quickly estimate the severity of the incident. Be pessimistic; we can always downgrade.
  2. Record everything in the incident channel and pin a message to persist the contents on the timeline.
  3. Take a quick triage of the incident, and remember that you don't need to solve this yourself!
    • Incidents for which no known fix or mitigation can be applied and which fall into the responsibility of another team should be handed over to them as soon as possible. In this case, they become IC, and Monorepo CI Medic stays on standby to support and apply necessary changes to the C8 Monorepo CI.
    • If you feel overwhelmed, get others involved quickly, especially for L1 incidents. In particular, consider reaching out to the push engineer: in case of a human-triggered incident, this person often has the most context and should be involved as quickly as possible.

2. Response

Incidents are resolved according to the Engineering Incident Management Process by the Incident Response Team. Helpful resources:

The Communications Lead should send out periodic updates depending on the severity of the incident:

  • L1: Once per hour until resolved
  • L2: Once per day until resolved
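The update cadence can be sketched as a lookup from severity to interval. The names below are illustrative; since the text lists no cadence for L3, the sketch returns `None` for it rather than inventing one.

```python
from datetime import datetime, timedelta
from typing import Optional

# Intervals from the list above; L3 has no stated cadence.
UPDATE_INTERVAL = {
    "L1": timedelta(hours=1),
    "L2": timedelta(days=1),
}


def next_update_due(severity: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next status update is due, or None if no cadence is defined."""
    interval = UPDATE_INTERVAL.get(severity)
    if interval is None:
        return None
    return last_update + interval


if __name__ == "__main__":
    print(next_update_due("L1", datetime(2024, 1, 8, 9, 0)))
```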

3. Follow-Up

Incident Command
  1. Mark the incident as Resolved (see IM - Lifecycle) and notify stakeholders.
  2. The Monorepo CI Medic is responsible for ensuring the post-incident activities happen:
    • Follow the IM - Post Mortem procedures, especially Create and schedule a Post Mortem if needed.
    • The post-mortem meeting (mandatory for L1 severity, otherwise optional) should take place no more than 5 working days after the incident is resolved. Where possible, an asynchronous post-mortem is recommended. Use the example meeting invite to schedule early; you can always use the slot for root-cause hunting if the incident is not ready for review.
    • Assign due dates and assignees to any actionable tasks from the post-mortem meeting.
    • Mark the incident as Closed (see IM - Lifecycle).
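To make the 5-working-day limit for the post-mortem meeting concrete, the latest date can be computed as follows. This sketch assumes working days are Monday to Friday and ignores public holidays; the function names are hypothetical.

```python
from datetime import date, timedelta

POST_MORTEM_WORKING_DAYS = 5  # limit stated in the text


def add_working_days(start: date, days: int) -> date:
    """Advance `start` by `days` working days, skipping Saturdays and Sundays."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


def post_mortem_deadline(resolved_on: date) -> date:
    """Latest date for the post-mortem meeting after resolution."""
    return add_working_days(resolved_on, POST_MORTEM_WORKING_DAYS)


if __name__ == "__main__":
    # An incident resolved on Monday 2024-01-08 gives a deadline of Monday 2024-01-15.
    print(post_mortem_deadline(date(2024, 1, 8)))
```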