
REST API and OIDC

· 9 min read
Christopher Kujawa
Principal Software Engineer @ Camunda
Pranjal Goyal
Senior Software Engineer @ Reliability Testing Team
Jonathan Ballet
Senior Software Engineer @ Reliability Testing Team

Over the past weeks, we have been spending more time improving our load testing and reliability testing coverage. One of the things we did was to enable the REST API (by default, we tend to use gRPC).

While doing so, we observed a weird load pattern. It seems to occur when the REST API is enabled in our load tester clients together with OIDC.

On today's Chaos Day, we wanted to verify how the system behaves when using the REST API and OIDC together, and how this changes under different loads and versions. We also validated whether this was related to the cluster configuration (testing with SaaS).

TL;DR: We saw recurring throughput drops, especially at higher load (300 PIs); at lower load they were not visible. The issue was reproducible in 8.8 as well, so it was not related to the changes in 8.9. We couldn't reproduce the pattern in SaaS, as we weren't able to achieve the same load with the small clusters we used. While experimenting, we discovered several areas for improvement. The root cause turned out to be JWT tokens expiring while requests queued in the Apache HttpAsyncClient connection pool. Nic fixed this by moving token injection to after connection acquisition via #50124 🚀

rest-bug
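The failure mode from the TL;DR can be illustrated with a small, purely illustrative Python sketch (the real clients are Java; the token lifetime and wait time below are made up):

```python
# Illustrative model of the bug: a token injected when the request is
# enqueued can expire while the request waits for a pooled connection.
TOKEN_TTL = 30.0  # seconds; made-up lifetime, not Camunda's actual setting

def issue_token(now):
    """Issue a fake token with an absolute expiry time."""
    return {"expires_at": now + TOKEN_TTL}

def send_request(queue_wait, inject_at_enqueue):
    """Return True if the request reaches the server with a valid token.

    queue_wait models time spent waiting for a pooled connection."""
    t = 0.0
    token = issue_token(t) if inject_at_enqueue else None
    t += queue_wait  # request sits in the connection pool queue
    if token is None:
        token = issue_token(t)  # the fix: inject after connection acquisition
    return t < token["expires_at"]

wait = 45.0  # under pool saturation, the wait can exceed the token lifetime
print(send_request(wait, inject_at_enqueue=True))   # False: expired in queue
print(send_request(wait, inject_at_enqueue=False))  # True: still fresh
```

Once the wait exceeds the token lifetime, every request injected at enqueue time arrives with an expired token, which matches the recurring drops we saw only at higher load.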

C8 on ECS: Simulate loss of lease

· 5 min read
Deepthi Akkoorath
Principal Software Engineer @ Camunda
Rodrigo Lopes
Associate Software Engineer @ Zeebe

With 8.9, we support C8 deployments on ECS. Camunda 8 is originally designed for Kubernetes StatefulSets, where each broker has a stable identity and disk. On Amazon ECS, tasks are ephemeral: IPs and container instances change frequently, and you rely on external storage like EFS and S3 instead of node-local disks.

To make this work safely, the Camunda 8 ECS reference architecture introduces a dynamic NodeIdProvider backed by Amazon S3. Each ECS task:

  • Competes for a lease stored in S3 that represents a specific logical broker node ID.
  • When it acquires the lease, it becomes that broker and uses a dedicated directory on shared EFS for its data.
  • Periodically renews the lease; if renewal fails or preconditions are violated, the task shuts down immediately to avoid corrupting data or having two brokers think they own the same node.

In this experiment, we explore what happens when a broker loses its S3-backed NodeId lease and another broker acquires it. We simulate that scenario by artificially overwriting the lease object in S3 to represent a new owner and then observing how the original holder reacts.

Goal

Hypothesis:
If the S3 lease for a node ID is lost by the task, the NodeIdProvider should:

  • Detect the inconsistency via conditional writes,
  • Refuse to renew the lease,
  • Shut the broker down cleanly so that ECS can replace it with a fresh task that acquires a new, valid lease.

Setup

  • Camunda 8 (Zeebe) on AWS ECS Fargate
  • 3 brokers, 3 partitions
  • Shared data on EFS
  • NodeIdProvider using S3 leases:
    • One object per logical node (e.g. 2.json)
    • Metadata carries the task id, version, and acquirable flag
    • Object body holds the lease payload (node id, version, known version mappings, timestamp)

Before the experiment, the S3 object for node 2 looked like this.

Metadata:

"Metadata": {
  "taskid": "0afffc8d-3807-46cb-9a2e-3f65f96d2acb",
  "version": "2",
  "acquirable": "true"
}

Payload:

{
  "taskId": "0afffc8d-3807-46cb-9a2e-3f65f96d2acb",
  "timestamp": 1774433501584,
  "nodeInstance": { "id": 2, "version": 2 },
  "knownVersionMappings": {
    "mappingsByNodeId": {
      "0": 2,
      "1": 3,
      "2": 2
    }
  }
}

This represents broker node 2, version 2, with a lease that is currently acquirable.

Injecting failure: overwriting the lease in S3

To simulate the loss of the lease, we modified the timestamp and the taskId in the current 2.json object and overwrote the object.

aws s3api put-object \
--bucket "dev-chaos-day-oc-bucket" \
--key "2.json" \
--body 2.json \
--metadata "version=2,acquirable=true,taskId=abc"

From the broker’s point of view, the lease it thought it owned has now been rewritten by someone else. In a real cluster, this situation should not occur as long as the current holder keeps renewing its lease within the configured lease duration. The overwrite here is artificial and is meant to simulate a scenario where the current holder has stopped renewing, and another broker has legitimately acquired the lease in the meantime.

What we observed in the logs

Shortly after the overwrite, the task assuming the role of node 2 started logging S3 errors during lease renewal:

  • S3 precondition failure (HTTP 412) while trying to acquire/renew the lease:
    • S3Exception: At least one of the pre-conditions you specified did not hold (Status Code: 412)
  • The NodeIdProvider logs clearly indicate:
    • “Failed to renew the lease: process is going to shut down immediately.”
    • “NodeIdProvider terminating the process.”

Once the NodeIdProvider decides the lease can’t be renewed safely, the broker begins a controlled shutdown. From the outside, this looks like a broker failure triggered by lease validation logic, not by ECS itself.

March 25, 2026, 11:20
[2026-03-25 10:20:41.663] [NodeIdProvider] WARN io.camunda.zeebe.dynamic.nodeid.RepositoryNodeIdProvider - Failed to renew the lease: process is going to shut down immediately. software.amazon.awssdk.services.s3.model.S3Exception: At least one of the pre-conditions you specified did not hold ...
March 25, 2026, 11:20
[2026-03-25 10:20:41.663] [NodeIdProvider] WARN io.camunda.zeebe.broker.NodeIdProviderConfiguration - NodeIdProvider terminating the process
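The precondition mechanism behind this 412 can be sketched with a small in-memory model (illustrative only; the real implementation uses S3 conditional writes keyed on the lease object's ETag):

```python
import uuid

class LeaseStore:
    """In-memory stand-in for the S3 object holding a node's lease."""

    def __init__(self):
        self.etag = None   # None models "object does not exist yet"
        self.lease = None

    def conditional_put(self, lease, if_match):
        """Succeeds only if the caller's expected ETag matches the stored one,
        mimicking an S3 conditional write."""
        if if_match != self.etag:
            raise RuntimeError("412: At least one of the pre-conditions "
                               "you specified did not hold")
        self.etag = uuid.uuid4().hex
        self.lease = lease
        return self.etag

    def overwrite(self, lease):
        """Unconditional put, like the manual aws s3api put-object above."""
        self.etag = uuid.uuid4().hex
        self.lease = lease
        return self.etag

store = LeaseStore()
etag = store.conditional_put({"taskId": "task-a"}, if_match=None)  # acquire
store.overwrite({"taskId": "abc"})      # someone rewrites the lease behind us
try:
    store.conditional_put({"taskId": "task-a"}, if_match=etag)     # renew
except RuntimeError as err:
    print("Failed to renew the lease:", err)  # broker shuts down at this point
```

As long as nobody else touches the object, renewals succeed and produce a fresh ETag; the moment the lease is rewritten externally, the holder's next conditional write fails and the safe response is to terminate.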

Replacement task and new lease

ECS notices that the service is now below the desired task count and starts a replacement task for the orchestration cluster:

  1. The old task transitions to stopped.

  2. A new ECS task is started for the same service.

  3. On startup, the new task:

    • Acquires a new S3 lease for node 2 with version 3:
      • New taskid in metadata (the new ECS task id),
      • "version": "3",
      • "acquirable": "true".
    • Initializes a new data directory:
      • /usr/local/camunda/data/node-2/v3 is created by copying from /usr/local/camunda/data/node-2/v2.
    • Starts the rest of the services and joins the cluster.
March 25, 2026, 11:24
[2026-03-25 10:24:12.031] [main] INFO io.camunda.zeebe.dynamic.nodeid.fs.VersionedNodeIdBasedDataDirectoryProvider - Initializing data directory /usr/local/camunda/data/node-2/v3 by copying from /usr/local/camunda/data/node-2/v2
orchestration-cluster
March 25, 2026, 11:24
[2026-03-25 10:24:11.480] [NodeIdProvider] INFO io.camunda.zeebe.dynamic.nodeid.RepositoryNodeIdProvider - Acquired lease w/ nodeId=NodeInstance[id=2, version=Version[version=3]]. Initialized[metadata=Metadata[task=Optional[6670d8e5-ec20-4c03-9d99-6993f48b6617], version=Version[version=3], acquirable=true], lease=Lease[taskId=6670d8e5-ec20-4c03-9d99-6993f48b6617, timestamp=1774434266329, nodeInstance=NodeInstance[id=2, version=Version[version=3]], knownVersionMappings=VersionMappings[mappingsByNodeId={2=Version[version=3]}]], eTag="98e0408dfd5c07143686697207d411df"]

The resulting S3 object now represents node 2, version 3, with a fresh lease owned by the new task.
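The versioned data-directory initialization seen in the logs can be sketched roughly like this (illustrative Python following the layout above; not the actual implementation):

```python
import shutil
from pathlib import Path

def init_versioned_dir(base, node_id, new_version):
    """Create node-<id>/v<new> by copying the previous version's data.

    Sketch of what the VersionedNodeIdBasedDataDirectoryProvider log line
    describes: a fresh directory per lease version, seeded from the last one."""
    node_dir = base / f"node-{node_id}"
    new_dir = node_dir / f"v{new_version}"
    prev_dir = node_dir / f"v{new_version - 1}"
    if prev_dir.exists():
        shutil.copytree(prev_dir, new_dir)  # reuse the previous broker state
    else:
        new_dir.mkdir(parents=True)         # first version starts empty
    return new_dir
```

For node 2, `init_versioned_dir(Path("/usr/local/camunda/data"), 2, 3)` would create `.../node-2/v3` from `.../node-2/v2`, as in the log above.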

In ECS:

  • The Tasks view shows three running tasks, all Healthy again.
  • From the cluster’s perspective, we’re back to a stable, 3-broker topology.

Healthy Service

Takeaways

Overall, this experiment shows that when a broker loses its lease and another broker acquires it, the combination of NodeIdProvider safety checks and ECS rescheduling steers the system toward a safe recovery path rather than silent data corruption.

C8 on ECS: Restart Tasks

· 5 min read
Deepthi Akkoorath
Principal Software Engineer @ Camunda
Rodrigo Lopes
Associate Software Engineer @ Zeebe

With 8.9, we support C8 deployments on ECS. Camunda 8 is originally designed for Kubernetes StatefulSets, where each broker has a stable identity and disk. On Amazon ECS, tasks are ephemeral: IPs and container instances change frequently, and you rely on external storage like EFS and S3 instead of node-local disks.

To make this work safely, the Camunda 8 ECS reference architecture introduces a dynamic NodeIdProvider backed by Amazon S3. Each ECS task:

  • Competes for a lease stored in S3 that represents a specific logical broker node ID.
  • When it acquires the lease, it becomes that broker and uses a dedicated directory on shared EFS for its data.
  • Periodically renews the lease; if renewal fails or preconditions are violated, the task shuts down immediately to avoid corrupting data or having two brokers think they own the same node.

The experiments in this post test how well this S3-backed lease mechanism behaves under specific failure scenarios where a task is killed and replaced by a new one.

Experiment

Our first chaos experiment on ECS was simple: what happens to a Camunda 8 cluster on AWS ECS when we kill a single broker task by hand?

The cluster was running Camunda 8 (Zeebe) on AWS ECS with 3 brokers and 3 partitions. Before we started the experiment, the dashboards showed a healthy topology, stable processing and exporting rates. The AWS console confirmed three running, healthy tasks for the orchestration cluster service.

Baseline: healthy 3-broker cluster

At steady state:

  • Cluster topology: 3 brokers, each participating in the 3 partitions as leader or follower.
  • Health: All partitions reported as healthy, with no restarts.
  • Throughput: Processing and exporting metrics were flat and stable.
  • ECS: Service view showed 3/3 tasks running and healthy.

Dashboard showing healthy brokers

AWS console showing healthy tasks

Injecting failure: stopping one ECS task

To inject a failure, we manually stopped one of the ECS tasks for the orchestration cluster from the AWS console.

Stop task from AWS console

This triggers a graceful shutdown of the broker, and we can see that the NodeIdProvider released its S3 lease.

March 25, 2026, 10:34
[2026-03-25 09:34:19.666] [NodeIdProvider] INFO io.camunda.zeebe.dynamic.nodeid.repository.s3.S3NodeIdRepository - Release lease Initialized[metadata=Metadata[task=Optional[03acfc2a-6ff8-4e76-8e56-0a2a4e7227e7], version=Version[version=1], acquirable=true], lease=Lease[taskId=03acfc2a-6ff8-4e76-8e56-0a2a4e7227e7, timestamp=1774431273727, nodeInstance=NodeInstance[id=1, version=Version[version=1]], knownVersionMappings=VersionMappings[mappingsByNodeId={0=Version[version=1], 1=Version[version=1], 2=Version[version=1]}]], eTag="07b2daecf534e87cae5a3993f1102b22"]
orchestration-cluster
March 25, 2026, 10:34
[2026-03-25 09:34:19.638] [SpringApplicationShutdownHook] [{broker-id=Broker-1}] INFO io.camunda.zeebe.broker.system - Broker shut down.
orchestration-cluster

Replacement task and recovery

ECS replaces the stopped task to meet the configured desired task count.

  1. The old task went into deprovisioning and eventually stopped.

Deprovisioning

  2. ECS launched a new task for the same service a couple of minutes later.

Provisioning

  3. On startup, the new broker instance:

  • Acquired the S3 lease for the same logical node with a new version (v2).
  • Copied the previous data directory into a fresh v2 directory (versioned data layout).
March 25, 2026, 10:36
[2026-03-25 09:36:27.555] [main] INFO io.camunda.zeebe.dynamic.nodeid.fs.VersionedNodeIdBasedDataDirectoryProvider - Initializing data directory /usr/local/camunda/data/node-1/v2 by copying from /usr/local/camunda/data/node-1/v1
orchestration-cluster
March 25, 2026, 10:36
[2026-03-25 09:36:27.037] [main] WARN io.camunda.configuration.beanoverrides.BrokerBasedPropertiesOverride - The following legacy property is no longer supported and should be removed in favor of 'camunda.data.exporters': zeebe.broker.exporters
orchestration-cluster
March 25, 2026, 10:36
[2026-03-25 09:36:26.979] [main] WARN io.camunda.configuration.UnifiedConfigurationHelper - The following legacy configuration properties should be removed in favor of 'camunda.data.primary-storage.directory': zeebe.broker.data.directory
orchestration-cluster
March 25, 2026, 10:36
[2026-03-25 09:36:26.912] [NodeIdProvider] INFO io.camunda.zeebe.dynamic.nodeid.RepositoryNodeIdProvider - Acquired lease w/ nodeId=NodeInstance[id=1, version=Version[version=2]]. Initialized[metadata=Metadata[task=Optional[5228b3d3-7cde-4365-b4c5-7afd0ae094cd], version=Version[version=2], acquirable=true], lease=Lease[taskId=5228b3d3-7cde-4365-b4c5-7afd0ae094cd, timestamp=1774431401724, nodeInstance=NodeInstance[id=1, version=Version[version=2]], knownVersionMappings=VersionMappings[mappingsByNodeId={1=Version[version=2]}]], eTag="9f0c6e1c2a92bbaa1fde872e1d545e05"]
orchestration-cluster

The new task becomes healthy and the orchestration cluster service is now fully healthy.

Recovered

What we learned

This first experiment validated that:

  • S3-based leases behave correctly under node loss: when a task is killed, the broker releases its lease, and a new task can safely acquire a new versioned lease.
  • Graceful shutdown still happens under forced task stop: even though we stopped the task from the ECS console, the broker had enough time to drain and shut down its internal components cleanly.
  • Replacement task becomes healthy: the replacement task comes up, reuses the data via a new versioned directory, and rejoins the cluster without any issues.

RTO with varying backup schedules

· 7 min read
Lena Schönburg
Senior Software Engineer @ Zeebe

With the upcoming Camunda 8.9 release, we will support RDBMS as secondary storage as an alternative to Elasticsearch and OpenSearch. Because there is no common API for taking backups of relational databases, we had to revise our approach to backup and restore significantly. We now support a continuous backup mode that allows users to take backups of secondary and primary storage independently from each other. Backups of primary storage will cover a contiguous time range, allowing us to restore from one or multiple primary storage backups to match the state in secondary storage.

In this chaos day, we are testing our Recovery Time Objective (RTO), the time it takes to recover data from backups and become fully operational again, with varying backup schedules. When backups are taken less frequently, each backup covers a longer time window and therefore includes more accumulated log segments. We want to understand how this translates to RTO.

Checkpoint scheduler resiliency

· 8 min read
Panagiotis Goutis
Software Engineer @ Zeebe

With the introduction of RDBMS support in Camunda 8.9, we needed a reliable and consistent mechanism to automatically back up Zeebe's primary storage. It was mandatory to have a resilient scheduling mechanism that was able to survive cluster disruptions. To address this, we introduced scheduled backups, which allow operators to have a deterministic mechanism to back up the cluster's primary storage.

This experiment validates the resiliency of the Checkpoint Scheduler services under adverse conditions. Specifically, we test whether this critical service can maintain its configured schedule when brokers are disconnected and cluster topology changes occur, ensuring continuous backup operation even during partial cluster failures.

Scheduler introduction

To guarantee the continuity and correctness of primary storage backups, it was required to have a cluster-level service responsible for performing backups at predefined intervals. For this reason, we introduced the Checkpoint Scheduler, which serves as the timekeeper of checkpoint creation in the cluster, fanning out the creation of checkpoints to all partitions.

The scheduler is always assigned to the broker with the lowest id that is part of the replication cluster. Under normal operation, that would mean that the camunda-0 pod is the one with the service registered. The scheduler's interval, while preconfigured, is dynamic and will adapt to network issues in a best-effort manner to maintain the desired interval.
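The assignment rule itself is simple and can be sketched as follows (illustrative only):

```python
def scheduler_node(live_broker_ids):
    """The checkpoint scheduler runs on the live broker with the lowest ID
    (sketch of the assignment rule, not the actual implementation)."""
    return min(live_broker_ids) if live_broker_ids else None

print(scheduler_node({0, 1, 2}))  # 0 -> camunda-0 under normal operation
print(scheduler_node({1, 2}))     # 1 -> camunda-1 once camunda-0 is gone
```

This is also why handover is deterministic: every broker can evaluate the same rule against the current cluster membership and agree on who holds the service.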

Currently, it supports two types of checkpoints:

  • MARKER: Used as reference points for point-in-time restore operations
  • SCHEDULED_BACKUP: Triggers a primary storage backup

Alongside the scheduler, a retention service is also registered on the same broker if a retention schedule is configured. This service is responsible for deleting backups outside the configured window to reduce storage costs. Furthermore, backups that are too old are not that useful in a disaster recovery scenario.

The checkpoint scheduler and the retention mechanism can be configured via the available options.

Chaos experiment

Expected outcomes

The expectation of this experiment is to prove that the checkpoint and backup scheduler is resilient to network and topology changes that can occur within a cluster's lifespan.

Setup

In this experiment, we'll be using a standard Camunda 8.9 Kubernetes installation with the checkpoint scheduler and retention enabled. The installation consists of 3 brokers in the camunda stateful set, labeled as camunda-0, camunda-1 and camunda-2.

Enabling the scheduler

To enable the checkpoint & backup schedulers, we supply the following configuration parameters for the camunda stateful set:

CAMUNDA_DATA_PRIMARYSTORAGE_BACKUP_CONTINUOUS=true
CAMUNDA_DATA_PRIMARYSTORAGE_BACKUP_CHECKPOINTINTERVAL=PT1M
CAMUNDA_DATA_PRIMARYSTORAGE_BACKUP_SCHEDULE=PT3M
CAMUNDA_DATA_PRIMARYSTORAGE_BACKUP_RETENTION_WINDOW=PT30M
CAMUNDA_DATA_PRIMARYSTORAGE_BACKUP_RETENTION_CLEANUPSCHEDULE=PT10M

This means that we take a full Zeebe backup every 3 minutes and inject marker checkpoints into the log stream every 1 minute. We also want to maintain a rolling window of 30 minutes' worth of backups, and we check for backups to be deleted every 10 minutes.

Node disconnect

To simulate disconnecting a node from the Camunda Orchestration cluster, we can use a simple Kubernetes network policy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: isolate-pod
  namespace: <namespace>
spec:
  podSelector:
    matchLabels:
      isolated: "true"
  policyTypes:
    - Ingress
    - Egress

This policy selects pods carrying the label isolated: "true" and, since it declares both policy types but no ingress or egress rules, denies all traffic to and from them; disconnecting a node just requires applying this label. For example, kubectl label pod camunda-0 isolated=true --overwrite will result in the pod camunda-0 being removed from the Orchestration cluster.

Experiment

Starting the cluster

Upon applying this configuration we see the following logs being produced, verifying the presence of the scheduler:

logs-image

You may notice that these logs are present on all brokers; this is intentional, as the service being active is based on the intra-cluster discovery protocol, which is updated in real time. This means that when, for example, camunda-0 is considered gone, camunda-1 is ready to take over the service and pick up where it left off.

Also, the first backup has already been taken. As there was no previous backup present, an immediate one is captured. With that in mind, if a cluster sustains a prolonged unhealthy state for more than the configured backup interval, the next backup will be immediately taken once the cluster reaches a healthy state.

Our Grafana dashboard will start displaying scheduler data:

startup-metrics

After the first 3 minutes pass, according to the backup interval, we should have a second backup available:

first-backup-metrics

and the corresponding logs related to it with the matching timestamps:

first-backup-logs

Inspecting the logs further, we see that the initiating node of that backup was the pod camunda-0, which is expected.

first-backup-pod

The checkpoints can also be verified by querying Zeebe's internal state via the actuator. Notice in the following response that the checkpointId for the backups and for the active ranges matches what's seen in the metrics as well.

The result contains a single partition's state to reduce size

curl localhost:9600/actuator/backupRuntime/state

{
  "checkpointStates": [
    {
      "checkpointId": 1773840131794,
      "checkpointType": "MARKER",
      "partitionId": 2,
      "checkpointPosition": 1695,
      "checkpointTimestamp": "2026-03-18T13:22:11.781+0000"
    }
  ],
  "backupStates": [
    {
      "checkpointId": 1773840011037,
      "checkpointType": "SCHEDULED_BACKUP",
      "partitionId": 2,
      "checkpointPosition": 1465,
      "firstLogPosition": 1,
      "checkpointTimestamp": "2026-03-18T13:20:13.001+0000"
    }
  ],
  "ranges": [
    {
      "partitionId": 2,
      "start": {
        "checkpointId": 1773839828707,
        "checkpointType": "SCHEDULED_BACKUP",
        "checkpointPosition": 1095,
        "firstLogPosition": 1,
        "checkpointTimestamp": "2026-03-18T13:17:10.852+0000"
      },
      "end": {
        "checkpointId": 1773840011037,
        "checkpointType": "SCHEDULED_BACKUP",
        "checkpointPosition": 1465,
        "firstLogPosition": 1,
        "checkpointTimestamp": "2026-03-18T13:20:13.001+0000"
      }
    }
  ]
}

Disconnecting camunda-0

Executing kubectl label pod camunda-0 isolated=true --overwrite causes broker 0 to be disconnected from the cluster. For the scheduling service, this effectively means that it should be handed over to the camunda-1 broker. Sure enough, the logs confirm this:

handover-logs

handover-logs-pod

The logs clearly show that the next backup is scheduled in 46,301ms (approximately 46 seconds), which is significantly less than the configured 180,000ms (3-minute) interval. This behavior is intentional and demonstrates the scheduler's dynamic interval adjustment. Rather than waiting a full 3 minutes from the handover, the scheduler calculates the remaining time based on when the last backup completed. This approach ensures that backups remain on schedule even when the scheduler service transfers between brokers, preventing schedule drift that would occur if the interval were reset on each handover.
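The adjustment boils down to scheduling from the last completed backup rather than from the handover; a minimal sketch, using the timings from the logs above:

```python
def next_backup_delay(last_backup_at, interval, now):
    """Time until the next backup is due, measured from the last completed
    backup rather than from the handover (sketch of the scheduler's dynamic
    interval adjustment)."""
    return max(0.0, (last_backup_at + interval) - now)

# With a 180s interval and a handover ~133.7s after the last backup, the new
# scheduler node fires after the remaining ~46.3s, not a full 180s:
print(round(next_backup_delay(last_backup_at=0.0, interval=180.0, now=133.7), 1))
```

If the handover happens after the backup was already due, the delay clamps to zero and a backup is taken immediately, which matches the "immediate backup on recovery" behavior described earlier.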

Disconnecting camunda-1

Applying the isolated label on camunda-1 causes the cluster to reach an unhealthy state, since it has now suffered 2 node losses. The remaining node, camunda-2, cannot form a quorum on its own and is unable to proceed in the startup sequence to start initiating backups.

Rejoining brokers

Removing the label from the disconnected pods,

kubectl label pod camunda-0 isolated-
kubectl label pod camunda-1 isolated-

causes yet another handover, this time back to the camunda-0 node, and the schedule's execution continues as expected.

next-backup

next-backup-execution

Bonus: Retention

Having the cluster running long enough causes retention to kick in, so its metrics and results are also available in the dashboard. We can see that we have 3 backups deleted for each partition.

retention

We also see the earliest backup still present that was not picked up by retention, 1773840375165. Looking in the logs, we can also confirm its execution time, 15:26.

retention-earliest

Since our backups started at 15:17, we expect to have backups available at the following timestamps:

  • 15:17
  • 15:20
  • 15:23
  • 15:26
  • 15:29...

Since retention was executed at 15:55 with a 30-minute window, the cutoff is 15:25: the backups taken at 15:17, 15:20, and 15:23 fall outside the window and were pruned, so the reported count is on par with what's expected.
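This arithmetic can be checked with a short sketch (illustrative Python, using the timestamps above):

```python
from datetime import datetime, timedelta

def prune(backup_times, run_at, window):
    """Backups strictly older than the retention window at run time (sketch
    of the retention rule, not the actual implementation)."""
    cutoff = run_at - window
    return [t for t in backup_times if t < cutoff]

# Backups every 3 minutes starting at 15:17; retention runs at 15:55 with PT30M:
backups = [datetime(2026, 3, 18, 15, 17) + timedelta(minutes=3 * i) for i in range(10)]
pruned = prune(backups, datetime(2026, 3, 18, 15, 55), timedelta(minutes=30))
print([t.strftime("%H:%M") for t in pruned])  # ['15:17', '15:20', '15:23']
```

The earliest surviving backup is then the 15:26 one, matching the checkpoint seen in the dashboard.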

Conclusion

In this experiment, we've validated that the checkpoint and backup scheduler can maintain the configured backup interval while surviving broker disconnects and topology changes. The service successfully demonstrated automatic failover, transferring from camunda-0 to camunda-1 when the primary broker was isolated, and seamlessly returning control when the original broker rejoined the cluster.

Key findings include:

  1. Automatic failover: The scheduler service correctly reassigns to the next available broker with the lowest ID when the current scheduler node becomes unavailable
  2. Dynamic interval adjustment: The scheduler adapts its timing to compensate for disruptions, ensuring backups remain on schedule despite temporary delays
  3. Retention reliability: The retention mechanism functions properly, maintaining the configured rolling window and properly cleaning up expired backups

These results confirm that operators can rely on the checkpoint scheduler to maintain continuous backup operations even with cluster-level disruptions, guaranteeing the system's ability to properly maintain the configured schedule without manual intervention.

Comparing backup stores for scheduled backups

· 5 min read
Panagiotis Goutis
Software Engineer @ Zeebe

With RDBMS support added in Camunda 8.9, we needed a reliable way to back up Zeebe's primary storage. We introduced scheduled backups that allow operators to configure backup intervals. Since backup operations are processed through Zeebe's logstream like other operations, they benefit from the same consistency guarantees that the engine provides.

Since our goal is to achieve the best possible RPO (Recovery Point Objective), i.e. as small as possible, without sacrificing processing throughput, we’ve made several improvements across the supported backup stores. This experiment measures where we currently stand in practice.

Within the bounds of this experiment, we compare backup-store performance across the three major cloud providers: Google Cloud Storage (GCS), AWS S3, and Azure Blob Storage.

Chaos experiment

Expected outcomes

The expectations of this experiment are to:

  • Assess how well scheduled backups meet our RPO requirements under sustained high load.
  • Provide a rule of thumb for configuring the scheduler’s backup interval based on cluster usage and runtime state size.

Setup

The experiment uses a max-throughput benchmark from our Camunda load tests project, running on its own Kubernetes cluster.

The cluster uses a standard setup of 3 Zeebe brokers with 3 partitions and a replication factor of 3. Zeebe brokers are provisioned with 2 GiB of memory and 3 CPUs, similar to a base 1x cluster.

The benchmark scenario is fairly simple:

  • A single service-task process definition
  • An external client creating new process instances at a rate of 300 PIs/sec, including large variables
  • Three workers completing service tasks with a configured delay of 500 ms, also injecting large variables into the process instance scope

We inject large variables to increase Zeebe’s runtime state, which increases the RocksDB snapshot size and therefore the overall required backup size.

Primary storage backup size is influenced by two factors:

  • The size of the cluster's runtime state, represented in RocksDB snapshots
  • The amount of log segments still present in Zeebe's data directory

For RDBMS installations, we highly recommend enabling continuous backups. When continuous backups are enabled, log segment compaction is bound by the latest backed up position. If backups run infrequently on a high-throughput cluster, more segments accumulate between backup runs, increasing the amount of data to upload. To estimate a cluster's segment storage, multiply the atomix_segment_count metric by 128 (the default segment size in MB).
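The rule of thumb at the end of that paragraph translates to a one-liner (the 128 MB default segment size is taken from the text above):

```python
def estimated_segment_storage_mb(atomix_segment_count, segment_size_mb=128):
    """Rough segment storage estimate: segment count times the default
    128 MB segment size."""
    return atomix_segment_count * segment_size_mb

print(estimated_segment_storage_mb(12))  # 1536 (MB) for 12 segments
```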

Experiment

To introduce scheduled backups, we configured the Zeebe brokers as follows:

CAMUNDA_DATA_PRIMARYSTORAGE_BACKUP_CONTINUOUS=true
CAMUNDA_DATA_PRIMARYSTORAGE_BACKUP_CHECKPOINTINTERVAL=PT30S
CAMUNDA_DATA_PRIMARYSTORAGE_BACKUP_SCHEDULE=PT2M

See the documentation for configuration option definitions. In short, we take a full Zeebe backup every 2 minutes and inject marker checkpoints into the log stream every 30 seconds.

Measurement

We sampled results using the provided Grafana dashboard.

  • Backup size: Approximated via RocksDB live data size per partition (metric: zeebe_rocksdb_live_estimate_live_data_size).
  • Backup duration: Captured via the dashboard panel Take Backup Latency. The expression's window was calibrated to 10 minutes (instead of 1 hour) to better capture latency in our setup. With backups taken every two minutes, the latency averages a window of roughly 5 backup executions.

Throughout the experiment, we aimed to maintain >80% cluster load while sustaining ~300 process instances per second.

Results

Results were collected over an average of three benchmark runs for each backup store.

Size   | Google Cloud Storage | AWS S3 | Azure Blob Storage
450 MB | 8s                   | 8s     | 24s
660 MB | 10s                  | 10s    | 26s
800 MB | 13s                  | 11s    | 28s
1 GB   | 15s                  | 13s    | 30s
1.7 GB | 18s                  | 18s    | 35s
2 GB   | 21s                  | 19s    | 40s
3 GB   | 25s                  | 25s    | 50s
4 GB   | 30s                  | 30s    | 78s

Based on the collected data points, all backup stores behave roughly linearly with respect to runtime state size, which makes it straightforward to extrapolate expected backup latency.

snapshot-comparison
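Assuming rough linearity, a simple least-squares fit over the GCS column of the results table gives a usable extrapolation (illustrative sketch; actual numbers will vary with load and backup store):

```python
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept, no external dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = cov / var
    return slope, my - slope * mx

# GCS column of the results table: (runtime state size in GB, duration in s)
sizes = [0.45, 0.66, 0.8, 1.0, 1.7, 2.0, 3.0, 4.0]
durations = [8, 10, 13, 15, 18, 21, 25, 30]
slope, intercept = linear_fit(sizes, durations)
print(f"~{slope:.1f} s per GB")                   # roughly 6 s/GB on GCS
print(f"6 GB -> ~{intercept + slope * 6:.0f} s")  # extrapolated duration
```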

Zeebe actors distinguish between CPU-bound and I/O-bound tasks, with CPU-bound tasks taking precedence. Because our scenario sustains high CPU utilization, backup completion time is impacted. Capturing the same runtime state sizes without load can reduce backup time by up to ~50%.

Conclusion

Improving your RPO means having backups available as close to the failure point as possible. With scheduled backups, this becomes more feasible while being backed by the engine’s processing guarantees.

Runtime state size is only one of the factors affecting backup completion time and provides a good starting reference point. During our experiments, throughput interference was minimal—dare I say barely noticeable.

As a rule of thumb, the backup schedule's interval should be higher than the backup completion latency. Multiple in-flight backups can potentially hinder cluster performance. The provided Grafana dashboards make it straightforward to track these metrics and configure scheduled backups accordingly.

Future work

During these experiments, we also investigated:

  • Utilizing transparent GZIP compression for backup contents
  • Pre-compressing backup contents

These approaches yielded improvements in backup completion latency, but added overhead on the processing side, slightly reducing overall throughput. They were draft, exploratory implementations, kept for future reference.

The improvement most likely to have the largest impact on taking backups is proper RocksDB incremental snapshots, since that would minimize the amount of data required per backup. However, this approach comes with its own problems to tackle and is not that straightforward either.

Elastic restart impact on Camunda

· 6 min read
Christopher Kujawa
Principal Software Engineer @ Camunda

In today's Chaos Day, we explored the impact of Elasticsearch availability on Camunda 8.9+ (testing against main).

While we already tested the resiliency of our system against ES restarts last year (see the previous post), we ran only the OC cluster then. Additionally, certain configurations have been improved since (default replica configurations, etc.).

This time, we wanted to see how the system behaves with OC + ES Exporter + Optimize enabled.

I was joined by Jon and Pranjal, the newest members of the reliability testing team.

TL;DR: While we found that short ES unavailability does not affect processing performance, depending on the configuration, it can affect data availability. For longer outages, this would then also impact Camunda processing. To mitigate this problem, the corresponding exporters should be configured, but the necessary configurations are not properly exposed and need to be fixed in the Helm chart.

data-avail

Experimenting with data availability metric

· 9 min read
Christopher Kujawa
Principal Software Engineer @ Camunda

Happy New Year, everyone 🎉! Time for some chaos experiments again 😃.

In today's chaos day, I was joined by Pranjal, our newest addition to the reliability testing team at Camunda (welcome 🎉).

We planned to experiment with the new data availability metric, which we recently added to our load testing infrastructure; for more details, see the related PR. In short, we measure the time from creating a process instance until it is actually available to the user via the API. This allows us to reason about how long it takes for Operate to show new data.

The goal for today was to gain a better understanding of how the system behaves under higher loads and how this affects data availability. The focus was set here on the orchestration cluster, meaning data availability for Operate and Tasklist.

TL;DR: We have observed that increasing the process instance creation rate results in higher data availability times. While experimenting with different workloads, we discovered that the typical load test is still not working well. During our investigation of the platform behaviors, we found a recently introduced regression that is limiting our general maximum throughput. We also identified suboptimal error handling in the Gateway, which causes request retries and can exacerbate load issues.

comparison-latency.png

Building Confidence at Scale: How Camunda Ensures Platform Reliability Through Continuous Testing

· 8 min read
Christopher Kujawa
Principal Software Engineer @ Camunda

As businesses increasingly rely on process automation for their critical operations, the question of reliability becomes paramount. How can you trust that your automation platform will perform consistently under pressure, recover gracefully from failures, and maintain performance over time?

At Camunda, we've been asking ourselves these same questions for years, and today I want to share how our reliability testing practices have evolved to ensure our platform meets the demanding requirements of enterprise-scale deployments. I will also outline our plans to further invest in this crucial area.

Stress testing Camunda

· 12 min read
Christopher Kujawa
Principal Software Engineer @ Camunda

In today's chaos experiment, we focused on stress-testing the Camunda 8 orchestration cluster under high-load conditions. We simulated a large number of concurrent process instances to evaluate the performance of processing and system reliability.

Due to our recent work in supporting load tests for different versions, we were able to compare how different Camunda versions handle stress.

TL;DR: Overall, we saw that all versions of the Camunda 8 orchestration cluster (with focus on the processing) are robust and can handle high loads effectively and reliably. In consideration of throughput and latency, with similar resource allocation among the brokers, 8.7.x outperforms other versions. If we consider our streamlined architecture (which now contains more components in a single application) and align the resources for 8.8.x, it can achieve similar throughput levels as 8.7.x, while maintaining significantly lower latency (a factor of 2). An overview of the results can be found in the Results section below.

info

[Update: 28.11.2025]

After the initial analysis, we conducted further experiments with 8.8 to understand why the measured processing performance was lower compared to 8.7.x. The blog post (including TL;DR) has been updated with the new findings in the section Further Experiments below.