C8 on ECS: Restart Tasks

March 25, 2026 · 5 min read

Principal Software Engineer @ Camunda

Associate Software Engineer @ Zeebe

With 8.9, we support C8 deployments on ECS. Camunda 8 is originally designed for Kubernetes StatefulSets, where each broker has a stable identity and disk. On Amazon ECS, tasks are ephemeral: IPs and container instances change frequently, and you rely on external storage like EFS and S3 instead of node-local disks.

To make this work safely, the Camunda 8 ECS reference architecture introduces a dynamic NodeIdProvider backed by Amazon S3. Each ECS task:

Competes for a lease stored in S3 that represents a specific logical broker node ID.
When it acquires the lease, it becomes that broker and uses a dedicated directory on shared EFS for its data.
Periodically renews the lease; if renewal fails or preconditions are violated, the task shts down immediately to avoid corrupting data or having two brokers think they own the same node.

The experiments in this post tests how well this S3-backed lease mechanism behaves under specfic failure scenarios where a task is killed and replaced by a new one.

Experiment

Our first chaos experiment on ECS was simple: what happens to a Camunda 8 cluster on AWS ECS when we kill a single broker task by hand?

The cluster was running Camunda 8 (Zeebe) on AWS ECS with 3 brokers and 3 partitions. Before we started the experiment, the dashboards showed a healthy topology, stable processing and exporting rates. The AWS console confirmed three running, healthy tasks for the orchestration cluster service.

Baseline: healthy 3-broker cluster

At steady state:

Cluster topology: 3 brokers, each participating in the 3 partitions as leader or follower.
Health: All partitions reported as healthy, with no restarts.
Throughput: Processing and exporting metrics were flat and stable.
ECS: Service view showed 3/3 tasks running and healthy.

Dashboard showing healthy brokers

AWS console showing healthy tasks

Injecting failure: stopping one ECS task

To inject a failure, we manually stopped one of the ECS tasks for the orchestration cluster from the AWS console.

Stop task from AWS console

This triggers a graceful shutdown of the broker, and we can see that NodeIdProvider released its S3 lease.

March 25, 2026, 10:34
[2026-03-25 09:34:19.666] [NodeIdProvider] INFO io.camunda.zeebe.dynamic.nodeid.repository.s3.S3NodeIdRepository - Release lease Initialized[metadata=Metadata[task=Optional[03acfc2a-6ff8-4e76-8e56-0a2a4e7227e7], version=Version[version=1], acquirable=true], lease=Lease[taskId=03acfc2a-6ff8-4e76-8e56-0a2a4e7227e7, timestamp=1774431273727, nodeInstance=NodeInstance[id=1, version=Version[version=1]], knownVersionMappings=VersionMappings[mappingsByNodeId={0=Version[version=1], 1=Version[version=1], 2=Version[version=1]}]], eTag="07b2daecf534e87cae5a3993f1102b22"]
orchestration-cluster
March 25, 2026, 10:34
[2026-03-25 09:34:19.638] [SpringApplicationShutdownHook] [{broker-id=Broker-1}] INFO io.camunda.zeebe.broker.system - Broker shut down.
orchestration-cluster```

Replacement task and recovery

ECS replaces the stopped task to meet the configured desired task count.

The old task went into deprovisioning and eventually stopped.

Deprovisioning

ECS launched a new task for the same service a couple of minutes later.

Provisioning 3. On startup, the new broker instance:

Acquired the S3 lease for the same logical node with a new version (v2).
Copied the previous data directory into a fresh v2 directory (versioned data layout).

March 25, 2026, 10:36
[2026-03-25 09:36:27.555] [main] INFO io.camunda.zeebe.dynamic.nodeid.fs.VersionedNodeIdBasedDataDirectoryProvider - Initializing data directory /usr/local/camunda/data/node-1/v2 by copying from /usr/local/camunda/data/node-1/v1
orchestration-cluster
March 25, 2026, 10:36
[2026-03-25 09:36:27.037] [main] WARN io.camunda.configuration.beanoverrides.BrokerBasedPropertiesOverride - The following legacy property is no longer supported and should be removed in favor of 'camunda.data.exporters': zeebe.broker.exporters
orchestration-cluster
March 25, 2026, 10:36
[2026-03-25 09:36:26.979] [main] WARN io.camunda.configuration.UnifiedConfigurationHelper - The following legacy configuration properties should be removed in favor of 'camunda.data.primary-storage.directory': zeebe.broker.data.directory
orchestration-cluster
March 25, 2026, 10:36
[2026-03-25 09:36:26.912] [NodeIdProvider] INFO io.camunda.zeebe.dynamic.nodeid.RepositoryNodeIdProvider - Acquired lease w/ nodeId=NodeInstance[id=1, version=Version[version=2]]. Initialized[metadata=Metadata[task=Optional[5228b3d3-7cde-4365-b4c5-7afd0ae094cd], version=Version[version=2], acquirable=true], lease=Lease[taskId=5228b3d3-7cde-4365-b4c5-7afd0ae094cd, timestamp=1774431401724, nodeInstance=NodeInstance[id=1, version=Version[version=2]], knownVersionMappings=VersionMappings[mappingsByNodeId={1=Version[version=2]}]], eTag="9f0c6e1c2a92bbaa1fde872e1d545e05"]
orchestration-cluster

The new task becomes healthy and the orchestration cluster service is now fully healthy.

Recovered

What we learned

This first experiment validated that:

S3-based leases behave correctly under node loss: when a task is killed, the broker releases its lease, and a new task can safely acquire a new versioned lease.
Graceful shutdown still happens under forced task stop: even though we stopped the task from the ECS console, the broker had enough time to drain and shut down its internal components cleanly.
Replace task becomes healthy: the replacement task comes up, reuses the data via a new versioned directory, and rejoins the cluster without any issues.

RTO with varying backup schedules

March 19, 2026 · 7 min read

Lena Schönburg

Senior Software Engineer @ Zeebe

With the upcoming Camunda 8.9 release, we will support RDBMS as secondary storage as an alternative to Elasticsearch and OpenSearch. Because there is no common API for taking backups of relational databases, we had to revise our approach to backup and restore significantly. We now support a continuous backup mode that allows users to take backups of secondary and primary storage independently from each other. Backups of primary storage will cover a contiguous time range, allowing us to restore from one or multiple primary storage backups to match the state in secondary storage.

In this chaos day, we are testing our Recovery Time Objective (RTO), the time it takes to recover data from backups and become fully operational again, with varying backup schedules. When backups are taken less frequently, each backup covers a longer time window and therefore includes more accumulated log segments. We want to understand how this translates to RTO.

Checkpoint scheduler resiliency

March 18, 2026 · 8 min read

Panagiotis Goutis

Software Engineer @ Zeebe

With the introduction of RDBMS support in Camunda 8.9, we needed a reliable and consistent mechanism to automatically back up Zeebe's primary storage. It was mandatory to have a resilient scheduling mechanism that was able to survive cluster disruptions. To address this, we introduced scheduled backups, which allow operators to have a deterministic mechanism to back up the cluster's primary storage.

This experiment validates the resiliency of the Checkpoint Scheduler services under adverse conditions. Specifically, we test whether this critical service can maintain its configured schedule when brokers are disconnected and cluster topology changes occur, ensuring continuous backup operation even during partial cluster failures.

Scheduler introduction

To guarantee the continuity and correctness of primary storage backups, it was required to have a cluster-level service responsible for performing backups at predefined intervals. For this reason, we introduced the Checkpoint Scheduler, which serves as the timekeeper of checkpoint creation in the cluster, fanning out the creation of checkpoints to all partitions.

The scheduler is always assigned to the broker with the lowest id that is part of the replication cluster. Under normal operation, that would mean that the camunda-0 pod is the one with the service registered. The scheduler's interval, while preconfigured, is dynamic and will adapt to network issues in a best-effort manner to maintain the desired interval.

Currently, it supports two types of checkpoints:

MARKER: Used as reference points for point-in-time restore operation
SCHEDULED_BACKUP: Trigger a primary storage backup

Alongside the scheduler, a retention service is also registered on the same broker if a retention schedule is configured. This service is responsible for deleting backups outside the configured window to reduce storage costs. Furthermore, backups that are too old are not that useful in a disaster recovery scenario.

The checkpoint scheduler and the retention mechanism can be configured via the available options.

Chaos experiment

Expected outcomes

The expectation of this experiment is to prove that the checkpoint and backup scheduler is resilient to network and topology changes that can occur within a cluster's lifespan.

Setup

In this experiment, we'll be using a standard Camunda 8.9 Kubernetes installation with the checkpoint scheduler and retention enabled. The installation consists of 3 brokers in the camunda stateful set, labeled as camunda-0, camunda-1 and camunda-2.

Enabling the scheduler

To enable the checkpoint & backup schedulers, we supply the following configuration parameters for the camunda stateful set:

CAMUNDA_DATA_PRIMARYSTORAGE_BACKUP_CONTINUOUS=true
CAMUNDA_DATA_PRIMARYSTORAGE_BACKUP_CHECKPOINTINTERVAL=PT1M
CAMUNDA_DATA_PRIMARYSTORAGE_BACKUP_SCHEDULE=PT3M
CAMUNDA_DATA_PRIMARYSTORAGE_BACKUP_RETENTION_WINDOW=PT30M
CAMUNDA_DATA_PRIMARYSTORAGE_BACKUP_RETENTION_CLEANUPSCHEDULE=PT10M

This means that we take a full Zeebe backup every 3 minutes and inject marker checkpoints into the log stream every 1 minute. We also want to maintain a rolling window of 30 minutes' worth of backups, and we check for backups to be deleted every 10 minutes.

Node disconnect

To simulate disconnecting a node from the Camunda Orchestration cluster, we can use a simple Kubernetes network policy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: isolate-pod
  namespace: <namespace>
spec:
  podSelector:
    matchLabels:
      isolated: "true"
  policyTypes:
  - Ingress
  - Egress

This policy matches the given label isolated present on deployed pods; disconnecting a node just requires applying this label. For example, kubectl label pod camunda-0 isolated=true --overwrite will result in the pod camunda-0 being removed from the Orchestration cluster.

Experiment

Starting the cluster

Upon applying this configuration we see the following logs being produced, verifying the presence of the scheduler:

logs-image

You may notice that these logs are present on all brokers; this is intentional, as the service being active is based on the intra-cluster discovery protocol, which is updated in real-time. This means that when, for example, camunda-0 is considered gone, camunda-1 is ready to take over the service and pick up where it should.

Also, the first backup has already been taken. As there was no previous backup present, an immediate one is captured. With that in mind, if a cluster sustains a prolonged unhealthy state for more than the configured backup interval, the next backup will be immediately taken once the cluster reaches a healthy state.

Our Grafana dashboard will start displaying scheduler data:

startup-metrics

After the first 3 minutes pass, according to the backup interval, we should have a second backup available:

first-backup-metrics

and the corresponding logs related to it with the matching timestamps:

first-backup-logs

Inspecting the logs further, we see that the initiating node of that backup was the pod camunda-0, which is expected.

first-backup-pod

The checkpoints can also be verified by querying Zeebe's internal state via the actuator. Notice in the following response that the checkpointId for the backups and for the active ranges matches what's seen in the metrics as well.

The result contains a single partition's state to reduce size

curl localhost:9600/actuator/backupRuntime/state

{
  "checkpointStates": [
    {
      "checkpointId": 1773840131794,
      "checkpointType": "MARKER",
      "partitionId": 2,
      "checkpointPosition": 1695,
      "checkpointTimestamp": "2026-03-18T13:22:11.781+0000"
    }
  ],
  "backupStates": [
    {
      "checkpointId": 1773840011037,
      "checkpointType": "SCHEDULED_BACKUP",
      "partitionId": 2,
      "checkpointPosition": 1465,
      "firstLogPosition": 1,
      "checkpointTimestamp": "2026-03-18T13:20:13.001+0000"
    }
  ],
  "ranges": [
    {
      "partitionId": 2,
      "start": {
        "checkpointId": 1773839828707,
        "checkpointType": "SCHEDULED_BACKUP",
        "checkpointPosition": 1095,
        "firstLogPosition": 1,
        "checkpointTimestamp": "2026-03-18T13:17:10.852+0000"
      },
      "end": {
        "checkpointId": 1773840011037,
        "checkpointType": "SCHEDULED_BACKUP",
        "checkpointPosition": 1465,
        "firstLogPosition": 1,
        "checkpointTimestamp": "2026-03-18T13:20:13.001+0000"
      }
    }
  ]
}

Disconnecting camunda-0

Executing kubectl label pod camunda-0 isolated=true --overwrite causes broker 0 to be disconnected from the cluster. For the scheduling service, this effectively means that it should be handed over to the camunda-1 broker. Sure enough, the logs confirm this:

handover-logs

handover-logs-pod

The logs clearly show that the next backup is scheduled in 46,301ms (approximately 46 seconds), which is significantly less than the configured 180,000ms (3-minute) interval. This behavior is intentional and demonstrates the scheduler's dynamic interval adjustment. Rather than waiting a full 3 minutes from the handover, the scheduler calculates the remaining time based on when the last backup completed. This approach ensures that backups remain on schedule even when the scheduler service transfers between brokers, preventing schedule drift that would occur if the interval were reset on each handover.

Disconnecting camunda-1

Applying the isolated label on camunda-1 causes the cluster to reach an unhealthy state, since it has now suffered 2 node losses. The remaining node, camunda-2, cannot form a cluster and is unable to proceed in the startup sequence to start initiating backups.

Rejoining brokers

Removing the label from the disconnected pods,

kubectl label pod camunda-0 isolated-
kubectl label pod camunda-1 isolated-

causes yet another handover, this time back to the camunda-0 node, and the schedule's execution continues as expected.

next-backup

next-backup-execution

Bonus: Retention

Having the cluster running long enough causes retention to kick in, so its metrics and results are also available in the dashboard. We can see that we have 3 backups deleted for each partition.

retention

We also see the earliest backup still present that was not picked up by retention, 1773840375165. Looking in the logs, we can also confirm its execution time, 15:26.

Since our backups started at 15:17, we expect to have backups available at the following timestamps:

15:17
15:20
15:23
15:26
15:29...

Since retention was executed at 15:55, the reported count of backups pruned is on par with what's expected. Backups taken at 15:17, 15:20, and 15:23 all satisfy being 30 minutes before the retention mechanism execution.

Conclusion

In this experiment, we've validated that the checkpoint and backup scheduler can maintain the configured backup interval while surviving broker disconnects and topology changes. The service successfully demonstrated automatic failover, transferring from camunda-0 to camunda-1 when the primary broker was isolated, and seamlessly returning control when the original broker rejoined the cluster.

Key findings include:

Automatic failover: The scheduler service correctly reassigns to the next available broker with the lowest ID when the current scheduler node becomes unavailable
Dynamic interval adjustment: The scheduler adapts its timing to compensate for disruptions, ensuring backups remain on schedule despite temporary delays
Retention reliability: The retention mechanism functions properly, maintaining the configured rolling window and properly cleaning up expired backups

These results confirm that operators can rely on the checkpoint scheduler to maintain continuous backup operations even with cluster-level disruptions, guaranteeing the system's ability to properly maintain the configured schedule without manual intervention.

Comparing backup stores for scheduled backups

March 17, 2026 · 5 min read

Panagiotis Goutis

Software Engineer @ Zeebe

With RDBMS support added in Camunda 8.9, we needed a reliable way to back up Zeebe's primary storage. We introduced scheduled backups that allow operators to configure backup intervals. Since backup operations are processed through Zeebe's logstream like other operations, they benefit from the same consistency guarantees that the engine provides.

Since our goal is to achieve the highest possible RPO (Recovery Point Objective) without sacrificing processing throughput, we’ve made several improvements across the supported backup stores. This experiment measures where we currently stand in practice.

Within the bounds of this experiment, we compare backup-store performance across the three major cloud providers: Google Cloud Storage (GCS), AWS S3, and Azure Blob Storage.

Chaos experiment

Expected outcomes

The expectations of this experiment are to:

Assess how well scheduled backups meet our RPO requirements under sustained high load.
Provide a rule of thumb for configuring the scheduler’s backup interval based on cluster usage and runtime state size.

Setup

The experiment uses a max-throughput benchmark from our Camunda load tests project, running on its own Kubernetes cluster.

The cluster uses a standard setup of 3 Zeebe brokers with 3 partitions and a replication factor of 3. Zeebe brokers are provisioned with 2 GiB of memory and 3 CPUs, similar to what a base 1x cluster is.

The benchmark scenario is fairly simple:

A single service-task process definition
An external client creating new process instances at a rate of 300 PIs/sec, including large variables
Three workers completing service tasks with a configured delay of 500 ms, also injecting large variables into the process instance scope

We inject large variables to increase Zeebe’s runtime state, which increases the RocksDB snapshot size and therefore the overall required backup size.

Primary storage backup size is influenced by two factors:

The size of the cluster's runtime state, represented in RocksDB snapshots
The amount of log segments still present in Zeebe's data directory

For RDBMS installations, we highly recommend enabling continuous backups. When continuous backups are enabled, log segment compaction is bound by the latest backed up position. If backups run infrequently on a high-throughput cluster, more segments accumulate between backup runs, increasing the amount of data to upload. To estimate a cluster's segment storage, multiply the atomix_segment_count metric by 128 (the default segment size in MB).

Experiment

To introduce scheduled backups, we configured the Zeebe brokers as follows:

CAMUNDA_DATA_PRIMARYSTORAGE_BACKUP_CONTINUOUS=true
CAMUNDA_DATA_PRIMARYSTORAGE_BACKUP_CHECKPOINTINTERVAL=PT30S
CAMUNDA_DATA_PRIMARYSTORAGE_BACKUP_SCHEDULE=PT2M

See the documentation for configuration option definitions. In short, we take a full Zeebe backup every 2 minutes and inject marker checkpoints into the log stream every 30 seconds.

Measurement

We sampled results using the provided Grafana dashboard.

Backup size: Approximated via RocksDB live data size per partition (metric: zeebe_rocksdb_live_estimate_live_data_size).
Backup duration: Captured via the dashboard panel Take Backup Latency. The expression's window was calibrated to 10 minutes (instead of 1 hour) to better capture latency in our setup. With backups taken every two minutes, the latency averages a window of roughly 5 backup executions

_Throughout the experiment, we aimed to maintain >80% cluster load while sustaining ~300 process instances per second.

Results

Results were collected over an average of three benchmark runs for each backup store.

Size	Google Cloud Storage	AWS S3	Azure Blob Storage
450MB	8s	8s	24s
660MB	10s	10s	26s
800MB	13s	11s	28s
1GB	15s	13s	30s
1.7GB	18s	18s	35s
2GB	21s	19s	40s
3GB	25s	25s	50s
4GB	30s	30s	78s

Based on the collected data points, all backup stores behave roughly linearly with respect to runtime state size, which makes it straightforward to extrapolate expected backup latency.

snapshot-comparison

Zeebe actors distinguish between CPU-bound and I/O-bound tasks, with CPU-bound tasks taking precedence. Because our scenario sustains high CPU utilization, backup completion time is impacted. Capturing the same runtime state sizes without load can reduce backup time by up to ~50%.

Conclusion

Maximizing your RPO means having backups available as close to the failure point as possible. With scheduled backups, this becomes more feasible while being backed up by the engine’s processing guarantees.

Runtime state size is only one of the factors affecting backup completion time and provides a good starting reference point. During our experiments, throughput interference was minimal—dare I say barely noticeable.

As a rule of thumb, the backup schedule's interval should be higher than the backup completion latency. Multiple in-flight backups can potentially hinder cluster performance. The provided Grafana dashboards make it straightforward to track these metrics and configure scheduled backups accordingly.

Future work

During these experiments, we also investigated:

Utilizing transparent GZIP compression for backup contents
Pre-compressing backup contents

These approaches yielded improvements in backup completion latency, but added overhead on the processing side, slightly reducing overall throughput. These were draft investigatory implementations for future reference.

An improvement that would most likely have the largest impact in taking a backups is performing proper RocksDB incremental snapshots, since that would minimize the amount of data required per backup. However, this approach come with it's own problems to tackle, not being that straightforward either.

Elastic restart impact on Camunda

March 12, 2026 · 6 min read

Christopher Kujawa

Principal Software Engineer @ Camunda

In today's Chaos Day, we explored the impact of Elasticsearch availability on Camunda 8.9+ (testing against main).

While we already tested last year the resiliency of our System against ES restarts (see previous post, we have run the OC cluster only. Additionally, certain configurations have been improved (default replica configurations, etc.).

This time, we wanted to see how the system behaves with OC + ES Exporter + Optimize enabled.

I was joined by Jon and Pranjal, the newest members of the reliability testing team.

TL;DR; While we found that short ES unavailability does not affect processing performance, depending on the configuration, it can affect data availability. For longer outages, this would then also impact Camunda processing. To mitigate this problem, corresponding exporters should be configured, but the necessary configurations are not properly exposed and need to be fixed in the Helm Chart.

data-avail

Experimenting with data availability metric

January 8, 2026 · 9 min read

Christopher Kujawa

Principal Software Engineer @ Camunda

Happy New Year, everyone 🎉! Time for some chaos experiments again 😃.

In today's chaos day, I was joined by Pranjal, our newest addition to the reliability testing team at Camunda (welcome 🎉)

We planned to experiment with the new data availability metric, which we have recently added to our load testing infrastructure, for more details see related PR. In short, we measure the time from creating a process instance until it is actually available to the user via the API. This allows us to reason how long it also takes for Operate to show new data.

The goal for today was to gain a better understanding of how the system behaves under higher loads and how this affects data availability. The focus was set here on the orchestration cluster, meaning data availability for Operate and Tasklist.

TL;DR: We have observed that increasing the process instance creation rate results in higher data availability times. While experimenting with different workloads, we discovered that the typical load test is still not working well. During our investigation of the platform behaviors, we found a recently introduced regression that is limiting our general maximum throughput. We also identified suboptimal error handling in the Gateway, which causes request retries and can exacerbate load issues.

Building Confidence at Scale: How Camunda Ensures Platform Reliability Through Continuous Testing

December 11, 2025 · 8 min read

Christopher Kujawa

Principal Software Engineer @ Camunda

As businesses increasingly rely on process automation for their critical operations, the question of reliability becomes paramount. How can you trust that your automation platform will perform consistently under pressure, recover gracefully from failures, and maintain performance over time?

At Camunda, we've been asking ourselves these same questions for years, and today I want to share how our reliability testing practices have evolved to ensure our platform meets the demanding requirements of enterprise-scale deployments. I will also outline our plans to further invest in this crucial area.

Stress testing Camunda

November 27, 2025 · 12 min read

Christopher Kujawa

Principal Software Engineer @ Camunda

In today's chaos experiment, we focused on stress-testing the Camunda 8 orchestration cluster under high-load conditions. We simulated a large number of concurrent process instances to evaluate the performance of processing and system reliability.

Due to our recent work in supporting load tests for different versions, we were able to compare how different Camunda versions handle stress.

TL;DR: Overall, we saw that all versions of the Camunda 8 orchestration cluster (with focus on the processing) are robust and can handle high loads effectively and reliably. In consideration of throughput and latency, with similar resource allocation among the brokers, 8.7.x outperforms other versions. If we consider our streamlined architecture (which now contains more components in a single application) and align the resources for 8.8.x, it can achieve similar throughput levels as 8.7.x, while maintaining significantly lower latency (a factor of 2). An overview of the results can be found in the Results section below.

info

[Update: 28.11.2025]

After the initial analysis, we conducted further experiments with 8.8 to understand why the measured processing performance was lower compared to 8.7.x. The blog post (including TL;DR) has been updated with the new findings in the section Further Experiments below.

Testing retention of historical PIs in Camunda 8.8

October 31, 2025 · 4 min read

Rodrigo Lopes

Associate Software Engineer @ Zeebe

Summary:

With Camunda 8.8, a new unified Camunda Exporter is introduced that directly populates data records consumable by read APIs on the secondary storage. This significantly reduces the time until eventually consistent data becomes available on Get and Search APIs. It also removes unnecessary duplication across multiple indices due to the previous architecture.

This architectural change prompted us to re-run the retention tests to compare PI retention in historical indexes under the same conditions as Camunda 8.7.

The historical data refers to exported data from configured exporters, such as records of completed process instances, tasks, and events that are no longer part of the active (runtime) state but are retained for analysis, auditing, or reporting.

The goal of this experiment is to compare the amount of PIs that we can retain in historical data between Camunda 8.7 and 8.8 running with the same hardware.

Resilience of dynamic scaling

October 2, 2025 · 3 min read

Deepthi Akkoorath

Principal Software Engineer @ Camunda

With version 8.8, we introduced the ability to add new partitions to an existing Camunda cluster. This experiment aimed to evaluate the resilience of the scaling process under disruptive conditions.

Summary:

Several bugs were identified during testing.
After addressing these issues, scaling succeeded even when multiple nodes were restarted during the operation.

Experiment

Baseline: healthy 3-broker cluster​

Injecting failure: stopping one ECS task​

Replacement task and recovery​

What we learned​

Scheduler introduction​

Chaos experiment​

Expected outcomes​

Setup​

Enabling the scheduler​

Node disconnect​

Experiment​

Starting the cluster​

Disconnecting camunda-0​

Disconnecting camunda-1​

Rejoining brokers​

Bonus: Retention​

Conclusion​

Chaos experiment​

Expected outcomes​

Setup​

Experiment​

Measurement​

Results​

Conclusion​

Future work​

Baseline: healthy 3-broker cluster

Injecting failure: stopping one ECS task

Replacement task and recovery

What we learned

Scheduler introduction

Chaos experiment

Expected outcomes

Setup

Enabling the scheduler

Node disconnect

Experiment

Starting the cluster

Disconnecting camunda-0

Disconnecting camunda-1

Rejoining brokers

Bonus: Retention

Conclusion

Chaos experiment

Expected outcomes

Setup

Experiment

Measurement

Results

Conclusion

Future work