Resilience of dynamic scaling

· 3 min read
Deepthi Akkoorath
Senior Software Engineer @ Zeebe

With version 8.8, we introduced the ability to add new partitions to an existing Camunda cluster. This experiment aimed to evaluate the resilience of the scaling process under disruptive conditions.

Summary:

  • Several bugs were identified during testing.
  • After addressing these issues, scaling succeeded even when multiple nodes were restarted during the operation.

Chaos experiment

Expected

The scaling operation should complete successfully, even if multiple nodes are restarted during the process.

Setup

We initialized a cluster with the following configuration:

  • clusterSize = 3
  • partitionCount = 3
  • replicationFactor = 3

The cluster was running under a steady load, completing 150 jobs/s.

The target configuration was:

  • clusterSize = 6
  • partitionCount = 6
  • replicationFactor = 3

Experiment

We first started the new brokers using:

kubectl scale statefulset <zeebe-statefulset> --replicas=6

Next, we triggered the scaling operation via Camunda's management API:

curl -X 'PATCH' \
  'http://localhost:9600/actuator/cluster' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "brokers": {
      "add": [3,4,5]
    },
    "partitions": {
      "count": 6,
      "replicationFactor": 3
    }
  }'

Once the API request succeeded and scaling began, we started restarting pods one at a time using kubectl delete pod <zeebe-n>, choosing pods in random order. Each pod was restarted multiple times throughout the experiment.

Results

During the initial run, scaling stalled at a specific step. The cluster status showed that bootstrapping partition 6 was not progressing:

  "pendingChange": {
"id": 2,
"status": "IN_PROGRESS",
"completed": [
...
],
"pending": [
{
"operation": "PARTITION_BOOTSTRAP",
"brokerId": 5,
"partitionId": 6,
"priority": 3,
"brokers": []
},
...
]
}

Investigation uncovered a bug. After fixing it, we repeated the experiment.

In the second attempt, we observed that after bootstrapping a new partition, adding a second replica to that partition became stuck. This revealed an unhandled edge case in the Raft join protocol. Since manual pod restarts are non-deterministic, reproducing this issue was challenging. To address this, we incorporated the scenario into our randomized Raft test, which exposed additional edge cases that could block the join operation. These were subsequently fixed. Thanks to the improved test coverage, we are now confident that these issues have been resolved.

After applying the fixes, we reran the experiment. 🎉 This time, scaling completed successfully, even with repeated node restarts.

The scaling began at 11:35, with node restarts occurring during the operation.

Pod restarts during scaling

Throughput was severely impacted by multiple brokers being restarted at the same time.

Throughput affected due to restarts

But the scaling process still completed within 15 minutes.

scaling

The cluster status query shows the last completed scaling operation.

"lastChange": {
"id": 2,
"status": "COMPLETED",
"startedAt": "2025-10-02T11:35:59.852+0000",
"completedAt": "2025-10-02T11:49:32.796+0000"
},

While the operation would have finished faster without node failures, the key takeaway is that it remained resilient in the face of disruptions.

Participants

  • @deepthi

REST API: From ForkJoin to a Dedicated Thread Pool

· 7 min read
Berkay Can
Software Engineer @ Zeebe

During the latest REST API Performance load tests, we discovered that REST API requests suffered from significantly higher latency under CPU pressure, even when throughput numbers looked comparable. While adding more CPU cores alleviated the issue, this wasn’t a sustainable solution — it hinted at an inefficiency in how REST handled broker responses. See related section from the previous blog post.

This blog post is about how we diagnosed the issue, what we found, and the fix we introduced in PR #36517 to close the performance gap.

The Problem

A difference we spotted early between REST API and gRPC request handling was the usage of the BrokerClient:

  • gRPC: BrokerClient calls are wrapped with retries and handled directly in the request thread.
  • REST: requests are executed without retries, and responses are handled asynchronously using the common ForkJoinPool.

On clusters with 2 CPUs, the JVM defaults to a single thread for the common ForkJoinPool. Our expectation was that this could cause contention: one thread might not be fast enough to process responses in time, leading to delays in the Gateway ↔ Broker request-response cycle.
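To make this concrete, here is a minimal sketch (not Zeebe code; the class name, executor size, and sample pipeline are illustrative) that prints the common pool's parallelism and shows how completion stages can be handed to a dedicated executor instead of the shared pool:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ForkJoinPool;

public class CommonPoolCheck {
  public static void main(String[] args) {
    // With 2 available processors, the common pool defaults to
    // availableProcessors - 1 = 1 worker thread.
    System.out.println("CPUs: " + Runtime.getRuntime().availableProcessors());
    System.out.println("Common pool parallelism: " + ForkJoinPool.commonPool().getParallelism());

    // Handing continuations to a dedicated executor keeps response handling
    // off the shared pool (executor size here is purely illustrative).
    ExecutorService restExecutor = Executors.newFixedThreadPool(4);
    CompletableFuture.supplyAsync(() -> "broker response", restExecutor)
        .thenApplyAsync(String::toUpperCase, restExecutor)
        .thenAccept(System.out::println)
        .join();
    restExecutor.shutdown();
  }
}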

The Solution Journey

Solving this issue wasn’t a straight line — we tried a few approaches before landing on the final design. Each iteration gave us valuable insights about Java’s thread pool usage and its impact on REST API performance.

1. First Attempt: SynchronousQueue

We began with a custom ThreadPoolExecutor that used a SynchronousQueue for task handoff:

new ThreadPoolExecutor(
    corePoolSize,
    maxPoolSize,
    keepAliveSeconds,
    TimeUnit.SECONDS,
    new SynchronousQueue<>(),
    threadFactory,
    new ThreadPoolExecutor.CallerRunsPolicy());

A SynchronousQueue has no capacity — each task submission must be handed off directly to a free worker thread. If no worker is available and the pool is already at its maximum size, the rejection policy kicks in; with CallerRunsPolicy, the calling thread ends up executing the task itself and is effectively blocked until the work is done.

In practice, this meant concurrency was artificially limited: bursts of REST requests had to wait for a thread to become available, reducing parallelism. The results were modest:

  • Request rate: unchanged
  • Average latency: improved slightly (from ~120 ms → ~100 ms)
  • CPU throttling: dropped only marginally (100% → 90%)

Before any changes were introduced, the results looked like this: syncqueue-req-before syncqueue-cpu-before

Then we benchmarked again after introducing the dedicated executor with the SynchronousQueue: syncqueue-req-after syncqueue-cpu-after

This hinted we were on the right track — a dedicated executor helped — but the queue strategy was too restrictive.

2. Experiment: ArrayBlockingQueue + AbortPolicy

Next, we switched to an ArrayBlockingQueue with a capacity of 64. This allowed the pool to buffer short micro-bursts of requests instead of blocking immediately. At the same time, we replaced CallerRunsPolicy with AbortPolicy:

new ThreadPoolExecutor.AbortPolicy();
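For illustration, the fail-fast variant looked roughly like the sketch below (the class name, core/max pool sizes, keep-alive, and thread factory are placeholders; only the queue capacity of 64 and the AbortPolicy reflect the experiment described here):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class FailFastExecutorSketch {

  // The bounded queue absorbs short micro-bursts; AbortPolicy throws a
  // RejectedExecutionException once the queue and the pool are both full.
  static ThreadPoolExecutor create() {
    final int cores = Runtime.getRuntime().availableProcessors();
    return new ThreadPoolExecutor(
        cores,                         // corePoolSize (placeholder)
        cores * 2,                     // maxPoolSize (placeholder)
        60, TimeUnit.SECONDS,          // keep-alive for idle extra threads (placeholder)
        new ArrayBlockingQueue<>(64),  // capacity of 64, as described above
        Executors.defaultThreadFactory(),
        new ThreadPoolExecutor.AbortPolicy());
  }
}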

The idea was to fail fast on saturation: if the queue filled up and no thread was free, the executor would throw RejectedExecutionException immediately. This boosted the measured request rate, but at a cost:

  • Many requests were simply rejected outright.
  • Measuring true performance became tricky, since high throughput numbers hid the rejections.
  • Operationally, this wasn’t acceptable — REST clients would constantly see errors under load.

Because of this, we abandoned the fail-fast approach.

3. Final Decision: CallerRunsPolicy + Higher Max Pool Size

Finally, we returned to CallerRunsPolicy. Instead of rejecting tasks, this policy makes the caller thread execute the task itself when the pool is saturated. This introduces natural backpressure: clients slow down automatically when the system is busy, without dropping requests.

To give the executor more headroom, we also increased the maximum pool size from availableProcessors * 2 to availableProcessors * 8.
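Putting these pieces together, a minimal sketch of the final shape looks like this (the actual implementation is in PR #36517; the class name, core pool size, and keep-alive below are illustrative):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class RestResponseExecutorSketch {

  // Bounded queue for micro-bursts plus CallerRunsPolicy: when the pool is
  // saturated, the submitting thread runs the task itself, which slows
  // producers down (backpressure) instead of rejecting requests.
  static ThreadPoolExecutor create() {
    final int cores = Runtime.getRuntime().availableProcessors();
    return new ThreadPoolExecutor(
        cores,                         // corePoolSize (illustrative)
        cores * 8,                     // maxPoolSize: the cores x 8 headroom described above
        60, TimeUnit.SECONDS,          // reclaim idle extra threads (illustrative keep-alive)
        new ArrayBlockingQueue<>(64),  // queue capacity compared further below
        Executors.defaultThreadFactory(),
        new ThreadPoolExecutor.CallerRunsPolicy());
  }
}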

This combination made the breakthrough:

  • Request rate (REST): stabilized around 150 RPS with spikes up to 200 RPS
  • Average latency: dropped dramatically (from ~120 ms → ~25 ms)
  • CPU throttling: reduced significantly (100% → 30-40%)

Here are the final results: final-decision-result-rest final-decision-result-cpu

This design struck the right balance: elastic concurrency, backpressure, and resource efficiency.

Key Takeaways

  1. SynchronousQueue limits concurrency — good for handoff semantics, but too restrictive for REST workloads.
  2. Fail-fast rejection looks good in benchmarks but fails in production — clients can’t handle widespread request errors.
  3. CallerRunsPolicy provides natural backpressure — throughput stabilizes without dropping requests, and latency improves.
  4. CPU-aware max pool sizing matters — scaling pool size relative to cores unlocks performance gains.

More Benchmarking

To validate that our executor change holds up across configurations, we ran extra tests on our benchmark cluster provisioned with 2 vCPUs per Camunda application.

1) Comparing Max Pool Size (×4, ×8, ×16)

We ran the same workload while varying maxPoolSize = availableProcessors × {4, 8, 16}. Below are the peak values observed in the Grafana panels shown in the screenshots:

maxPoolSizeMultiplier=4 max-pool-size-multiplier-4

maxPoolSizeMultiplier=8 max-pool-size-multiplier-8

maxPoolSizeMultiplier=16 max-pool-size-multiplier-16

Multiplier | Request Rate (proc-instances) | Request Rate (completion) | Avg Latency (proc-instances) | Avg Latency (completion)
×4         | ~51.6 req/s                   | ~42.3 req/s               | ~40.2 ms                     | ~57.4 ms
×8         | ~144.4 req/s                  | ~144.7 req/s              | ~21.4 ms                     | ~24.1 ms
×16        | ~22.7 req/s                   | ~19.5 req/s               | ~38.4 ms                     | ~55.4 ms

What this suggests (in our setup):

  • ×8 is the clear sweet spot: highest sustained throughput with the lowest average latencies.
  • ×4 under-provisions the pool (lower RPS, higher latency).
  • ×16 shows diminishing/negative returns (likely scheduler contention or oversubscription): much lower RPS and latencies drifting back up.

Takeaway: In our setup, ×8 balances elasticity and scheduling overhead, delivering the best throughput–latency trade-off.

2) Comparing Queue Capacity (16 vs 64 vs 256)

We varied the executor queue capacity and compared 16, 64 (our current/default for this run), and 256 under the same workload.
Below are the peak values observed in the Grafana panels for the two hot endpoints:

  • POST /v2/process-instances
  • POST /v2/jobs/{jobKey}/completion

queueCapacity=16
queue-capacity-16

queueCapacity=64
queue-capacity-64

queueCapacity=256
queue-capacity-256

Measured summary

Queue Capacity | Request Rate (proc-instances) | Request Rate (completion) | Avg Latency (proc-instances) | Avg Latency (completion)
16             | ~78.2 req/s                   | ~56.0 req/s               | ~28.2 ms                     | ~40.8 ms
64             | ~144.4 req/s                  | ~144.7 req/s              | ~21.4 ms                     | ~24.1 ms
256            | ~80.2 req/s                   | ~61.2 req/s               | ~29.3 ms                     | ~43.0 ms

What this suggests (in our setup):

  • Queue = 64 is the clear sweet spot: highest sustained throughput (~144 req/s on hot endpoints) with the lowest avg latencies (~21–24 ms). Likely large enough to absorb micro-bursts, but small enough to avoid long queue waits.
  • Queue = 16 under-buffers: lower RPS (~78 / ~56 req/s) and higher latency (~28–41 ms). With CallerRunsPolicy, the queue fills quickly and the caller runs tasks often → frequent backpressure throttles producers.
  • Queue = 256 shows diminishing/negative returns relative to 64: lower RPS (~80 / ~61 req/s) and higher latency (~29–43 ms). The big buffer hides saturation, adding queueing delay before execution without delivering extra useful work at the same CPU budget.

Conclusion

Moving off the common ForkJoinPool to a dedicated, CPU-aware executor with bounded queueing and CallerRunsPolicy backpressure turned an overload problem into graceful degradation: fewer 5xxs, steadier RPS, and far lower tail latency under the same CPU budget.

Final takeaways

  • Isolation beats sharing. A dedicated pool prevents noisy neighbors from the common ForkJoinPool.
  • Backpressure beats drops. CallerRunsPolicy slows producers when saturated, stabilizing the system without mass rejections.
  • Right-sized knobs matter. maxPoolSize ≈ cores × 8 and queueCapacity ≈ 64 hit the best throughput/latency balance in our runs; smaller queues over-throttle, larger queues hide saturation and add wait time.
  • Results are environment-specific. At higher core counts, the sweet spot may shift—re-benchmark when CPUs or workload mix change.

Note: In this experiment, we focused on CPU-bound scenarios.

Resiliency against ELS unavailability

· 9 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

Due to recent initiatives and architecture changes, we are now even more tightly coupled to the secondary storage (often Elasticsearch, but it can also be OpenSearch or, in the future, an RDBMS).

We now have a single application that runs the web apps, Gateway, Broker, Exporters, etc., together, including the new Camunda Exporter, which exports all necessary data to the secondary storage. On bootstrap, we need to create the expected schema so that our components work as expected, allowing the Operate and Tasklist web apps to consume the data and the exporter to export correctly. Furthermore, we have a new query API (REST API) that allows searching the data available in the secondary storage.

We have seen in previous experiments and load tests that an unavailable Elasticsearch (ELS) cluster and improperly configured index replicas can cause issues like the exporter not catching up or queries not succeeding. See the related GitHub issue.

In today's Chaos day, we want to play around with the replicas setting of the indices, which can be configured in the Camunda Exporter (the component in charge of writing the data to the secondary storage).

TL;DR; Without index replicas set, the Camunda Exporter is directly impacted by ELS node restarts. The query API seems to handle restarts transparently, but the resulting data can change. Setting replicas causes some performance impact, as the ELS nodes might run into CPU throttling (they have much more to do). ELS slowing down impacts processing as well, due to our write throttling mechanics. This means we need to be careful with this setting: while it gives us better availability (the Camunda Exporter can continue when ELS nodes restart), it comes at some cost.

Dynamic Scaling: probing linear scalability

· 7 min read
Carlo Sana
Senior Software Engineer @ Zeebe

Hypothesis

The objective of this chaos day is to estimate the scalability of Zeebe when brokers and partitions are scaled together: we expect to see the system scale linearly with the number of brokers/partitions in terms of throughput and backpressure, while maintaining predictable latency.

General Experiment setup

To test this, we ran a benchmark using the latest alpha version of Camunda 8.8.0-alpha6, with the old ElasticsearchExporter disabled, and the new CamundaExporter enabled. We also made sure Raft leadership was balanced before starting the test, meaning each broker is leader for exactly one partition, and we turned on partition scaling by adding the following environment variable:

  • ZEEBE_BROKER_EXPERIMENTAL_FEATURES_ENABLEPARTITIONSCALING=true

Each broker also has an SSD-class volume with 32 GB of disk space, limiting it to a few thousand IOPS. The processing load was 150 processes per second, with a large payload of 32 KiB each. Each process instance has a single service task:

one-task

The processing load is generated by our own benchmarking application.

Initial cluster configuration

To test this hypothesis, we will start with a standard configuration of the Camunda orchestration cluster:

  • 3 nodes
  • 3 partitions
  • CPU limit: 2
  • Memory limit: 2 GB

We will increase the load through a load generator in fixed increments until the nodes start showing constant non-zero backpressure, which is a sign that the system has hit its throughput limits.

Target cluster configuration

Once that throughput level is reached, we will scale brokers and partitions to the new target values while the cluster is under load:

  • 6 nodes
  • 6 partitions
  • CPU limit: 2
  • Memory limit: 2 GB

Experiment

We expect that backpressure and latencies might worsen during the scaling operation, but only temporarily: once the scaling operation has completed, the additional load it generates is gone.

Then, we will execute the same procedure as above until we reach twice the critical throughput found before.

Expectation

If the system scales linearly, we expect to see similar performance metrics for similar values of the ratio of process instances (created/completed) per second to the number of partitions.

Steady state

The system is started with a throughput of 150 Process instances created per second. As this is a standard benchmark configuration, nothing unexpected happens:

  • The same number of process instances are completed as the ones created
  • The expected number of jobs is completed per unit of time

At this point, we have the following topology:

initial-topology

First benchmark: 3 brokers and 3 partitions

Let's start increasing the load incrementally, by adding 30 Process instances/s for every step.

Time  | Brokers | Partitions | Throughput           | CPU Usage          | Throttling (CPU) | Backpressure
09:30 | 3       | 3          | 150 PI/s, 150 jobs/s | 1.28 / 1.44 / 1.02 | 12% / 7% / 1%    | 0
09:49 | 3       | 3          | 180 PI/s, 180 jobs/s | 1.34 / 1.54 / 1.12 | 20% / 17% / 2%   | 0
10:00 | 3       | 3          | 210 PI/s, 210 jobs/s | 1.79 / 1.62 / 1.33 | 28% / 42% / 4%   | 0
10:12 | 3       | 3          | 240 PI/s, 240 jobs/s | 1.77 / 1.95 / 1.62 | 45% / 90% / 26%  | 0/0.5%

At 240 Process Instances spawned per second, the system starts to hit the limits: CPU usage @ 240 PI/s CPU throttling @ 240 PI/s

And the backpressure is not zero anymore: Backpressure @ 240 PI/s

  • The CPU throttling reaches almost 90% on one node (probably because only one node is acting as the gateway, as previously noted)
  • Backpressure is now constantly above zero; even if it's just 0.5%, it's a sign that we are reaching the throughput limits.

Second part of the benchmark: scaling to 6 brokers and 6 partitions

With 240 process instances per second being spawned, we send the commands to scale the cluster.

We first scale the Zeebe statefulset to 6 brokers. As soon as the new brokers are running, even before they are healthy, we can send the command to include them in the cluster and to increase the number of partitions to 6.

This can be done following the guide in the official docs.

Once the scaling has completed, as can be seen from the Cluster operation section in the dashboard, the newly created partitions participate in the workload.

We now have the following topology:

six-partitions-topology

As before, let's increase the load incrementally, following the same steps as with the initial cluster configuration.

Time  | Brokers | Partitions | Throughput | CPU Usage                     | Throttling (CPU)        | Backpressure              | Notes
10:27 | 6       | 6          | 240 PI/s   | 0.92/1.26/0.74/0.94/0.93/0.93 | 2.8/6.0/0.3/2.8/3.4/3.18 | 0                         | After scale up
11:05 | 6       | 6          | 300 PI/s   | 1.17/1.56/1.06/1.23/1.19/1.18 | 9%/29%/0.6%/9%/11%/10%   | 0                         | Stable
11:10 | 6       | 6          | 360 PI/s   | 1.39/1.76/1.26/1.43/1.37/1.42 | 19%/42%/2%/16%/21%/22%   | 0                         | Stable
11:10 | 6       | 6          | 420 PI/s   | 1.76/1.89/1.50/1.72/1.50/1.70 | 76%/84%/52%/71%/60%/65%  | 0 (spurts on 1 partition) | Pushing hard

However, at 11:32 one of the workers restarted, causing a spike in processing as jobs were yielded back to the engine, while fewer jobs were activated and thus fewer completed. This caused a job backlog to build up in the engine. Once the worker was back up, the backlog was drained, leading to a spike in job completion requests: around 820 req/s, as opposed to the expected 420 req/s.

Because of this extra load, the cluster started to consume even more CPU, resulting in heavy CPU throttling from the cloud provider.

CPU usage @ 420 PI/s CPU throttling @ 420 PI/s

On top of this, a broker eventually restarted (most likely because we run on spot VMs). To continue with our test, we scaled the load down to 60 PI/s to give the cluster time to heal.

Once the cluster was healthy again, we raised the throughput back to 480 PI/s to verify the scalability with twice as much throughput as the initial configuration.

The cluster was able to sustain 480 process instances per second with backpressure levels similar to the initial configuration:

Backpressure @ 480 PI/s

We can see below that CPU usage is high, and there is still some throttling, indicating we might be able to do more with a little bit of vertical scaling, or by scaling out and reducing the number of partitions per broker:

CPU usage @ 480 PI/s CPU throttling

Conclusion

We were able to verify that the cluster can scale almost linearly with new brokers and partitions, so long as the other components, like the secondary storage, workers, connectors, etc., are able to sustain a similar increase in throughput.

In particular, making sure that the secondary storage can keep up with the throughput turned out to be crucial for keeping the cluster stable and avoiding filling up the Zeebe disks, which would bring the cluster to a halt.

We encountered a similar issue when a worker restarts: initially a backlog of unhandled jobs builds up, which turns into a massive increase in requests per second when the worker comes back, as it starts activating jobs faster than the cluster can complete them.

Finally, with this specific test, it would be interesting to explore the limits of vertical scalability, as we often saw CPU throttling being a major blocker for processing. This would make for an interesting future experiment.

Follow up REST API performance

· 20 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

Investigating REST API performance

This post collates the experiments, findings, and lessons learned during the REST API performance investigation.

We didn't identify one explicit root cause. As is often the case with such performance issues, it is a combination of several things.

Quintessence: the REST API is more CPU-intensive than gRPC. You can read more about this in the conclusion. We discovered ~10 issues to follow up on, of which at least 2-3 might have a significant impact on performance. Details can be found in the Discovered issues section.

Performance of REST API

· 7 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

In today's Chaos day we wanted to experiment with the new REST API (v2) as a replacement for our previously used gRPC API.

By default, our load tests use gRPC, but since we want to make the REST API the default and fully release it with 8.8, we want to test it accordingly with regard to reliability.

TL;DR; We observed a severe performance regression when using the REST API, even when job streaming (over gRPC) is in use by the job workers. Our client also seems to have higher memory consumption, which caused some instabilities in our tests. With the new API, we lack certain observability, which makes it harder to dive into the details. We should investigate this further and find potential bottlenecks and improvements.


How does Zeebe behave with NFS

· 13 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

This week, we (Lena, Nicolas, Roman, and I) held a workshop where we looked into how Zeebe behaves with network file storage (NFS).

We ran several experiments with NFS and Zeebe, messing around with connectivity.

TL;DR; We were able to show that NFS can handle certain connectivity issues, which just cause Zeebe to process more slowly. If we completely lose the connection to the NFS server, several issues can arise, like IOExceptions on flush (where Raft goes into inactive mode) or SIGBUS errors on reads (e.g. during replay), causing the JVM to crash.

Lower memory consumption of Camunda deployment

· 9 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

I'm back to finally do some load testing again.

In the past months, we have changed our architecture. Instead of deploying all of our components as separate deployments, we now have one single statefulset. This statefulset runs our single Camunda standalone application, combining all components.

simpler deployment

We will share more details on this change in a separate blog post. For simplicity, in our load tests (benchmark Helm charts), we combined all the resources that were previously split over multiple deployments; see the related PR #213.

We are currently running our test with the following resources by default:

Limits:
  cpu: 2
  memory: 12Gi
Requests:
  cpu: 2
  memory: 6Gi

In today's Chaos day, I want to look into our resource consumption and whether we can reduce our used requests and limits.

TL;DR; We focused on experimenting with different memory resources, and were able to show that we can reduce the used memory by 75% and our previously provisioned resources by more than 80% for our load tests.

News from Camunda Exporter project

· 4 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

In this Chaos day, we want to verify the current state of the exporter project and run benchmarks with it. Comparing it with a previous version (v8.6.6) should give us a good indication of the current state and potential improvements.

TL;DR; The latency of user data availability has improved due to our architecture change, but we still need to fix some bugs before our planned release of the Camunda Exporter. This experiment allowed us to detect three new bugs; fixing them should make the system more stable.

Impact of Camunda Exporter on processing performance

· 5 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

In our last Chaos day we experimented with the Camunda Exporter MVP. After our MVP we continued with Iteration 2, where we migrated the Archiver deployments and added a new Migration component (which allows us to harmonize indices).

Additionally, some fixes and improvements have been done to the realistic benchmarks that should allow us to better compare the general performance with a realistic good performing benchmark.

Actually, this is what we want to explore and experiment with today.

  • Does the Camunda Exporter (since the last benchmark) impact performance of the overall system?
    • If so, how?
  • How can we potentially mitigate this?

TL;DR; Today's results showed that enabling the Camunda Exporter causes a 25% processing throughput drop. We identified the CPU as a bottleneck. The drop seems to be mitigated by either adjusting the CPU requests or removing the ES exporter. With these results, we are equipped to make further investigations and decisions.