
43 posts tagged with "availability"


REST API and OIDC

· 9 min read
Christopher Kujawa
Principal Software Engineer @ Camunda
Pranjal Goyal
Senior Software Engineer @ Reliability Testing Team
Jonathan Ballet
Senior Software Engineer @ Reliability Testing Team

Over the past weeks, we have been spending more time improving our load testing and reliability testing coverage. One of the things we did was to enable the REST API (by default, we tend to use gRPC).

While doing so, we observed a weird load pattern. It seems to occur when the REST API is enabled in our load tester clients together with OIDC.

On today's Chaos day, we wanted to verify how the system behaves when using the REST API and OIDC together, and how this changes under different loads and versions. We also validated whether this was related to the cluster configuration (testing with SaaS).

TL;DR; We were seeing recurring throughput drops, especially at higher load (300 PIs); at lower load they were not visible. The issue was reproducible in 8.8 as well, so it was not related to the changes in 8.9. We couldn't reproduce the pattern in SaaS, as we weren't able to achieve the same load with the small clusters we used. While experimenting, we discovered several areas for improvement. The root cause turned out to be JWT tokens expiring while requests were queued in the Apache HttpAsyncClient connection pool. Nic fixed this by moving token injection to after connection acquisition in #50124 🚀
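A small simulation (hypothetical names and a simulated clock; not the actual client code) illustrates the failure mode and the fix: if the token is attached when the request is queued, it can expire before a pooled connection becomes available, while attaching it after connection acquisition keeps it fresh.

```python
TOKEN_TTL = 0.05  # token lifetime; real JWTs live minutes, scaled down here

def make_token(now):
    return {"issued": now, "expires": now + TOKEN_TTL}

def is_expired(token, now):
    return now >= token["expires"]

def send_request(queue_delay, inject_after_acquire):
    """Simulate one request waiting for a pooled connection."""
    t = 0.0
    if not inject_after_acquire:
        token = make_token(t)   # token attached before queuing (the buggy order)
    t += queue_delay            # request sits in the connection pool queue
    if inject_after_acquire:
        token = make_token(t)   # token attached after acquisition (the fix)
    return "401 expired token" if is_expired(token, t) else "200 ok"
```

When the queue wait exceeds the token lifetime, only the second ordering survives: `send_request(0.1, inject_after_acquire=False)` yields `401 expired token`, while `send_request(0.1, inject_after_acquire=True)` yields `200 ok`.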

rest-bug

C8 on ECS: Restart Tasks

· 5 min read
Deepthi Akkoorath
Principal Software Engineer @ Camunda
Rodrigo Lopes
Associate Software Engineer @ Zeebe

With 8.9, we support C8 deployments on ECS. Camunda 8 was originally designed for Kubernetes StatefulSets, where each broker has a stable identity and disk. On Amazon ECS, tasks are ephemeral: IPs and container instances change frequently, and you rely on external storage like EFS and S3 instead of node-local disks.

To make this work safely, the Camunda 8 ECS reference architecture introduces a dynamic NodeIdProvider backed by Amazon S3. Each ECS task:

  • Competes for a lease stored in S3 that represents a specific logical broker node ID.
  • When it acquires the lease, it becomes that broker and uses a dedicated directory on shared EFS for its data.
  • Periodically renews the lease; if renewal fails or preconditions are violated, the task shuts down immediately to avoid corrupting data or having two brokers think they own the same node.
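The lease protocol above can be sketched as follows. This is a simplified model, not the actual `S3NodeIdRepository` code: an in-memory dictionary stands in for the S3 bucket, and `put_if_match` mimics a conditional write guarded by an ETag precondition; all names are hypothetical.

```python
import uuid

LEASE_TTL = 30  # seconds until a lease is considered expired (illustrative)

class LeaseStore:
    """In-memory stand-in for the S3 bucket holding lease objects."""
    def __init__(self):
        self.objects = {}  # node_id -> (etag, lease)

    def get(self, node_id):
        return self.objects.get(node_id)

    def put_if_match(self, node_id, lease, expected_etag):
        """Conditional write: succeeds only if the stored ETag still matches."""
        current = self.objects.get(node_id)
        if current is not None and current[0] != expected_etag:
            return None  # precondition violated: another task updated the lease
        etag = uuid.uuid4().hex
        self.objects[node_id] = (etag, lease)
        return etag

def acquire(store, node_id, task_id, now):
    """Compete for the lease of a logical broker node; returns an ETag on success."""
    current = store.get(node_id)
    if current is not None and now < current[1]["timestamp"] + LEASE_TTL:
        return None  # lease still held by another task
    expected = current[0] if current else None
    return store.put_if_match(node_id, {"task": task_id, "timestamp": now}, expected)

def renew(store, node_id, task_id, etag, now):
    """Periodic renewal; a None result means the task must shut down immediately."""
    return store.put_if_match(node_id, {"task": task_id, "timestamp": now}, etag)
```

A second task competing for the same node ID either finds the lease unexpired (acquire fails) or loses the conditional write; a holder renewing with a stale ETag gets `None` and must stop, which is exactly the "shut down immediately" behavior described above.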

The experiments in this post test how well this S3-backed lease mechanism behaves under specific failure scenarios where a task is killed and replaced by a new one.

Experiment

Our first chaos experiment on ECS was simple: what happens to a Camunda 8 cluster on AWS ECS when we kill a single broker task by hand?

The cluster was running Camunda 8 (Zeebe) on AWS ECS with 3 brokers and 3 partitions. Before we started the experiment, the dashboards showed a healthy topology, stable processing and exporting rates. The AWS console confirmed three running, healthy tasks for the orchestration cluster service.

Baseline: healthy 3-broker cluster

At steady state:

  • Cluster topology: 3 brokers, each participating in the 3 partitions as leader or follower.
  • Health: All partitions reported as healthy, with no restarts.
  • Throughput: Processing and exporting metrics were flat and stable.
  • ECS: Service view showed 3/3 tasks running and healthy.

Dashboard showing healthy brokers

AWS console showing healthy tasks

Injecting failure: stopping one ECS task

To inject a failure, we manually stopped one of the ECS tasks for the orchestration cluster from the AWS console.

Stop task from AWS console

This triggers a graceful shutdown of the broker, and we can see that NodeIdProvider released its S3 lease.

March 25, 2026, 10:34
[2026-03-25 09:34:19.666] [NodeIdProvider] INFO io.camunda.zeebe.dynamic.nodeid.repository.s3.S3NodeIdRepository - Release lease Initialized[metadata=Metadata[task=Optional[03acfc2a-6ff8-4e76-8e56-0a2a4e7227e7], version=Version[version=1], acquirable=true], lease=Lease[taskId=03acfc2a-6ff8-4e76-8e56-0a2a4e7227e7, timestamp=1774431273727, nodeInstance=NodeInstance[id=1, version=Version[version=1]], knownVersionMappings=VersionMappings[mappingsByNodeId={0=Version[version=1], 1=Version[version=1], 2=Version[version=1]}]], eTag="07b2daecf534e87cae5a3993f1102b22"]
orchestration-cluster
March 25, 2026, 10:34
[2026-03-25 09:34:19.638] [SpringApplicationShutdownHook] [{broker-id=Broker-1}] INFO io.camunda.zeebe.broker.system - Broker shut down.
orchestration-cluster

Replacement task and recovery

ECS replaces the stopped task to meet the configured desired task count.

  1. The old task went into deprovisioning and eventually stopped.

Deprovisioning

  2. ECS launched a new task for the same service a couple of minutes later.

Provisioning

  3. On startup, the new broker instance:

  • Acquired the S3 lease for the same logical node with a new version (v2).
  • Copied the previous data directory into a fresh v2 directory (versioned data layout).
March 25, 2026, 10:36
[2026-03-25 09:36:27.555] [main] INFO io.camunda.zeebe.dynamic.nodeid.fs.VersionedNodeIdBasedDataDirectoryProvider - Initializing data directory /usr/local/camunda/data/node-1/v2 by copying from /usr/local/camunda/data/node-1/v1
orchestration-cluster
March 25, 2026, 10:36
[2026-03-25 09:36:27.037] [main] WARN io.camunda.configuration.beanoverrides.BrokerBasedPropertiesOverride - The following legacy property is no longer supported and should be removed in favor of 'camunda.data.exporters': zeebe.broker.exporters
orchestration-cluster
March 25, 2026, 10:36
[2026-03-25 09:36:26.979] [main] WARN io.camunda.configuration.UnifiedConfigurationHelper - The following legacy configuration properties should be removed in favor of 'camunda.data.primary-storage.directory': zeebe.broker.data.directory
orchestration-cluster
March 25, 2026, 10:36
[2026-03-25 09:36:26.912] [NodeIdProvider] INFO io.camunda.zeebe.dynamic.nodeid.RepositoryNodeIdProvider - Acquired lease w/ nodeId=NodeInstance[id=1, version=Version[version=2]]. Initialized[metadata=Metadata[task=Optional[5228b3d3-7cde-4365-b4c5-7afd0ae094cd], version=Version[version=2], acquirable=true], lease=Lease[taskId=5228b3d3-7cde-4365-b4c5-7afd0ae094cd, timestamp=1774431401724, nodeInstance=NodeInstance[id=1, version=Version[version=2]], knownVersionMappings=VersionMappings[mappingsByNodeId={1=Version[version=2]}]], eTag="9f0c6e1c2a92bbaa1fde872e1d545e05"]
orchestration-cluster
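The versioned data layout seen in the `VersionedNodeIdBasedDataDirectoryProvider` log above can be sketched like this (a simplified illustration with a hypothetical helper, not the actual implementation): the new lease version gets its own directory, seeded by copying the previous version's data, which is left untouched.

```python
import shutil
from pathlib import Path

def init_versioned_dir(base: Path, node_id: int, version: int) -> Path:
    """Prepare data/node-<id>/v<version>, seeding it from the previous
    version's directory if one exists (the old copy stays intact)."""
    node_dir = base / f"node-{node_id}"
    target = node_dir / f"v{version}"
    if target.exists():
        return target  # already initialized for this lease version
    previous = node_dir / f"v{version - 1}"
    if previous.exists():
        shutil.copytree(previous, target)  # e.g. copy v1 -> v2
    else:
        target.mkdir(parents=True)  # first version starts empty
    return target
```

Keeping the previous version's directory intact means a botched startup of the new lease version never corrupts the data the old broker instance wrote.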

The new task becomes healthy, and the orchestration cluster service is fully healthy again.

Recovered

What we learned

This first experiment validated that:

  • S3-based leases behave correctly under node loss: when a task is killed, the broker releases its lease, and a new task can safely acquire a new versioned lease.
  • Graceful shutdown still happens under forced task stop: even though we stopped the task from the ECS console, the broker had enough time to drain and shut down its internal components cleanly.
  • Replacement task becomes healthy: the replacement task comes up, reuses the data via a new versioned directory, and rejoins the cluster without any issues.

RTO with varying backup schedules

· 7 min read
Lena Schönburg
Senior Software Engineer @ Zeebe

With the upcoming Camunda 8.9 release, we will support RDBMS as secondary storage as an alternative to Elasticsearch and OpenSearch. Because there is no common API for taking backups of relational databases, we had to revise our approach to backup and restore significantly. We now support a continuous backup mode that allows users to take backups of secondary and primary storage independently from each other. Backups of primary storage will cover a contiguous time range, allowing us to restore from one or multiple primary storage backups to match the state in secondary storage.

In this chaos day, we are testing our Recovery Time Objective (RTO), the time it takes to recover data from backups and become fully operational again, with varying backup schedules. When backups are taken less frequently, each backup covers a longer time window and therefore includes more accumulated log segments. We want to understand how this translates to RTO.
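The relationship between backup schedule and RTO can be phrased as a back-of-the-envelope model (all numbers and the linear replay assumption are illustrative, not measured values from this post): recovery pays a fixed restore cost plus replay of the log segments accumulated since the last backup, and a longer backup interval means more segments to replay.

```python
def estimated_rto(backup_interval_s, segment_rate_per_s, restore_fixed_s, replay_s_per_segment):
    """Rough RTO model: fixed restore cost plus replay of the log segments
    accumulated since the last backup (worst case: a full interval's worth)."""
    segments = backup_interval_s * segment_rate_per_s
    return restore_fixed_s + segments * replay_s_per_segment

# Illustrative numbers: hourly vs. daily backups at the same segment rate.
hourly = estimated_rto(3600, 0.01, 120, 5)   # -> 300.0 seconds
daily = estimated_rto(86400, 0.01, 120, 5)   # -> 4440.0 seconds
```

Under this model the replay term dominates as the interval grows, which is exactly why we want to measure how backup frequency translates to RTO in practice.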

Elastic restart impact on Camunda

· 6 min read
Christopher Kujawa
Principal Software Engineer @ Camunda

In today's Chaos Day, we explored the impact of Elasticsearch availability on Camunda 8.9+ (testing against main).

While we already tested the resiliency of our system against ES restarts last year (see previous post), we ran only the OC cluster. Additionally, certain configurations have been improved since then (default replica configurations, etc.).

This time, we wanted to see how the system behaves with OC + ES Exporter + Optimize enabled.

I was joined by Jon and Pranjal, the newest members of the reliability testing team.

TL;DR; We found that short ES unavailability does not affect processing performance, but depending on the configuration, it can affect data availability. Longer outages would then also impact Camunda processing. To mitigate this problem, the corresponding exporters should be configured, but the necessary configurations are not properly exposed and need to be fixed in the Helm Chart.

data-avail

Experimenting with data availability metric

· 9 min read
Christopher Kujawa
Principal Software Engineer @ Camunda

Happy New Year, everyone 🎉! Time for some chaos experiments again 😃.

In today's chaos day, I was joined by Pranjal, our newest addition to the reliability testing team at Camunda (welcome 🎉)

We planned to experiment with the new data availability metric, which we have recently added to our load testing infrastructure; for more details, see the related PR. In short, we measure the time from creating a process instance until it is actually available to the user via the API. This also allows us to reason about how long it takes for Operate to show new data.
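The metric can be sketched as a measurement loop (a simplified illustration with hypothetical callables, not our actual load testing code): start a timer at creation, then poll the query API until the instance is readable from secondary storage.

```python
import time

def measure_data_availability(create_instance, query_instance, timeout=30.0, poll=0.05):
    """Time from creating a process instance until the query API can see it.
    Both callables are hypothetical stand-ins: `create_instance()` returns the
    new instance key, `query_instance(key)` returns True once it is readable."""
    start = time.monotonic()
    key = create_instance()
    while time.monotonic() - start < timeout:
        if query_instance(key):
            return time.monotonic() - start
        time.sleep(poll)
    raise TimeoutError(f"instance {key} not visible after {timeout}s")
```

The measured duration is dominated by how far the exporter lags behind the log, so under higher load (longer export backlogs) we expect this number to grow.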

The goal for today was to gain a better understanding of how the system behaves under higher loads and how this affects data availability. The focus was set here on the orchestration cluster, meaning data availability for Operate and Tasklist.

TL;DR: We have observed that increasing the process instance creation rate results in higher data availability times. While experimenting with different workloads, we discovered that the typical load test is still not working well. During our investigation of the platform behaviors, we found a recently introduced regression that is limiting our general maximum throughput. We also identified suboptimal error handling in the Gateway, which causes request retries and can exacerbate load issues.

comparison-latency.png

REST API: From ForkJoin to a Dedicated Thread Pool

· 7 min read
Berkay Can
Software Engineer @ Zeebe

During the latest REST API Performance load tests, we discovered that REST API requests suffered from significantly higher latency under CPU pressure, even when throughput numbers looked comparable. While adding more CPU cores alleviated the issue, this wasn’t a sustainable solution — it hinted at an inefficiency in how REST handled broker responses. See related section from the previous blog post.

This blog post is about how we diagnosed the issue, what we found, and the fix we introduced in PR #36517 to close the performance gap.

Resiliency against ELS unavailability

· 11 min read
Christopher Kujawa
Principal Software Engineer @ Camunda

Due to recent initiatives and architecture changes, we have coupled ourselves even more tightly to the secondary storage (often Elasticsearch, but it can also be OpenSearch or, in the future, RDBMS).

We now have one single application to run the Webapps, Gateway, Broker, Exporters, etc., together, including the new Camunda Exporter, which exports all necessary data to the secondary storage. On bootstrap, we need to create the expected schema so our components work as expected, allowing the Operate and Tasklist web apps to consume the data and the exporter to export correctly. Furthermore, we have a new query API (REST API) that allows searching the data available in the secondary storage.

We have seen in previous experiments and load tests that an unavailable ELS and improperly configured replicas can cause issues like the exporter not catching up or queries not succeeding. See the related GitHub issue.

In today's chaos day, we want to play around with the replicas setting of the indices, which can be set in the Camunda Exporter (the component in charge of writing the data to the secondary storage).

TL;DR; Without index replicas set, the Camunda Exporter is directly impacted by ELS node restarts. The query API seems to handle this transparently, but the resulting data changes. Setting replicas causes some performance impact, as the ELS nodes may run into CPU throttling (they have much more to do). ELS slowing down impacts processing as well, due to our write throttling mechanics. This means we need to be careful with this setting: while it gives us better availability (the CamundaExporter can continue when ELS nodes restart), it comes at some cost.

Dynamic Scaling: probing linear scalability

· 7 min read
Carlo Sana
Senior Software Engineer @ Zeebe

Hypothesis

The objective of this chaos day is to estimate the scalability of Zeebe when brokers and partitions are scaled together: we expect to see the system scale linearly with the number of brokers/partitions in terms of throughput and backpressure, while maintaining predictable latency.

General Experiment setup

To test this, we ran a benchmark using the latest alpha version of Camunda, 8.8.0-alpha6, with the old ElasticsearchExporter disabled and the new CamundaExporter enabled. We also made sure Raft leadership was balanced before starting the test, meaning each broker is leader for exactly one partition, and we turned on partition scaling by adding the following environment variable:

  • ZEEBE_BROKER_EXPERIMENTAL_FEATURES_ENABLEPARTITIONSCALING=true

Each broker also has an SSD-class volume with 32 GB of disk space, limiting them to a few thousand IOPS. The processing load was 150 processes per second, with a large payload of 32 KiB each. Each process instance has a single service task:

one-task

The processing load is generated by our own benchmarking application.

Initial cluster configuration

To test this hypothesis, we will start with a standard configuration of the Camunda orchestration cluster:

  • 3 nodes
  • 3 partitions
  • CPU limit: 2
  • Memory limit: 2 GB

We will increase the load through a load generator in fixed increments until the nodes start showing constant non-zero backpressure, which is a sign that the system has hit its throughput limits.

Target cluster configuration

Once that throughput limit is reached, we will scale brokers & partitions while the cluster is under load to the new target values:

  • 6 nodes
  • 6 partitions
  • CPU limit: 2
  • Memory limit: 2 GB

Experiment

We expect that during the scaling operation the backpressure/latencies might worsen, but only temporarily: once the scaling operation has completed, the additional load it generates is no longer present.

Then, we will execute the same procedure as above until we hit 2x the critical throughput reached before.

Expectation

If the system scales linearly, we expect to see similar levels of performance metrics for similar values of the ratio of PIs (created/completed) per second to the number of partitions.
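Concretely, the expectation boils down to the per-partition rate staying constant. Using the critical throughputs that show up later in this post (240 PI/s on 3 partitions, 480 PI/s on 6):

```python
def per_partition_rate(throughput_pi_s: float, partitions: int) -> float:
    """Average PIs per second handled by each partition."""
    return throughput_pi_s / partitions

# Critical throughput before scaling (3 partitions) vs. after (6 partitions):
small = per_partition_rate(240, 3)  # 80.0 PI/s per partition
large = per_partition_rate(480, 6)  # 80.0 PI/s per partition
```

Equal ratios at both cluster sizes are what "linear scaling" means here; a dropping ratio would indicate per-partition overhead growing with cluster size.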

Steady state

The system is started with a throughput of 150 Process instances created per second. As this is a standard benchmark configuration, nothing unexpected happens:

  • The same number of process instances are completed as the ones created
  • The expected number of jobs is completed per unit of time

At this point, we have the following topology:

initial-topology

First benchmark: 3 brokers and 3 partitions

Let's start increasing the load incrementally, by adding 30 Process instances/s for every step.

| Time | Brokers | Partitions | Throughput | CPU Usage | Throttling (CPU) | Backpressure |
|------|---------|------------|------------|-----------|------------------|--------------|
| 09:30 | 3 | 3 | 150 PI/s, 150 jobs/s | 1.28 / 1.44 / 1.02 | 12% / 7% / 1% | 0 |
| 09:49 | 3 | 3 | 180 PI/s, 180 jobs/s | 1.34 / 1.54 / 1.12 | 20% / 17% / 2% | 0 |
| 10:00 | 3 | 3 | 210 PI/s, 210 jobs/s | 1.79 / 1.62 / 1.33 | 28% / 42% / 4% | 0 |
| 10:12 | 3 | 3 | 240 PI/s, 240 jobs/s | 1.77 / 1.95 / 1.62 | 45% / 90% / 26% | 0/0.5% |

At 240 Process Instances spawned per second, the system starts to hit its limits:

CPU usage @ 240 PI/s

CPU throttling @ 240 PI/s

And the backpressure is no longer zero:

Backpressure @ 240 PI/s

  • The CPU throttling reaches almost 90% on one node (probably caused by only one node being selected as the gateway, as previously noted)
  • Backpressure is now constantly above zero; even at just 0.5%, it's a sign that we are reaching the throughput limits.

Second part of the benchmark: scaling to 6 brokers and 6 partitions

With 240 process instances per second being spawned, we send the commands to scale the cluster.

We first scale the zeebe statefulset to 6 brokers. As soon as the new brokers are running, even before they are healthy, we can send the command to include them in the cluster and to increase the number of partitions to 6.

This can be done following the guide in the official docs.

Once the scaling has been completed, as can be seen from the Cluster operation section in the dashboard, we see the newly created partitions participate in the workload.

We now have the following topology:

six-partitions-topology

Let's start increasing the load incrementally, as we did with the other cluster configuration.

| Time | Brokers | Partitions | Throughput | CPU Usage | Throttling (CPU) | Backpressure | Notes |
|------|---------|------------|------------|-----------|------------------|--------------|-------|
| 10:27 | 6 | 6 | 240 PI/s | 0.92 / 1.26 / 0.74 / 0.94 / 0.93 / 0.93 | 2.8 / 6.0 / 0.3 / 2.8 / 3.4 / 3.18 | 0 | After scale up |
| 11:05 | 6 | 6 | 300 PI/s | 1.17 / 1.56 / 1.06 / 1.23 / 1.19 / 1.18 | 9% / 29% / 0.6% / 9% / 11% / 10% | 0 | Stable |
| 11:10 | 6 | 6 | 360 PI/s | 1.39 / 1.76 / 1.26 / 1.43 / 1.37 / 1.42 | 19% / 42% / 2% / 16% / 21% / 22% | 0 | Stable |
| 11:10 | 6 | 6 | 420 PI/s | 1.76 / 1.89 / 1.50 / 1.72 / 1.50 / 1.70 | 76% / 84% / 52% / 71% / 60% / 65% | 0 (spurts on 1 partition) | Pushing hard |

However, at 11:32 one of the workers restarted, causing a dip in processing: jobs were yielded back to the engine, fewer jobs were activated, and thus fewer were completed. This caused a job backlog to build up in the engine. Once the worker came back, the backlog was drained, leading to a spike in job completion requests: around 820 req/s, as opposed to the expected 420 req/s.
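A back-of-the-envelope model reproduces the magnitude of that spike (the downtime and drain window below are illustrative assumptions, not measured values):

```python
def completion_rate_after_restart(create_rate, downtime_s, drain_s):
    """Average job-completion request rate while a restarted worker drains
    its backlog: new jobs still arrive at `create_rate`, and the backlog
    accumulated during `downtime_s` is worked off over `drain_s` seconds."""
    backlog = create_rate * downtime_s
    return create_rate + backlog / drain_s

# e.g. ~1 minute of downtime drained over ~1 minute roughly doubles the load:
rate = completion_rate_after_restart(create_rate=420, downtime_s=60, drain_s=63)  # -> 820.0
```

The longer the worker stays down, or the faster it drains, the higher the transient request rate the cluster has to absorb on top of the steady-state load.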

Because of this extra load, the cluster started to consume even more CPU, resulting in heavy CPU throttling from the cloud provider.

CPU usage @ 420 PI/s

CPU throttling @ 420 PI/s

On top of this, eventually a broker restarted (most likely because we run on spot VMs). In order to continue with our test, we scaled the load down to 60 PI/s to give the cluster time to heal.

Once the cluster was healthy again, we raised the throughput back to 480 PI/s to verify the scalability with twice as much throughput as the initial configuration.

The cluster was able to sustain 480 process instances per second with levels of backpressure similar to the initial configuration:

Backpressure @ 480 PI/s

We can see below that CPU usage is high, and there is still some throttling, indicating we might be able to do more with a little bit of vertical scaling, or by scaling out and reducing the number of partitions per broker:

CPU usage @ 480 PI/s

CPU throttling

Conclusion

We were able to verify that the cluster can scale almost linearly with new brokers and partitions, as long as the other components, like the secondary storage, workers, connectors, etc., are able to sustain a similar throughput.

In particular, making sure that the secondary storage is able to keep up with the throughput turned out to be crucial to keeping the cluster stable and avoiding filling up the Zeebe disks, which would bring the cluster to a halt.

We encountered a similar issue when a worker restarts: initially a backlog of unhandled jobs builds up, which turns into a massive increase in requests per second when the worker comes back, as it activates jobs faster than the cluster can complete them.

Finally, with this specific test, it would be interesting to explore the limits of vertical scalability, as we often saw CPU throttling being a major blocker for processing. This would make for an interesting future experiment.

Follow up REST API performance

· 26 min read
Christopher Kujawa
Principal Software Engineer @ Camunda

Investigating REST API performance

This post collates the experiments, findings, and lessons learned during the REST API performance investigation.

There wasn't one explicit root cause identified. As is often the case with such performance issues, it is a combination of several things.

Quintessence: the REST API is more CPU-intensive than gRPC. You can read more about this in the conclusion. We discovered ~10 issues to follow up on, of which at least 2-3 might have a significant impact on performance. Details can be found in the Discovered issues section.

Performance of REST API

· 8 min read
Christopher Kujawa
Principal Software Engineer @ Camunda

In today's Chaos day, we wanted to experiment with the new REST API (v2) as a replacement for our previously used gRPC API.

By default, our load tests use gRPC, but as we want to make the REST API the default and fully release it with 8.8, we want to make sure to test it accordingly with regard to reliability.

TL;DR; We observed a severe performance regression when using the REST API, even when job streaming (over gRPC) is in use by the job workers. Our client also seems to have higher memory consumption, which caused some instabilities in our tests. With the new API, we lack certain observability, which makes it harder to dive into details. We should investigate this further and find potential bottlenecks and improvements.

general