
12 posts tagged with "performance"


Building Confidence at Scale: How Camunda Ensures Platform Reliability Through Continuous Testing

· 8 min read
Christopher Kujawa
Principal Software Engineer @ Camunda

As businesses increasingly rely on process automation for their critical operations, the question of reliability becomes paramount. How can you trust that your automation platform will perform consistently under pressure, recover gracefully from failures, and maintain performance over time?

At Camunda, we've been asking ourselves these same questions for years, and today I want to share how our reliability testing practices have evolved to ensure our platform meets the demanding requirements of enterprise-scale deployments. I will also outline our plans to further invest in this crucial area.

Stress testing Camunda

· 12 min read
Christopher Kujawa
Principal Software Engineer @ Camunda

In today's chaos experiment, we focused on stress-testing the Camunda 8 orchestration cluster under high-load conditions. We simulated a large number of concurrent process instances to evaluate processing performance and system reliability.

Due to our recent work in supporting load tests for different versions, we were able to compare how different Camunda versions handle stress.

TL;DR: Overall, we saw that all versions of the Camunda 8 orchestration cluster (with a focus on processing) are robust and can handle high loads effectively and reliably. In terms of throughput and latency, with similar resource allocation among the brokers, 8.7.x outperforms the other versions. If we take our streamlined architecture into account (which now contains more components in a single application) and align the resources for 8.8.x, it can achieve similar throughput levels as 8.7.x while maintaining significantly lower latency (by a factor of 2). An overview of the results can be found in the Results section below.

info

[Update: 28.11.2025]

After the initial analysis, we conducted further experiments with 8.8 to understand why the measured processing performance was lower compared to 8.7.x. The blog post (including TL;DR) has been updated with the new findings in the section Further Experiments below.

Dynamic Scaling: probing linear scalability

· 7 min read
Carlo Sana
Senior Software Engineer @ Zeebe

Hypothesis

The objective of this chaos day is to estimate the scalability of Zeebe when brokers and partitions are scaled together: we expect the system to scale linearly with the number of brokers/partitions in terms of throughput and backpressure, while maintaining predictable latency.

General experiment setup

To test this, we ran a benchmark using the latest alpha version of Camunda, 8.8.0-alpha6, with the old ElasticsearchExporter disabled and the new CamundaExporter enabled. We also made sure Raft leadership was balanced before starting the test (meaning each broker is leader of exactly one partition), and we turned on partition scaling by adding the following environment variable:

  • ZEEBE_BROKER_EXPERIMENTAL_FEATURES_ENABLEPARTITIONSCALING=true

Each broker also has an SSD-class volume with 32GB of disk space, limiting them to a few thousand IOPS. The processing load was 150 processes per second, with a large payload of 32KiB each. Each process instance has a single service task:

one-task

The processing load is generated by our own benchmarking application.
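The benchmarking application itself is not shown here, but to make the setup more concrete, a minimal load generator along these lines could be built with the Zeebe Java client. This is only an illustrative sketch; the gateway address, the process ID one-task, and the job type benchmark-task below are placeholder assumptions, not the actual benchmark's values:

```java
import io.camunda.zeebe.client.ZeebeClient;

import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class LoadGenerator {
  public static void main(String[] args) throws InterruptedException {
    // Assumes the "one-task" process model is already deployed and the gateway
    // is reachable locally (e.g. via a port-forward).
    try (ZeebeClient client = ZeebeClient.newClientBuilder()
        .gatewayAddress("localhost:26500")
        .usePlaintext()
        .build()) {

      // ~32 KiB payload per instance, matching the benchmark's large payload
      Map<String, Object> variables = Map.of("payload", "x".repeat(32 * 1024));

      // Worker that completes the single service task of each instance
      client.newWorker()
          .jobType("benchmark-task") // placeholder job type
          .handler((jobClient, job) -> jobClient.newCompleteCommand(job.getKey()).send())
          .open();

      // Create 150 process instances per second (one every ~6.67 ms)
      ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
      scheduler.scheduleAtFixedRate(
          () -> client.newCreateInstanceCommand()
              .bpmnProcessId("one-task") // placeholder process ID
              .latestVersion()
              .variables(variables)
              .send(),
          0, 1_000_000 / 150, TimeUnit.MICROSECONDS);

      Thread.sleep(Long.MAX_VALUE); // keep generating load until the process is killed
    }
  }
}
```

The real benchmarking application additionally tracks latency and backpressure; this sketch only reproduces the shape of the load.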

Initial cluster configuration

To test this hypothesis, we will start with a standard configuration of the Camunda orchestration cluster:

  • 3 nodes
  • 3 partitions
  • CPU limit: 2
  • Memory limit: 2 GB

We will increase the load through a load generator in fixed increments until we start to see the nodes showing constant non-zero backpressure, which is a sign that the system has hit its throughput limits.

Target cluster configuration

Once that throughput limit is reached, we will scale brokers and partitions, while the cluster is under load, to the new target values:

  • 6 nodes
  • 6 partitions
  • CPU limit: 2
  • Memory limit: 2 GB

Experiment

We expect that backpressure and latencies might worsen during the scaling operation, but only temporarily: once the scaling operation has completed, the additional load it generates is gone.

Then we will execute the same procedure as above, until we hit 2x the critical throughput reached before.

Expectation

If the system scales linearly, we expect to see similar levels of the performance metrics for similar values of the ratio of process instances created/completed per second to the number of partitions.
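Put differently, using the numbers from this experiment (our notation):

```latex
\frac{240\ \text{PI/s}}{3\ \text{partitions}} \;=\; \frac{480\ \text{PI/s}}{6\ \text{partitions}} \;=\; 80\ \text{PI/s per partition}
```

so at 480 PI/s on 6 partitions we expect roughly the same backpressure and latency as at 240 PI/s on 3 partitions.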

Steady state

The system is started with a throughput of 150 Process instances created per second. As this is a standard benchmark configuration, nothing unexpected happens:

  • The same number of process instances are completed as the ones created
  • The expected number of jobs is completed per unit of time

At this point, we have the following topology:

initial-topology

First benchmark: 3 brokers and 3 partitions

Let's start increasing the load incrementally, adding 30 process instances/s at each step.

| Time  | Brokers | Partitions | Throughput           | CPU Usage          | Throttling (CPU) | Backpressure |
|-------|---------|------------|----------------------|--------------------|------------------|--------------|
| 09:30 | 3       | 3          | 150 PI/s, 150 jobs/s | 1.28 / 1.44 / 1.02 | 12% / 7% / 1%    | 0            |
| 09:49 | 3       | 3          | 180 PI/s, 180 jobs/s | 1.34 / 1.54 / 1.12 | 20% / 17% / 2%   | 0            |
| 10:00 | 3       | 3          | 210 PI/s, 210 jobs/s | 1.79 / 1.62 / 1.33 | 28% / 42% / 4%   | 0            |
| 10:12 | 3       | 3          | 240 PI/s, 240 jobs/s | 1.77 / 1.95 / 1.62 | 45% / 90% / 26%  | 0 / 0.5%     |

At 240 Process Instances spawned per second, the system starts to hit its limits:

CPU usage @ 240 PI/s
CPU throttling @ 240 PI/s

And the backpressure is not zero anymore: Backpressure @ 240 PI/s

  • The CPU throttling reaches almost 90% on one node (this is probably caused by only one node being selected as gateway, as previously noted)
  • Backpressure is now constantly above zero; even at just 0.5%, it is a sign that we are reaching the throughput limits.

Second part of the benchmark: scaling to 6 brokers and 6 partitions

With 240 process instances per second being spawned, we send the commands to scale the cluster.

We first scale the zeebe statefulset to 6 brokers. As soon as the new brokers are running, even before they are healthy, we can send the command to include them in the cluster and to increase the number of partitions to 6.

This can be done by following the guide in the official docs.
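As a rough illustration of that second step (after scaling the zeebe statefulset itself), the request against the gateway's cluster management endpoint could be issued as sketched below. The management address, endpoint path, and payload shape are assumptions on our part; the official guide linked above is the authoritative reference:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ScaleCluster {
  public static void main(String[] args) throws Exception {
    // Assumes the management port (9600) of one broker/gateway is port-forwarded locally.
    // Endpoint path and payload are assumptions; see the official scaling guide.
    String body = "{\"brokers\": {\"count\": 6}, \"partitions\": {\"count\": 6}}";

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:9600/actuator/cluster"))
        .header("Content-Type", "application/json")
        .method("PATCH", HttpRequest.BodyPublishers.ofString(body))
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());

    // The change is applied asynchronously; its progress can be followed in the
    // cluster topology (e.g. the Cluster operations panel mentioned below).
    System.out.println(response.statusCode() + ": " + response.body());
  }
}
```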

Once the scaling has completed (as can be seen in the Cluster operations section of the dashboard), we see the newly created partitions participating in the workload.

We now have the following topology:

six-partitions-topology

As before, let's increase the load incrementally, following the same procedure as with the previous cluster configuration.

| Time  | Brokers | Partitions | Throughput | CPU Usage                               | Throttling (CPU)                   | Backpressure              | Notes          |
|-------|---------|------------|------------|-----------------------------------------|------------------------------------|---------------------------|----------------|
| 10:27 | 6       | 6          | 240 PI/s   | 0.92 / 1.26 / 0.74 / 0.94 / 0.93 / 0.93 | 2.8 / 6.0 / 0.3 / 2.8 / 3.4 / 3.18 | 0                         | After scale up |
| 11:05 | 6       | 6          | 300 PI/s   | 1.17 / 1.56 / 1.06 / 1.23 / 1.19 / 1.18 | 9% / 29% / 0.6% / 9% / 11% / 10%   | 0                         | Stable         |
| 11:10 | 6       | 6          | 360 PI/s   | 1.39 / 1.76 / 1.26 / 1.43 / 1.37 / 1.42 | 19% / 42% / 2% / 16% / 21% / 22%   | 0                         | Stable         |
| 11:10 | 6       | 6          | 420 PI/s   | 1.76 / 1.89 / 1.50 / 1.72 / 1.50 / 1.70 | 76% / 84% / 52% / 71% / 60% / 65%  | 0 (spurts on 1 partition) | Pushing hard   |

However, at 11:32 one of the workers restarted. Jobs were yielded back to the engine, fewer jobs were activated, and thus fewer were completed, so a job backlog built up in the engine. Once the worker was back up, the backlog was drained, leading to a spike in job completion requests: around 820 req/s, as opposed to the expected 420 req/s.

Because of this extra load, the cluster started to consume even more CPU, resulting in heavy CPU throttling from the cloud provider.

CPU usage @ 420 PI/s
CPU throttling @ 420 PI/s

On top of this, a broker eventually restarted (most likely because we run on spot VMs). To continue with our test, we scaled the load down to 60 PI/s to give the cluster time to heal.

Once the cluster was healthy again, we raised the throughput back to 480 PI/s to verify the scalability with twice as much throughput as the initial configuration.

The cluster was able to sustain 480 process instances per second with backpressure levels similar to those of the initial configuration:

Backpressure @ 480 PI/s

We can see below that CPU usage is high, and there is still some throttling, indicating we might be able to do more with a little bit of vertical scaling, or by scaling out and reducing the number of partitions per broker:

CPU usage @ 480 PI/s
CPU throttling

Conclusion

We were able to verify that the cluster can scale almost linearly with new brokers and partitions, as long as the other components (secondary storage, workers, connectors, etc.) are able to sustain a similar load.

In particular, making sure that the secondary storage is able to keep up with the throughput turned out to be crucial for keeping the cluster stable: otherwise the Zeebe disks fill up, which would bring the cluster to a halt.

We encountered a similar issue when a worker restarts: initially a backlog of unhandled jobs builds up, which turns into a massive increase in requests per second once the worker comes back, as it starts activating jobs faster than the cluster can complete them.

Finally, following this specific test, it would be worthwhile to explore the limits of vertical scalability, as we often saw CPU throttling being a major blocker for processing. This would make for an interesting future experiment.

Impact of Camunda Exporter on processing performance

· 5 min read
Christopher Kujawa
Principal Software Engineer @ Camunda

In our last Chaos day we experimented with the Camunda Exporter MVP. After the MVP we continued with Iteration 2, where we migrated the Archiver deployments and added a new Migration component (which allows us to harmonize indices).

Additionally, some fixes and improvements have been made to the realistic benchmarks, which should allow us to better compare the general performance against a realistic, well-performing benchmark.

This is what we want to explore and experiment with today:

  • Does the Camunda Exporter (since the last benchmark) impact performance of the overall system?
    • If so how?
  • How can we potentially mitigate this?

TL;DR; Today's results showed that enabling the Camunda Exporter causes a 25% drop in processing throughput. We identified the CPU as the bottleneck; this can seemingly be mitigated by either adjusting the CPU requests or removing the ES exporter. With these results, we are equipped to make further investigations and decisions.

Camunda Exporter MVP

· 7 min read
Christopher Kujawa
Principal Software Engineer @ Camunda

After a long pause, I come back with an interesting topic to share and experiment with. Right now we are re-architecting Camunda 8. One important part (which I'm contributing to) is to get rid of the Webapps Importer/Archivers and move data aggregation closer to the engine (inside a Zeebe Exporter).

Today, I want to experiment with the first increment/iteration of our so-called MVP. The MVP targets green field installations where you simply deploy Camunda (with a new Camunda Exporter enabled) without Importers.

TL;DR; All our experiments were successful. The MVP is a success, and we are looking forward to further improvements and additions. Next stop, Iteration 2: adding archiving of historic data and preparing for data migration (and polishing the MVP).

Camunda Exporter

The Camunda Exporter project deserves its own complete blog post; here is just a short summary.

Our current Camunda architecture looks something like this (simplified).

current

It has certain challenges, like:

  • Space: duplication of data in ES
  • Maintenance: duplication of importer and archiver logic
  • Performance: Round trip (delay) of data visible to the user
  • Complexity: installation and operational complexity (we need separate pods to deploy)
  • Scalability: The Importer is not scalable in the same way as Zeebe or brokers (and workload) are.

We obviously wanted to overcome these challenges, and the plan (as mentioned earlier) is to get rid of the need for separate importers and archivers (and, in general, for separate applications; but this is a different topic).

The plan for this project looks something like this:

plan

We plan to:

  1. Harmonize the existing indices stored in Elasticsearch/Opensearch
    • Space: Reduce the unnecessary data duplication
  2. Move importer and archiver logic into a new Camunda exporter
    • Performance: This should allow us to reduce one additional hop (as we don't need to use ES/OS as a queue)
    • Maintenance: Indices and business logic are maintained in one place
    • Scalability: With this approach, we can scale with partitions, as Camunda Exporters are executed for each partition separately (soon partition scaling will be introduced)
    • Complexity: The Camunda Exporter will be built-in and shipped with Zeebe/Camunda 8. No additional pod/application is needed.

Note: Optimize is right now out of scope (due to time), but will later be part of this as well.
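To make the per-partition scaling argument more tangible: a Zeebe exporter is a Java class implementing the exporter API, and one instance of it runs for each partition's stream. The following is a heavily simplified sketch of that API, not the actual CamundaExporter implementation:

```java
import io.camunda.zeebe.exporter.api.Exporter;
import io.camunda.zeebe.exporter.api.context.Context;
import io.camunda.zeebe.exporter.api.context.Controller;
import io.camunda.zeebe.protocol.record.Record;

// Simplified sketch: one exporter instance per partition receives every record of
// that partition's stream, so the aggregation work scales with the partitions.
public class SketchExporter implements Exporter {

  private Controller controller;

  @Override
  public void configure(final Context context) {
    // read exporter arguments, register record filters, validate settings, ...
  }

  @Override
  public void open(final Controller controller) {
    this.controller = controller; // used to acknowledge export progress
  }

  @Override
  public void export(final Record<?> record) {
    // transform the record into the harmonized index documents and batch the writes
    // (the real exporter flushes bulks to Elasticsearch/OpenSearch)

    // acknowledging the position lets Zeebe compact its log up to this point
    controller.updateLastExportedRecordPosition(record.getPosition());
  }

  @Override
  public void close() {
    // flush any remaining batched documents and release resources
  }
}
```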

MVP

Now that we know what we want to achieve, what is the minimum viable product (MVP)?

We have divided the Camunda Exporter into 3-4 iterations. You can see and read more about this here.

The first iteration contains the MVP (the first breakthrough): providing the Camunda Exporter with the basic functionality ported from the Operate and Tasklist importers, writing into harmonized indices.

The MVP is targeting green field installations (clean installations) of Camunda 8 with the Camunda Exporter, without running the old Importer (no data migration yet).

mvp

Optimizing cluster sizing using a real world benchmark

· 6 min read
Rodrigo Lopes
Associate Software Engineer @ Zeebe

The first goal of this experiment is to use a benchmark to derive a new, optimized cluster configuration that can handle at least 100 tasks per second, while maintaining low backpressure and low latency.

For our experiment, we use a newly defined realistic benchmark (with a more complex process model). More about this in a separate blog post.

The second goal is to scale out the optimized cluster configuration's resources linearly and see if the performance scales accordingly.

TL;DR;

We used a realistic benchmark to derive a new cluster configuration based on previous requirements.

When we scale this base configuration linearly we see that the performance increases almost linearly as well, while maintaining low backpressure and low latency.

Improve Operate import latency

· 11 min read
Christopher Kujawa
Principal Software Engineer @ Camunda

In our last Chaos Day we experimented with Operate and different load (Zeebe throughput). We observed that a higher load caused a lower import latency in Operate. The conclusion was that it might be related to Zeebe's exporting configuration, which is affected by a higher load.

In today's chaos day we want to verify how different export and import configurations can affect the importing latency.

TL;DR; We were able to decrease the import latency by ~35% (from 5.7 to 3.7 seconds) by simply reducing the bulk.delay configuration. This worked under low load and even under higher load, without significant issues.

Operate load handling

· 8 min read
Christopher Kujawa
Principal Software Engineer @ Camunda

🎉 Happy to announce that, starting today, we are broadening the scope of our Chaos days to look holistically at the whole Camunda Platform. In past Chaos days we often had a close look at (or concentrated mostly on) Zeebe performance and stability.

Today, we will look at the Operate import performance and how Zeebe processing throughput might affect (or not?) the throughput and latency of the Operate import. Is it decoupled as we thought?

The import time is an important metric, representing the time until data from Zeebe processing is visible to the user (excluding Elasticsearch's indexing). It is measured from when the record is written to the log by the Zeebe processor until Operate reads/imports it from Elasticsearch and converts it into its own data model. We have received a lot of feedback (and experienced this ourselves) that Operate is often lagging behind or too slow, and of course we want to tackle and investigate this further.
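In other words (our notation):

```latex
t_{\text{import}} \;=\; t_{\text{record visible in Operate's data model}} \;-\; t_{\text{record written to the Zeebe log}}
```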

The results from this Chaos day and related benchmarks should allow us to better understand how the current Operate importing performs, and what affects it. This will likely become a series of posts investigating it further. In general, the data will give us some guidance and comparable numbers to improve the importing time in the future. See also the related GitHub issue #16912, which aims to improve this.

TL;DR; We were not able to show that Zeebe throughput doesn't affect Operate importing time. We have seen that Operate can be positively affected by the throughput of Zeebe. Surprisingly, Operate was faster to import if Zeebe produced more data (with a higher throughput). One explanation of this might be that Operate was then less idle.

Broker Scaling and Performance

· 6 min read
Lena Schönburg
Senior Software Engineer @ Zeebe
Deepthi Akkoorath
Senior Software Engineer @ Zeebe

With Zeebe now supporting the addition and removal of brokers to a running cluster, we wanted to test three things:

  1. Is there an impact on processing performance while scaling?
  2. Is scaling resilient to high processing load?
  3. Can scaling up improve processing performance?

TL;DR; Scaling up works even under high load and has low impact on processing performance. After scaling is complete, processing performance improves in both throughput and latency.

Dynamically scaling brokers

· 6 min read
Lena Schönburg
Senior Software Engineer @ Zeebe

We experimented with the first version of dynamic scaling in Zeebe, adding or removing brokers from a running cluster.

Scaling up and down is a high-level operation that consists of many steps which need to be carried out cooperatively by all brokers in the cluster. For example, adding new brokers first adds them to the replication group of the assigned partitions and then removes some of the older brokers from the replication group. Additionally, priorities need to be reconfigured to ensure that the cluster eventually approaches balanced leadership.

This orchestration over multiple steps ensures that all partitions are replicated by at least as many brokers as configured with the replicationFactor. As always, when it comes to orchestrating distributed systems, there are many edge cases and failure modes to consider.

The goal of this experiment was to verify that the operation is resilient to broker restarts. We can accept that operations take longer than usual to complete, but we need to make sure that the operation eventually succeeds with the expected cluster topology as a result.

TL;DR; Both scaling up and down are resilient to broker restarts, with the only effect being that the operation takes longer than usual to complete.