15 posts tagged with "performance"

Operate load handling

August 16, 2024 · 8 min read

Principal Software Engineer @ Camunda

🎉 Happy to announce that we are broadening the scope of our Chaos days, to look holistically at the whole Camunda Platform, starting today. In the past Chaos days we often had a close look (or concentrated mostly) at Zeebe performance and stability.

Today, we will look at the Operate import performance and how Zeebe processing throughput might affect (or not?) the throughput and latency of the Operate import. Is it decoupled as we thought?

The import time is an important metric, representing the time until data from Zeebe processing is visible to the User (excluding Elasticsearch's indexing). It is measured from when the record is written to the log, by the Zeebe processor, until Operate reads/imports it from Elasticsearch and converts it into its data model. We got much feedback (and experienced this on our own) that Operate is often lagging behind or is too slow, and of course we want to tackle and investigate this further.

The results from this Chaos day and related benchmarks should allow us to better understand how the current importing of Operate performs, and what its affects. Likely it will be a series of posts to investigate this further. In general, the data will give us some guidance and comparable numbers for the future to improve the importing time. See also related GitHub issue #16912 which targets to improve such.

TL;DR; We were not able to show that Zeebe throughput doesn't affect Operate importing time. We have seen that Operate can be positively affected by the throughput of Zeebe. Surprisingly, Operate was faster to import if Zeebe produced more data (with a higher throughput). One explanation of this might be that Operate was then less idle.

Broker Scaling and Performance

December 20, 2023 · 6 min read

Lena Schönburg

Senior Software Engineer @ Zeebe

Deepthi Akkoorath

Principal Software Engineer @ Camunda

With Zeebe now supporting the addition and removal of brokers to a running cluster, we wanted to test three things:

Is there an impact on processing performance while scaling?
Is scaling resilient to high processing load?
Can scaling up improve processing performance?

TL;DR; Scaling up works even under high load and has low impact on processing performance. After scaling is complete, processing performance improves in both throughput and latency.

Dynamically scaling brokers

December 18, 2023 · 6 min read

Lena Schönburg

Senior Software Engineer @ Zeebe

We experimented with the first version of dynamic scaling in Zeebe, adding or removing brokers for a running cluster.

Scaling up and down is a high-level operation that consists of many steps that need to be carried co-operatively by all brokers in the cluster. For example, adding new brokers first adds them to the replication group of the assigned partitions and then removes some of the older brokers from the replication group. Additionally, priorities need to be reconfigured to ensure that the cluster approaches balanced leadership eventually.

This orchestration over multiple steps ensures that all partitions are replicated by at least as many brokers as configured with the replicationFactor. As always, when it comes to orchestrating distributed systems, there are many edge cases and failure modes to consider.

The goal of this experiment was to verify that the operation is resilient to broker restarts. We can accept that operations take longer than usual to complete, but we need to make sure that the operation eventually succeeds with the expected cluster topology as result.

TL;DR; Both scaling up and down is resilient to broker restarts, with the only effect that the operation takes longer than usual to complete.

Worker count should not impact performance

November 24, 2021 · 3 min read

Christopher Kujawa

Principal Software Engineer @ Camunda

In this chaos day we experimented with the worker count, since we saw recently that it might affect the performance (throughput) negatively if there are more workers deployed. This is related to #7955 and #8244.

We wanted to prove, that even if we have more workers deployed the throughput of the process instance execution should not have an negative impact.

TL;DR; We were not able to prove our hypothesis. Scaling of workers can have a negative impact on performance. Check out the third chaos experiment.

Throughput on big state

October 29, 2021 · 4 min read

Christopher Kujawa

Principal Software Engineer @ Camunda

In this chaos day we wanted to prove the hypothesis that the throughput should not significantly change even if we have bigger state, see zeebe-chaos#64

This came up due observations from the last chaos day. We already had a bigger investigation here zeebe#7955.

TL;DR; We were not able to prove the hypothesis. Bigger state, more than 100k+ process instances in the state, seems to have an big impact on the processing throughput.