Broker Scaling and Performance
With Zeebe now supporting the addition and removal of brokers to a running cluster, we wanted to test three things:
- Is there an impact on processing performance while scaling?
- Is scaling resilient to high processing load?
- Can scaling up improve processing performance?
TL;DR: Scaling up works even under high load and has a low impact on processing performance. After scaling is complete, processing performance improves in both throughput and latency.
Impact of scaling on processing performance
Scaling up or down is an expensive operation: partition data is transferred between brokers, and leadership for partitions changes. We wanted to test how much impact this has on regular processing performance.
To do this, we ran a benchmark with 3 brokers, 6 partitions and replication factor 3.
The brokers are limited to 1.35 CPUs and 4GiB RAM each. They run with additional safety checks that are usually disabled in production and that slightly decrease the baseline processing performance. Each broker uses a small 32GiB SSD for storage, limiting them to a few thousand IOPS.
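To get a feeling for how much partition data has to move in this setup, the replica placement before and after scaling can be sketched in a few lines. This is a rough illustration, not code from the benchmark: it assumes a simple round-robin placement in which partition p is replicated on the next `replication_factor` brokers, and the helper names `distribute` and `replicas_per_broker` are made up for this example.

```python
from collections import Counter


def distribute(partitions, brokers, replication_factor):
    """Assumed round-robin placement: partition p (1-based) lives on
    brokers (p - 1), p, ... modulo the cluster size."""
    return {
        p: {(p - 1 + i) % brokers for i in range(replication_factor)}
        for p in range(1, partitions + 1)
    }


def replicas_per_broker(distribution):
    """Count how many partition replicas each broker hosts."""
    return Counter(b for members in distribution.values() for b in members)


# Benchmark setup from the text: 6 partitions, replication factor 3.
before = distribute(partitions=6, brokers=3, replication_factor=3)
after = distribute(partitions=6, brokers=6, replication_factor=3)

# Replicas that exist after scaling but not before must be created
# on a broker that did not host that partition previously.
joined = sum(len(after[p] - before[p]) for p in after)

print("replicas per broker before:", dict(replicas_per_broker(before)))
print("replicas per broker after: ", dict(replicas_per_broker(after)))
print("replicas that need to be transferred:", joined)
```

Under this assumption, each of the 3 brokers initially hosts a replica of all 6 partitions. After scaling to 6 brokers, each broker hosts 3 replicas, and 9 replicas have to be built up on brokers that did not hold them before.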
The processing load was 150 process instances per second, each with a large payload of 32KiB. Each process instance has a single service task.
The processing load is generated by our own benchmarking application.
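As a back-of-the-envelope check of what this load means for the brokers, the payload volume can be estimated from the numbers above. This is only a rough lower bound under a simplifying assumption: it counts each payload once per created process instance and once per replica, and ignores all other records and overhead that are written while processing.

```python
# Numbers taken from the benchmark description.
INSTANCES_PER_SECOND = 150
PAYLOAD_KIB = 32
REPLICATION_FACTOR = 3

# Payload volume entering the cluster per second.
payload_kib_per_second = INSTANCES_PER_SECOND * PAYLOAD_KIB
payload_mib_per_second = payload_kib_per_second / 1024

# Assumption: every payload is persisted on each replica of its partition.
replicated_mib_per_second = payload_mib_per_second * REPLICATION_FACTOR

print(f"payload ingress: {payload_mib_per_second:.2f} MiB/s")
print(f"written across all replicas: {replicated_mib_per_second:.2f} MiB/s")
```

That is roughly 4.7 MiB/s of payload entering the cluster, or about 14 MiB/s once replication is taken into account, which is a noticeable load for small disks limited to a few thousand IOPS.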
Expected
When we scale up from 3 to 6 brokers, we expect a small impact on processing performance. Request latency may increase slightly, some requests may time out and some will be rejected due to backpressure. The overall throughput in terms of created and completed process instances as well as jobs may similarly decrease slightly.