We experimented with the first version of dynamic scaling in Zeebe, adding or removing brokers for a running cluster.
Scaling up and down is a high-level operation that consists of many steps that need to be carried co-operatively by all brokers in the cluster.
For example, adding new brokers first adds them to the replication group of the assigned partitions and then removes some of the older brokers from the replication group.
Additionally, priorities need to be reconfigured to ensure that the cluster approaches balanced leadership eventually.
This orchestration over multiple steps ensures that all partitions are replicated by at least as many brokers as configured with the replicationFactor
.
As always, when it comes to orchestrating distributed systems, there are many edge cases and failure modes to consider.
The goal of this experiment was to verify that the operation is resilient to broker restarts.
We can accept that operations take longer than usual to complete, but we need to make sure that the operation eventually succeeds with the expected cluster topology as result.
TL;DR; Both scaling up and down is resilient to broker restarts, with the only effect that the operation takes longer than usual to complete.