Gateway Termination
In today's chaos day, we wanted to experiment with the gateway and resiliency of workers.
In recent weeks, we have seen some issues in our benchmarks when gateways were restarted, see zeebe#11975.
We did a similar experiment in the past. Today we want to focus on self-managed (benchmarks with our helm charts). Ideally, we can automate this soon as well.
Today Nicolas joined me on the chaos day.
TL;DR; We were able to show that the workers (clients) can reconnect after a gateway is shut down. Furthermore, we have discovered a potential performance issue at lower load, which impacts process execution latency (zeebe#12311).
Chaos Experiment
We will use our Zeebe benchmark helm charts to set up the test cluster, and our helper scripts here.
Setup:
We will run with the default benchmark configuration, which means:
- three brokers
- three partitions
- replication count three
- two gateways
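To make the topology above concrete, here is a minimal sketch of how three partitions with replication count three spread over three brokers. It assumes a simple round-robin placement purely for illustration; the function name `distribute` and the 0-based broker ids are hypothetical and not taken from the benchmark setup. Whatever the placement scheme, with three brokers and replication count three every broker ends up hosting a replica of every partition.

```python
def distribute(brokers, partitions, replication):
    """Map each partition id (1-based) to the broker ids (0-based) hosting it.

    Assumption: replicas are placed round-robin, starting at a different
    broker for each partition.
    """
    return {
        p: [(p - 1 + offset) % brokers for offset in range(replication)]
        for p in range(1, partitions + 1)
    }


# Default benchmark configuration: 3 brokers, 3 partitions, replication count 3.
layout = distribute(brokers=3, partitions=3, replication=3)
for partition, members in layout.items():
    print(f"partition {partition} -> brokers {members}")
```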
We will run the benchmark with a low load: 10 process instances are created and completed per second. For that, we deploy one starter and one worker. This reduces the blast radius and makes it easier to observe how the worker behaves when a gateway is restarted.
During the experiment, we will use our Grafana dashboard to observe which gateway the worker connects to, and therefore which gateway we need to stop/restart.
LAST DEPLOYED: Thu Apr 6 10:21:27 2023
NAMESPACE: zell-chaos
STATUS: deployed
REVISION: 1
NOTES:
# Zeebe Benchmark
Installed Zeebe cluster with:
* 3 Brokers
* 2 Gateways
The benchmark is running with:
* Starter replicas=1
* Worker replicas=1
* Publisher replicas=0
* Timer replicas=0
Expected
When we terminate the gateway to which the worker is connected, we expect the worker to connect to the other replica and start completing jobs again.
We expect the performance drop to be insignificant, or at least to recover quickly.
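The expected behavior can be sketched as a small simulation. This is not the Zeebe client API: `FakeGateway`, `GatewayDown`, and `run_worker` are hypothetical names used only to illustrate a worker that fails over to the other gateway replica when its current one is terminated.

```python
class GatewayDown(Exception):
    """Raised when the simulated gateway is not reachable."""


class FakeGateway:
    """Hypothetical stand-in for one gateway replica."""

    def __init__(self, name):
        self.name = name
        self.alive = True

    def activate_job(self):
        if not self.alive:
            raise GatewayDown(self.name)
        return f"job-via-{self.name}"


def run_worker(gateways, polls, on_poll=None):
    """Poll `polls` times, switching to the next replica on connection errors."""
    completed = []
    current = 0
    for poll_index in range(polls):
        if on_poll:
            on_poll(poll_index)  # hook used to inject the chaos action
        # Try each replica at most once per poll, starting with the current one.
        for _ in range(len(gateways)):
            try:
                completed.append(gateways[current].activate_job())
                break
            except GatewayDown:
                current = (current + 1) % len(gateways)
    return completed


gateways = [FakeGateway("gateway-0"), FakeGateway("gateway-1")]


def terminate_first_gateway(poll_index):
    # Chaos action: take down the gateway the worker is connected to.
    if poll_index == 2:
        gateways[0].alive = False


jobs = run_worker(gateways, polls=4, on_poll=terminate_first_gateway)
print(jobs)
```

In this sketch the first two jobs go through `gateway-0` and the remaining two through `gateway-1`, with no poll lost. The real experiment checks whether the actual worker recovers similarly, and how quickly.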