A while ago we have written an experiment, which should verify that followers are not able to become leader, if they have a corrupted snapshot. You can find that specific experiment here. This experiment was executed regularly against Production-M and Production-S Camunda Cloud cluster plans. With the latest changes, in the upcoming 1.0 release, we changed some behavior in regard to detect snapshot corruption on followers.
NEW If a follower is restarted and has a corrupted snapshot it will detect it on bootstrap and will refuse to
start related services and crash. This means the pod will end in a crash loop, until this is manually fixed.
OLD The follower only detects the corrupted snapshot on becoming leader when opening the database. On the restart of a follower this will not be detected.
The behavior change caused to fail our automated chaos experiments, since we corrupt the snapshot on followers and on a later experiment we restart followers. For this reason we had to disable the execution of the snapshot corruption experiment, see related issue
zeebe-io/zeebe-cluster-testbench#303.
In this chaos day we wanted to investigate whether we can improve the experiment and bring it back. For reference, I also opened a issue to discuss the current corruption detection approach zeebe#6907