High Snapshot Frequency
Today we wanted to experiment with the snapshot interval and verify that a high snapshot frequency will not impact our availability (#21).
TL;DR; The chaos experiment succeeded 💪 We were able to prove our hypothesis.
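For context, the snapshot interval we experimented with is a plain broker setting; a minimal sketch, assuming the usual environment-variable binding of `zeebe.broker.data.snapshotPeriod` (the concrete value below is only an illustration, not the value used in the experiment):

```bash
# Shorten the snapshot interval on the broker; 1m is just an example value.
export ZEEBE_BROKER_DATA_SNAPSHOTPERIOD=1m
```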
New Year 🎉 New Chaos
This time I wanted to experiment with "big" variables. Zeebe supports a maxMessageSize
of 4 MB, which is quite large. In general, it should be clear that using big variables will cause performance issues, but today I also wanted to find out whether the system can handle big variables (~1 MB) at all.
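For reference, a minimal sketch of such a setup, assuming the standard `zeebe.broker.network.maxMessageSize` setting and `zbctl`; the process id and payload file below are placeholders:

```bash
# maxMessageSize defaults to 4MB on the broker.
export ZEEBE_BROKER_NETWORK_MAXMESSAGESIZE=4MB

# Create an instance with a ~1 MB JSON variable payload (placeholder file and id).
zbctl create instance bigVariableProcess \
  --variables "$(cat big-payload-1mb.json)" --insecure
```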
TL;DR; Our chaos experiment failed! Zeebe and Camunda Cloud are not able (by default) to handle big variables (~1 MB) without issues.
In the last quarter we worked on a new "feature" called "building state on followers". In short, it means that followers apply the events to build their own state, which makes regular snapshot replication unnecessary and allows a faster Follower-to-Leader role transition. On this chaos day I wanted to experiment a bit with this property; we already did some benchmarks here. Today I want to see how it behaves with larger state (bigger snapshots), since in previous versions of Zeebe the snapshot had to be copied to the followers and the broker had to replay more events than with the newest version.
If you want to know more about building state on followers, check out the ZEP.
TL;DR; In our experiment we had almost no downtime; with version 1.2, the new leader was able to pick up the next work (accept new commands) very quickly.
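A simple way to observe the Follower-to-Leader transition during such an experiment is the cluster topology; a minimal sketch, assuming `zbctl` points at the cluster's gateway:

```bash
# Print broker roles and partition health; after restarting the leader pod,
# the new LEADER should show up here almost immediately.
zbctl status --insecure --address localhost:26500
```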
It has been a while since the last post; I'm happy to be back.
In today's chaos day we want to verify the hypothesis from zeebe-chaos#34 that old clients can't disrupt a running cluster.
It might happen that after upgrading Zeebe to the newest shiny version, you forget to update some of your workers, starters, etc. This should normally not be an issue, since Zeebe is backwards compatible client-wise since 1.x. But what happens when older clients are used? Old clients should not have a negative effect on a running cluster.
TL;DR; Older clients (0.26) have no negative impact on a running cluster (1.2), and clients from 1.x onwards still work with the latest version.
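A rough sketch of the client side of this experiment; `zbctl-0.26` stands in for wherever an old 0.26 client binary lives, and the process id `benchmark` is a placeholder:

```bash
# Old client reading the topology of the 1.2 cluster.
zbctl-0.26 status --insecure --address "${GATEWAY_ADDRESS}"

# Old "starter": create instances with the 0.26 client against the new cluster.
zbctl-0.26 create instance benchmark --insecure --address "${GATEWAY_ADDRESS}"
```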
On a previous chaos day we played around with ToxiProxy, which allows injecting failures on the network level, for example dropping packets or adding latency.
Last week @Deepthi mentioned to me that we can do similar things with tc, a built-in Linux command. Today I wanted to experiment with latency between leader and followers using tc.
TL;DR; The experiment failed; by adding 100ms of network delay to the Leader we broke the complete processing throughput. 💥
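The delay itself can be injected with netem; a minimal sketch, assuming it runs inside the leader pod and the interface is `eth0`:

```bash
# Add 100ms of delay to all outgoing traffic of the leader.
tc qdisc add dev eth0 root netem delay 100ms

# Remove the delay again after the experiment.
tc qdisc del dev eth0 root netem
```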
On this chaos day we wanted to experiment with OOD recovery and ELS connection issues. This is related to the following issues from our hypothesis backlog: zeebe-chaos#32 and zeebe-chaos#14. This time @Nico joined me.
TL;DR; The experiment was successful 💪 and we found several things in the dashboard which we can improve :)
A while ago we wrote an experiment which should verify that followers are not able to become leader if they have a corrupted snapshot. You can find that specific experiment here. This experiment was executed regularly against Production-M and Production-S Camunda Cloud cluster plans. With the latest changes in the upcoming 1.0 release, we changed some behavior with regard to detecting snapshot corruption on followers.
NEW: If a follower is restarted and has a corrupted snapshot, it will detect this on bootstrap, refuse to start the related services, and crash. This means the pod will end up in a crash loop until this is manually fixed.
OLD: The follower only detects the corrupted snapshot when becoming leader and opening the database. On a restart of the follower this is not detected.
This behavior change caused our automated chaos experiments to fail, since we corrupt the snapshot on followers and restart followers in a later experiment. For this reason we had to disable the execution of the snapshot corruption experiment; see the related issue zeebe-io/zeebe-cluster-testbench#303.
On this chaos day we wanted to investigate whether we can improve the experiment and bring it back. For reference, I also opened an issue to discuss the current corruption detection approach: zeebe#6907.
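For illustration, corrupting a follower's snapshot in such an experiment boils down to damaging a file inside the latest snapshot directory; the pod name, partition id, and data path below are assumptions:

```bash
# Truncate the snapshot's SST files on a follower (placeholders: pod, partition, path).
kubectl exec "$FOLLOWER_POD" -- bash -c \
  'truncate -s 0 data/raft-partition/partitions/1/snapshots/*/*.sst'
```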
Today I wanted to add another chaos experiment to grow our collection of automated chaos experiments. This time we will deploy a process model (with a timer start event), restart a node, and complete the process instance via zbctl.
TL;DR; I was able to create the Chaos Toolkit experiment. It shows that we are able to restore our state after fail-over, which means we can trigger timer start events to create process instances even if they were deployed before the fail-over, and we are able to complete these instances.
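The manual steps behind the new experiment look roughly like this; the file name, pod name, and job type are placeholders, and the commands assume a 1.x `zbctl`:

```bash
# Deploy a model with a timer start event.
zbctl deploy timer-start-process.bpmn --insecure

# Restart one broker pod.
kubectl delete pod zeebe-0

# After fail-over the timer should still fire and create instances;
# complete the resulting jobs with a no-op worker.
zbctl create worker timer-task --handler cat --insecure
```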
On this chaos day we wanted to experiment a bit with deployments and their distribution.
We ran a chaos experiment where we deployed multiple workflows and disconnected two leaders, and verified whether the deployments were distributed afterwards. The chaos experiment was successful and showed how fault tolerant deployment distribution is. 💪
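A rough sketch of how such a disconnect-and-deploy experiment looks; the pod names, the peer IP, and the BPMN files are placeholders:

```bash
# Make one leader unreachable from another by dropping the route inside the pod.
kubectl exec zeebe-1 -- ip route add unreachable "$OTHER_LEADER_IP"

# Deploy multiple workflows through the gateway while the leaders are disconnected.
zbctl deploy workflow-1.bpmn --insecure
zbctl deploy workflow-2.bpmn --insecure

# Heal the partition and verify that the deployments get distributed
# to the previously disconnected partition.
kubectl exec zeebe-1 -- ip route del unreachable "$OTHER_LEADER_IP"
```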
As you can see, I migrated the old chaos day summaries to GitHub Pages for better readability. I always wanted to play around with GitHub Pages and Jekyll, so this was a good opportunity. I hope you like it. 😄
On the last chaos day, we experimented with disconnecting a Leader and one follower. We expected no big disturbance, since we still have a quorum and can process records. Today I want to experiment with larger network partitions.