35 posts tagged with "availability"

Deployment Distribution

January 26, 2021 · 11 min read

Chaos Engineer @ Zeebe

On this chaos day we wanted to experiment a bit with deployment's and there distribution.

We run a chaos experiment with deploying multiple workflows and disconnecting two leaders. We verified whether deployments are distributed later. The chaos experiment was successful and showed a bit how fault tolerant deployment distribution is. 💪

Network partitions

January 19, 2021 · 8 min read

Christopher Kujawa

Chaos Engineer @ Zeebe

As you can see, I migrated the old chaos day summaries to github pages, for better readability. I always wanted to play around with github pages and jekyll so this was a good opportunity. I hope you like it. 😄

On the last Chaos Day, we experimented with disconnecting a Leader and one follower. We expected no bigger disturbance, since we still have quorum and can process records. Today I want to experiment with bigger network partitions.

In the first chaos experiment: I had a cluster of 5 nodes and split that into two groups, the processing continued as expected, since we had still quorum. 💪
In the second chaos experiment: I split the cluster again into two groups, but this time we added one follower of the bigger group to the smaller group after snapshot was taken and compaction was done. The smaller group needed to keep up with the newer state, before new processing can be started again, but everything worked fine.

Disconnect Leader and one Follower

January 7, 2021 · 8 min read

Christopher Kujawa

Chaos Engineer @ Zeebe

Happy new year everyone 🎉

This time I wanted to verify the following hypothesis Disconnecting Leader and one Follower should not make cluster disruptive (#45). But in order to do that we need to extract the Leader and Follower node for a partition from the Topology. Luckily in December we got an external contribution which allows us to print zbctl status as json. This gives us now more possibilities, since we can extract values much better out of it.

TL;DR The experiment was successful 👍

Message Correlation after Failover

November 24, 2020 · 4 min read

Christopher Kujawa

Chaos Engineer @ Zeebe

Today I wanted to finally implement an experiment which I postponed for long time, see #24. The problem was that previous we were not able to determine on which partition the message was published, so we were not able to assert that it was published on the correct partition. With this #4794 it is now possible, which was btw an community contribution. 🎉

Many Job Timeouts

November 11, 2020 · 4 min read

Christopher Kujawa

Chaos Engineer @ Zeebe

In the last game day (on friday 06.11.2020) I wanted to test whether we can break a partition if many messages time out at the same time. What I did was I send many many messages with a decreasing TTL, which all targeting a specific point in time, such that they will all timeout at the same time. I expected that if this happens that the processor will try to time out all at once and break because the batch is to big. Fortunately this didn't happen, the processor was able to handle this.

I wanted to verify the same with job time out's.