
Camunda Cloud network partition

8 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

This time Deepthi joined me on my regular Chaos Day. 🎉

On the second-to-last Chaos Day I created an automated chaos experiment which verifies that deployments are distributed after a network partition. It later turned out that this only works for our Helm setup, not for Camunda Cloud. The issue was that on our Camunda Cloud Zeebe clusters we had no NET_ADMIN capability to create the IP routes used for the network partitions. After discussing this with our SREs, they proposed a good way to overcome it: when running network-related chaos experiments, we patch the target cluster to add this capability. This means we don't need to add such functionality to Camunda Cloud and the related Zeebe operate/controller. Big thanks to Immi and David for providing this fix.
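To illustrate the approach, here is a minimal sketch of what such a patch could look like, using the Python Kubernetes client. The statefulset name, container name and namespace are assumptions for illustration, not the actual Testbench or controller code.

```python
# Sketch: grant the Zeebe brokers the NET_ADMIN capability so that ip-route
# based network partitions can be created inside the pods.
# Statefulset/container name "zeebe" and the namespace are assumptions.
from kubernetes import client, config

def add_net_admin(namespace: str, statefulset: str = "zeebe") -> None:
    config.load_kube_config()  # use load_incluster_config() when running in-cluster
    apps = client.AppsV1Api()

    # Strategic merge patch: containers are merged by name, so only the
    # securityContext of the "zeebe" container is extended.
    patch = {
        "spec": {
            "template": {
                "spec": {
                    "containers": [{
                        "name": "zeebe",
                        "securityContext": {"capabilities": {"add": ["NET_ADMIN"]}},
                    }]
                }
            }
        }
    }
    apps.patch_namespaced_stateful_set(name=statefulset, namespace=namespace, body=patch)

if __name__ == "__main__":
    add_net_admin(namespace="my-zeebe-cluster")
```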

TL;DR;

We were able to enhance the deployment distribution experiment and run it in Camunda Cloud via Testbench. We have enabled the experiment for the Production M and L cluster plans. To make this work we had to adjust the rights of the Testbench service account.

Fault-tolerant processing of process instances

6 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

Today I wanted to add another chaos experiment to grow our collection of automated chaos experiments. This time we deploy a process model (with a timer start event), restart a node, and complete the resulting process instance via zbctl.

TL;DR;

I was able to create the chaos toolkit experiment. It shows that we are able to restore our state after failover, which means we can trigger timer start events to create process instances even if the process models were deployed before the failover. We are also able to complete these instances.
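For reference, here is a rough sketch of the experiment steps, driven through zbctl and kubectl from Python. The model file name, job type and pod name are illustrative assumptions, not the actual chaos toolkit actions.

```python
# Sketch of the experiment steps, driven via zbctl/kubectl from Python.
# The model file, job type and pod name are illustrative assumptions.
import json
import subprocess

def run(*cmd: str) -> str:
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# 1. Deploy a process model that contains a timer start event.
run("zbctl", "deploy", "timer-start.bpmn", "--insecure")

# 2. Restart one broker by deleting its pod.
run("kubectl", "delete", "pod", "zeebe-0")

# 3. Once the timer has fired and created an instance, complete it via its job.
jobs = json.loads(run("zbctl", "activate", "jobs", "chaos-task", "--insecure"))
for job in jobs.get("jobs", []):
    run("zbctl", "complete", "job", str(job["key"]), "--insecure")
```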

Deployment Distribution

11 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

On this Chaos Day we wanted to experiment a bit with deployments and their distribution.

We ran a chaos experiment in which we deployed multiple workflows while two leaders were disconnected, and verified that the deployments were distributed afterwards. The chaos experiment was successful and showed how fault tolerant deployment distribution is. 💪

Network partitions

8 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

As you can see, I migrated the old Chaos Day summaries to GitHub Pages for better readability. I always wanted to play around with GitHub Pages and Jekyll, so this was a good opportunity. I hope you like it. 😄

On the last Chaos Day, we experimented with disconnecting a Leader and one Follower. We expected no major disturbance, since we still had quorum and could process records. Today I want to experiment with bigger network partitions.

  • In the first chaos experiment I had a cluster of 5 nodes and split it into two groups (see the sketch after this list). Processing continued as expected, since we still had quorum. 💪
  • In the second chaos experiment I split the cluster into two groups again, but this time we moved one follower from the bigger group to the smaller group after a snapshot was taken and compaction was done. The smaller group needed to catch up with the newer state before processing could resume, but everything worked fine.
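The sketch below shows roughly how such a partition can be created once the brokers have the NET_ADMIN capability described above: from inside each pod, routes to the other group's brokers are made unreachable. Pod names and group membership are illustrative assumptions.

```python
# Sketch: disconnect two groups of brokers by installing unreachable routes
# between them (requires the NET_ADMIN capability mentioned above).
# Pod names and group membership are illustrative assumptions.
import subprocess

GROUP_A = ["zeebe-0", "zeebe-1", "zeebe-2"]
GROUP_B = ["zeebe-3", "zeebe-4"]

def pod_ip(pod: str) -> str:
    out = subprocess.run(
        ["kubectl", "get", "pod", pod, "-o", "jsonpath={.status.podIP}"],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()

def disconnect(src: str, dst: str) -> None:
    # Drop all traffic from src to dst by black-holing the destination IP.
    subprocess.run(
        ["kubectl", "exec", src, "--", "ip", "route", "add", "unreachable", pod_ip(dst)],
        check=True,
    )

for a in GROUP_A:
    for b in GROUP_B:
        disconnect(a, b)
        disconnect(b, a)
```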

Disconnect Leader and one Follower

8 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

Happy new year everyone 🎉

This time I wanted to verify the following hypothesis: disconnecting a Leader and one Follower should not make the cluster disruptive (#45). In order to do that we need to extract the Leader and Follower nodes for a partition from the topology. Luckily, in December we got an external contribution which allows us to print zbctl status as JSON. This gives us more possibilities, since we can extract values out of it much more easily.
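As an example of how the JSON output can be used, here is a small sketch that extracts the Leader and the Followers of a partition. The field names are based on my reading of the topology response and may differ slightly between versions.

```python
# Sketch: extract Leader and Followers of a partition from `zbctl status`.
# Field names (brokers -> partitions -> role) reflect the topology response
# as I understand it and may differ slightly between versions.
import json
import subprocess

def topology() -> dict:
    out = subprocess.run(
        ["zbctl", "status", "--output", "json", "--insecure"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(out.stdout)

def nodes_for_partition(partition_id: int):
    leader, followers = None, []
    for broker in topology()["brokers"]:
        for partition in broker["partitions"]:
            if partition["partitionId"] != partition_id:
                continue
            if partition["role"] == "LEADER":
                leader = broker["nodeId"]
            else:
                followers.append(broker["nodeId"])
    return leader, followers

print(nodes_for_partition(1))
```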

TL;DR The experiment was successful 👍

Message Correlation after Failover

4 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

Today I wanted to finally implement an experiment which I had postponed for a long time, see #24. The problem was that previously we were not able to determine on which partition a message was published, so we could not assert that it was published on the correct partition. With #4794, which was by the way a community contribution, this is now possible. 🎉
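The trick is that Zeebe encodes the partition id in the upper bits of every key, so once the publish response returns the message key we can derive the partition it was written to. A minimal sketch, assuming the usual 64-bit key layout with 51 bits reserved for the key itself:

```python
# Sketch: derive the partition id from a Zeebe key, assuming the usual layout
# where the upper bits of the 64-bit key hold the partition id (51 key bits).
KEY_BITS = 51

def partition_of(key: int) -> int:
    return key >> KEY_BITS

# Example: a key that was generated on partition 3.
assert partition_of((3 << KEY_BITS) | 42) == 3
```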

Many Job Timeouts

4 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

In the last Game Day (on Friday, 06.11.2020) I wanted to test whether we can break a partition if many messages time out at the same time. I sent many, many messages with decreasing TTLs, all targeting a specific point in time, so that they would all time out at the same moment. I expected that the processor would then try to time them all out at once and break because the batch is too big. Fortunately this didn't happen; the processor was able to handle it.
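A minimal sketch of that idea: every message gets a TTL computed against the same fixed target instant, so the TTLs decrease over time and all messages expire together. The message name and correlation keys are illustrative assumptions.

```python
# Sketch: publish many messages whose TTLs all expire at the same instant,
# by computing each TTL against one fixed target point in time.
# Message name and correlation keys are illustrative assumptions.
import subprocess
import time

TARGET = time.time() + 300  # all messages should time out five minutes from now

for i in range(1000):
    ttl_ms = max(0, int((TARGET - time.time()) * 1000))
    subprocess.run(
        ["zbctl", "publish", "message", "chaos-message",
         "--correlationKey", str(i), "--ttl", f"{ttl_ms}ms", "--insecure"],
        check=True,
    )
```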

I wanted to verify the same with job timeouts.

Investigate failing Chaos Tests

5 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

Today, as part of the Chaos Day, I wanted to investigate why our current chaos tests are failing and why they have broken their target cluster; see the related issue #5688.

TL;DR

We found three new bugs related to reprocessing detection and deployment distribution, but were still not able to reproduce the real issue.

Non-graceful Shutdown Broker

2 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

Today I did not have much time for the Chaos Day, because I was busy writing a Game Day summary and an incident review, taking part in incident handling, etc. So, enough chaos for one day :)

But I wanted to merge the PR from Peter and test how our brokers behave when they are not shut down gracefully. I did that on Wednesday (21-10-2020).
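One way to simulate such a non-graceful shutdown is to force-delete the broker's pod without a grace period, so the process gets no chance to clean up. The pod name below is an illustrative assumption.

```python
# Sketch: shut a broker down non-gracefully by force-deleting its pod with a
# zero grace period. The pod name is an illustrative assumption.
import subprocess

subprocess.run(
    ["kubectl", "delete", "pod", "zeebe-1", "--grace-period=0", "--force"],
    check=True,
)
```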