BPMN meets Chaos Engineering

8 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

On the first of April (2021) we ran our Spring Hackday at Camunda. This is an event where the developers at Camunda come together to work on projects they like or on new ideas and approaches they want to try out. This time we (Philipp and I) wanted to orchestrate our chaos experiments with BPMN. If you already know how we automated our chaos experiments before, you can skip the next section and jump directly to the Hackday Project section.

In order to understand this blog post, make sure you have a basic understanding of Zeebe, Camunda Cloud, and Chaos Engineering. Read the following resources to get a better understanding.

Set file immutable

7 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

This chaos day was a bit different. I actually wanted to experiment with Camunda Cloud again and verify that our high-load chaos experiments now work with the newest cluster plans, see zeebe-cluster-testbench#135. Unfortunately, I found out that our test chaos cluster was broken in such a way that we were not able to create new clusters. Luckily this was fixed at the end of the day, thanks to @immi :)

Because of these circumstances I thought about different things to experiment with, and I remembered that on the last chaos day we worked with patching running deployments in order to add more capabilities. This allowed us to create IP routes and experiment with the Zeebe deployment distribution. During this I read the Linux capabilities list and found out that we can mark files as immutable, which might be interesting for a chaos experiment.

On this chaos day I planned to find out how marking a file immutable affects our brokers, and I made the following hypothesis: if a leader has a write error which is not recoverable, it will step down and another leader should take over. I put this in our hypothesis backlog (zeebe-chaos#52).

In order to really run this kind of experiment I needed to find out whether marking a file immutable causes any problems and, if not, how I can cause write errors that affect the broker. Unfortunately it turned out that immutable files do not cause issues on already opened file channels, but I found some other bugs/issues, which you can read about below.
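
For reference, this is roughly how a file can be marked immutable inside a running broker pod. The pod name and data path are assumptions based on a default Helm setup, and chattr itself requires the CAP_LINUX_IMMUTABLE capability in the container:

```bash
# Pick one broker pod (name assumed from a default Helm release).
POD=zeebe-0

# Mark one of partition 1's runtime files immutable; the path is an
# assumption based on the default image layout.
kubectl exec "$POD" -- chattr +i /usr/local/zeebe/data/raft-partition/partitions/1/runtime/CURRENT

# Verify the attribute was set (an 'i' flag should show up).
kubectl exec "$POD" -- lsattr /usr/local/zeebe/data/raft-partition/partitions/1/runtime/
```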

In the next chaos days I will search for a way to cause write errors proactively, so we can verify that our system can handle such issues.

Camunda Cloud network partition

8 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

This time Deepthi joined me on my regular Chaos Day. 🎉

On the second-to-last chaos day I created an automated chaos experiment which verifies that deployments are distributed after a network partition. Later it turned out that this only works for our Helm setup, not for Camunda Cloud. The issue was that on our Camunda Cloud Zeebe clusters we had no NET_ADMIN capability to create IP routes (used for the network partitions). After discussing this with our SREs, they proposed a good way to overcome it: when running network-related chaos experiments, we patch the target cluster to add this capability. This means we don't need to add such functionality to Camunda Cloud and the related Zeebe operate/controller. Big thanks to Immi and David for providing this fix.
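
A sketch of such a patch, assuming a statefulset and container both named zeebe (the actual names and the patch our SREs apply may differ):

```bash
# Add the NET_ADMIN capability to the broker container, so that
# 'ip route' can be used to create network partitions.
kubectl patch statefulset zeebe --patch '
spec:
  template:
    spec:
      containers:
        - name: zeebe
          securityContext:
            capabilities:
              add: ["NET_ADMIN"]
'
```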

TL;DR;

We were able to enhance the deployment distribution experiment and run it in Camunda Cloud via Testbench. We have enabled the experiment for the Production M and L cluster plans. We had to adjust the rights of the Testbench service account to make this work.

Fault-tolerant processing of process instances

6 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

Today I wanted to add another chaos experiment to grow our collection of automated chaos experiments. This time we deploy a process model (with a timer start event), restart a node, and complete the process instance via zbctl.

TL;DR;

I was able to create the chaos toolkit experiment. It shows that we are able to restore our state after failover, which means we can trigger timer start events to create process instances even if the model was deployed before the failover. Plus, we are able to complete these instances.
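
The experiment boils down to a handful of zbctl and kubectl calls. A minimal sketch, with file, pod, and job type names as placeholders:

```bash
# Deploy a process model that contains a timer start event.
zbctl deploy timer-start-process.bpmn --insecure

# Restart one of the broker nodes (pod name is a placeholder).
kubectl delete pod zeebe-0

# After the timer fires and a new instance is created, complete its job.
# The job type is a placeholder; the job key comes from the activation response.
zbctl activate jobs "benchmark-task" --insecure
zbctl complete job <jobKey> --insecure
```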

Deployment Distribution

11 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

On this chaos day we wanted to experiment a bit with deployments and their distribution.

We ran a chaos experiment in which we deployed multiple workflows and disconnected two leaders, and we verified that the deployments are distributed afterwards. The chaos experiment was successful and showed how fault-tolerant the deployment distribution is. 💪
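
Deploying the workflows for such an experiment can be as simple as a small loop; a sketch with placeholder file names:

```bash
# Deploy ten process models so there is enough deployment
# work to distribute across the partitions.
for i in $(seq 1 10); do
  zbctl deploy "model-$i.bpmn" --insecure
done
```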

Network partitions

8 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

As you can see, I migrated the old chaos day summaries to GitHub Pages for better readability. I always wanted to play around with GitHub Pages and Jekyll, so this was a good opportunity. I hope you like it. 😄

On the last Chaos Day, we experimented with disconnecting a Leader and one Follower. We expected no major disruption, since we would still have quorum and could process records. Today I want to experiment with bigger network partitions.

  • In the first chaos experiment I had a cluster of five nodes and split it into two groups; processing continued as expected, since we still had quorum. 💪
  • In the second chaos experiment I split the cluster into two groups again, but this time we added one follower of the bigger group to the smaller group after a snapshot was taken and compaction was done. The smaller group needed to catch up with the newer state before new processing could start, but everything worked fine. The partitions themselves were created with ip routes; see the sketch below.
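
A minimal sketch of how such a network partition can be created with ip routes (pod names are placeholders, and the broker container needs the NET_ADMIN capability):

```bash
# Make one node of the other group unreachable from this node.
# Pod names are placeholders; repeat this for every cross-group pair.
OTHER_IP=$(kubectl get pod zeebe-3 -o jsonpath='{.status.podIP}')
kubectl exec zeebe-0 -- ip route add unreachable "$OTHER_IP"

# Heal the partition again by removing the route.
kubectl exec zeebe-0 -- ip route del unreachable "$OTHER_IP"
```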

Disconnect Leader and one Follower

8 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

Happy new year everyone 🎉

This time I wanted to verify the following hypothesis: disconnecting a Leader and one Follower should not disrupt the cluster (#45). But in order to do that, we need to extract the Leader and Follower nodes for a partition from the topology. Luckily, in December we got an external contribution which allows us to print the zbctl status as JSON. This gives us more possibilities, since we can extract values from it much more easily.
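
A sketch of how the Leader of a partition can be extracted from that JSON; the exact field names are assumptions based on the topology response and may need adjusting:

```bash
# Print the node id of the leader of partition 1.
# Field names (brokers, partitions, partitionId, role, nodeId) are assumed.
zbctl status --insecure --output json \
  | jq '.brokers[] | select(.partitions[] | .partitionId == 1 and .role == "LEADER") | .nodeId'
```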

TL;DR The experiment was successful 👍

Message Correlation after Failover

4 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

Today I wanted to finally implement an experiment which I had postponed for a long time, see #24. The problem was that previously we were not able to determine on which partition a message was published, so we could not assert that it was published on the correct partition. With #4794, which was by the way a community contribution, this is now possible. 🎉
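
A sketch of how the partition could be asserted, assuming zbctl can print the publish response as JSON and that the response contains the record key (both are assumptions here); Zeebe encodes the partition id in the upper bits of every record key:

```bash
# Publish a message and capture the returned record key.
# Message name, correlation key, and the 'key' field are assumptions.
KEY=$(zbctl publish message "order-placed" --correlationKey "order-1" \
        --insecure --output json | jq -r '.key')

# Zeebe reserves the upper bits of a key for the partition id,
# so shifting by 51 yields the partition the message was published to.
echo $((KEY >> 51))
```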

Many Job Timeouts

4 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

On the last game day (Friday, 06.11.2020) I wanted to test whether we can break a partition if many messages time out at the same time. I sent many, many messages with a decreasing TTL, all targeting a specific point in time, such that they would all time out at the same moment. I expected that, when this happens, the processor would try to time them all out at once and break because the batch is too big. Fortunately this didn't happen; the processor was able to handle it.
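
The burst looked roughly like this; a sketch with placeholder message names, where each message lives one second less than the previous one so that they all expire around the same instant:

```bash
# Publish 100 messages whose TTLs all point at (roughly) the same moment:
# message i is published ~1s after message i-1 and lives 1s shorter.
for i in $(seq 1 100); do
  zbctl publish message "expire-together-$i" --correlationKey "key-$i" \
    --ttl "$((100 - i))s" --insecure
  sleep 1
done
```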

I wanted to verify the same with job timeouts.