35 posts tagged with "availability"

Follow-up: REST API performance

· 20 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

Investigating REST API performance

This post collates the experiments, findings, and lessons learned during the REST API performance investigation.

We didn't identify one explicit root cause. As is often the case with such performance issues, it is a combination of several things.

Quintessence: the REST API is more CPU-intensive than gRPC. You can read more about this in the conclusion. We discovered ~10 issues we have to follow up on, of which at least 2-3 might have a significant impact on performance. Details can be found in the Discovered issues section.

Performance of REST API

· 7 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

In today's Chaos day we wanted to experiment with the new REST API (v2) as a replacement for our previously used gRPC API.

By default, our load tests use gRPC, but since we want to make the REST API the default and fully release it with 8.8, we want to make sure to test it accordingly with regard to reliability.

TL;DR; We observed a severe performance regression when using the REST API, even when the job workers use job streaming (over gRPC). Our client also seems to have higher memory consumption, which caused some instabilities in our tests. With the new API, we lack certain observability, which makes it harder to dive into details. We should investigate this further and find potential bottlenecks and improvements.
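For context, switching the Java client from gRPC to the new REST API (v2) is essentially a builder change. Below is a minimal sketch; the restAddress and preferRestOverGrpc options reflect our understanding of the 8.6+ client builder and should be verified against the client documentation, and the address and message names are placeholders.

    import io.camunda.zeebe.client.ZeebeClient;
    import java.net.URI;

    public final class RestClientSketch {
      public static void main(final String[] args) {
        // Build a client that talks to the gateway's REST endpoint (v2)
        // instead of gRPC (builder options assumed from the 8.6+ Java client).
        try (final ZeebeClient client =
            ZeebeClient.newClientBuilder()
                .restAddress(URI.create("http://localhost:8080")) // placeholder address
                .preferRestOverGrpc(true) // route supported commands over REST
                .usePlaintext()
                .build()) {
          // Commands now go over REST where supported, e.g. publishing a message:
          client
              .newPublishMessageCommand()
              .messageName("payment-received") // placeholder message name
              .correlationKey("order-42")
              .send()
              .join();
        }
      }
    }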

How does Zeebe behave with NFS

· 13 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

This week, we (Lena, Nicolas, Roman, and I) held a workshop where we looked into how Zeebe behaves with network file storage (NFS).

We ran several experiments with NFS and Zeebe, messing around with connectivity.

TL;DR; We were able to show that NFS can handle certain connectivity issues, which just cause Zeebe to process more slowly. If we completely lose the connection to the NFS server, several issues can arise, such as IOExceptions on flush (where Raft goes into inactive mode) or SIGBUS errors on read (e.g. during replay), causing the JVM to crash.

Lower memory consumption of Camunda deployment

· 9 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

I'm back to finally do some load testing again.

In the past months, we have changed our architecture: instead of deploying all of our components as separate deployments, we now have one single StatefulSet. This StatefulSet runs our single Camunda standalone application, combining all components.

(Figure: simpler deployment)

We will share more details on this change in a separate blog post. For simplicity, in our load tests (benchmark Helm charts), we combined all the resources we had previously split over multiple deployments; see the related PR #213.

We are currently running our tests with the following resources by default:

    Limits:
      cpu: 2
      memory: 12Gi
    Requests:
      cpu: 2
      memory: 6Gi

In today's Chaos day, I want to look into our resource consumption and whether we can reduce our used requests and limits.

TL;DR; We focused on experimenting with different memory resources and were able to show that we can reduce the used memory by 75%, and our previously provisioned resources by more than 80%, for our load tests.
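To put those percentages into perspective, here is a sketch of what the reduced configuration could look like in the benchmark Helm values; the numbers below are illustrative only (roughly an 80% reduction of the memory defaults above), not the exact values from the experiment.

    # Illustrative only: roughly -80% of the default memory above;
    # CPU unchanged, since the experiments focused on memory.
    resources:
      limits:
        cpu: 2
        memory: 2Gi
      requests:
        cpu: 2
        memory: 1Gi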

News from Camunda Exporter project

· 4 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

In this Chaos day, we want to verify the current state of the exporter project and run benchmarks with it. Comparing it with a previous version (v8.6.6) should give us a good indication of the current state and potential improvements.

TL;DR; The latency of user data availability has improved due to our architecture change, but we still need to fix some bugs before our planned release of the Camunda Exporter. This experiment allowed us to detect three new bugs; fixing them should make the system more stable.

Using flow control to handle bottleneck on exporting

· 5 min read
Rodrigo Lopes
Associate Software Engineer @ Zeebe

Zeebe 8.6 introduces a new unified flow control mechanism that can limit user commands (by default it tries to achieve 200 ms response times) and rate-limit writes of new records in general (disabled by default). Limiting the write rate is a new feature that can be used to prevent building up an excessive exporting backlog. There are two ways to limit the write rate: either by setting a static limit or by enabling throttling, which dynamically adjusts the write rate based on the exporting backlog and rate. In these experiments, we test both ways of limiting the write rate and observe the effects on processing and exporting.

TL;DR; Both a static write rate limit and write rate throttling can be used to prevent an excessive exporting backlog from building up. For users, this shows up as backpressure, because processing speed is limited by the rate at which processing results can be written.
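For reference, a rough sketch of what enabling write rate limiting with throttling could look like in the broker configuration is shown below; the property names and numbers are assumptions and should be double-checked against the Camunda 8.6 flow control documentation.

    # Sketch only: property names and values assumed, verify against
    # the Camunda 8.6 flow control documentation.
    zeebe:
      broker:
        flowControl:
          write:
            enabled: true
            limit: 2000                  # upper bound on records written per second
            throttling:
              enabled: true              # adjust the write rate dynamically
              acceptableBacklog: 100000  # exporting backlog size to aim for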

Using flow control to handle uncontrolled process loops

· 6 min read
Rodrigo Lopes
Associate Software Engineer @ Zeebe

Zeebe 8.6 introduces a new unified flow control mechanism that can limit user commands (by default it tries to achieve 200 ms response times) and rate-limit writes of new records in general (disabled by default).

Limiting the write rate is a new feature that can be used to prevent building up an excessive exporting backlog.

In these experiments, we test what happens when we deploy processes with endless loops that result in high processing load, and how we can use the new flow control to keep the cluster stable.

TL;DR; Enabling write rate limiting can help mitigate the effects of process instances that contain uncontrolled loops by preventing an excessive exporting backlog from building up.
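As a point of reference, a purely static write rate limit (without throttling) could look roughly like this in the broker configuration; as before, the property names are assumptions to verify against the Camunda 8.6 documentation.

    # Sketch only: static write rate limit, no throttling
    # (property names assumed, verify against the Camunda 8.6 docs).
    zeebe:
      broker:
        flowControl:
          write:
            enabled: true
            limit: 1000   # cap on records written per second, tuned to the workload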

Reducing the job activation delay

· 12 min read
Nicolas Pepin-Perreault
Senior Software Engineer @ Zeebe

With the addition of end-to-end job streaming capabilities in Zeebe, we wanted to measure the improvements in job activation latency:

  • How much is a single job activation latency reduced?
  • How much is the activation latency reduced between each task of the same process instance?
  • How much is the activation latency reduced on large clusters with a high broker and partition count?

Additionally, we wanted to guarantee that every component involved in streaming, including clients, would remain resilient in the face of load surges.

TL;DR; Job activation latency is greatly reduced, with task-based workloads seeing up to 50% lower overall execution latency. Completing a task now immediately triggers pushing out the next one, meaning the latency to activate the next task in a sequence is bounded by how much time it takes to process its completion in Zeebe. Activation latency is unaffected by how many partitions or brokers there are in a cluster, as opposed to job polling, thus ensuring scalability of the system. Finally, reusing gRPC's flow control mechanism ensures that clients cannot be overloaded even in the face of load surges, without impacting other workloads in the cluster.

Head over to the documentation to learn how to start using job push!
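As a small illustration of what this looks like from the client side, enabling job push on a Java job worker is a single builder flag. This is a minimal sketch with a placeholder gateway address and job type.

    import io.camunda.zeebe.client.ZeebeClient;

    public final class StreamingWorkerSketch {
      public static void main(final String[] args) throws InterruptedException {
        try (final ZeebeClient client =
            ZeebeClient.newClientBuilder()
                .gatewayAddress("localhost:26500") // placeholder gateway address
                .usePlaintext()
                .build()) {
          // streamEnabled(true) switches the worker from polling to job push,
          // so newly available jobs are streamed to the worker immediately.
          client
              .newWorker()
              .jobType("payment") // placeholder job type
              .handler((jobClient, job) ->
                  jobClient.newCompleteCommand(job.getKey()).send().join())
              .streamEnabled(true)
              .open();

          Thread.sleep(Long.MAX_VALUE); // keep the worker running
        }
      }
    }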

Broker Scaling and Performance

· 6 min read
Lena Schönburg
Senior Software Engineer @ Zeebe
Deepthi Akkoorath
Senior Software Engineer @ Zeebe

With Zeebe now supporting the addition and removal of brokers in a running cluster, we wanted to test three things:

  1. Is there an impact on processing performance while scaling?
  2. Is scaling resilient to high processing load?
  3. Can scaling up improve processing performance?

TL;DR; Scaling up works even under high load and has low impact on processing performance. After scaling is complete, processing performance improves in both throughput and latency.

Dynamic Scaling with Dataloss

· 5 min read
Lena Schönburg
Senior Software Engineer @ Zeebe

We continue our previous experiments with dynamic scaling, now also testing whether the cluster survives data loss during the process.

One goal is to verify that we haven't accidentally introduced a single point of failure in the cluster. Another is to ensure that data loss does not corrupt the cluster topology.

TL;DR; Even with data loss, scaling completes successfully and with the expected results. We found that during scaling, a single broker from the previous cluster configuration can become a single point of failure by preventing a partition from electing a leader. This is not exactly a bug, but something we want to improve.