
Building Confidence at Scale: How Camunda Ensures Platform Reliability Through Continuous Testing

· 8 min read
Christopher Kujawa
Principal Software Engineer @ Camunda

As businesses increasingly rely on process automation for their critical operations, the question of reliability becomes paramount. How can you trust that your automation platform will perform consistently under pressure, recover gracefully from failures, and maintain performance over time?

At Camunda, we've been asking ourselves these same questions for years, and today I want to share how our reliability testing practices have evolved to ensure our platform meets the demanding requirements of enterprise-scale deployments. I will also outline our plans to further invest in this crucial area.

From Humble Beginnings to Comprehensive Testing

Our reliability testing journey began in early 2019 with what we then called "benchmarks" – simple load tests to validate basic performance of our Zeebe engine.

Over time, we recognized that running such benchmarks alone wasn't enough. We needed to ensure that Zeebe could handle real-world conditions, including failures and long-term operation. This realization led us to significantly expand our testing approach.

We introduced endurance tests that run for weeks, simulating sustained load to uncover memory leaks and performance degradation. These tests helped us validate that Zeebe could maintain its performance characteristics over extended periods of time. Investing in these endurance tests paid off, as we identified and resolved several critical issues that only manifested under prolonged load. Additionally, they let us build up experience of what a healthy system looks like and what to examine when investigating a faulty one. With this knowledge, we created Grafana dashboards that we use directly to monitor our production systems and also provide to our customers.

We embraced chaos engineering principles, developing a suite of chaos experiments to simulate failures in a controlled manner. We created zbchaos, an open-source fault injection tool tailored for Camunda, allowing us to automate and scale our chaos experiments. Automated chaos experiments now run daily against all supported versions of Camunda, covering a wide range of failure scenarios.

Additionally, we run semi-regular manual "chaos days" where we design and execute new chaos experiments, documenting our findings in our chaos engineering blog.

What started as a straightforward performance validation tool has evolved into a comprehensive framework that combines load testing, chaos engineering, and end-to-end testing. This evolution wasn't just about adding more tests. It reflected our growing understanding that reliability isn't a single metric but a multifaceted quality that emerges from systematic validation across different dimensions: performance under load, behavior during failures, and consistency over time.

We combine all of the above under the umbrella of what we now call "reliability testing." We define reliability testing as a software testing practice that validates system performance and reliability, both over extended periods of time and under injected failure scenarios (chaos).

If you are interested in the evolution of our reliability testing, I have given several CamundaCon talks and written blog posts over the years that you might find interesting.

Why Reliability Testing Matters

We prepare customers for enterprise-scale operations. For this, we need to be confident that we are building a product that is fault-tolerant, reliable, and performant even under turbulent conditions.

For our customers running mission-critical processes, reliability testing provides several crucial benefits:

  • Proactive Issue Detection: We identify problems before they impact production environments. Memory leaks, performance degradation, and distributed system failures that only manifest under specific conditions are caught early in our testing cycles.
  • Confidence in Long-Term Operation: Our endurance tests validate that Camunda can run fault-free over extended periods, ensuring your automated processes won't degrade over time.
  • Graceful Failure Handling: Through chaos engineering, we verify that the platform handles failures elegantly, maintaining data consistency and recovering automatically when possible.
  • Performance Assurance: Continuous load testing ensures that Camunda meets performance expectations (e.g., number of Process Instances / second), even as new features are added and the codebase evolves.
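To make the throughput metric concrete, here is a minimal sketch of how instances-per-second could be measured in a load test. The `start_instance` callable is a hypothetical stand-in for a real client call against the cluster, not an actual Camunda API:

```python
import time

def measure_throughput(start_instance, duration_s=10.0):
    """Start as many process instances as possible for duration_s
    seconds and return the achieved rate (instances/second)."""
    started = 0
    begin = time.monotonic()
    while time.monotonic() - begin < duration_s:
        start_instance()  # in a real test: a client call against the cluster
        started += 1
    return started / (time.monotonic() - begin)

# With a no-op stub standing in for the real client call:
rate = measure_throughput(lambda: None, duration_s=0.1)
```

A real load test would of course run many such loops concurrently and track latency percentiles alongside the raw rate.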

Our Current Testing Arsenal

Today, our reliability testing encompasses two main pillars: load tests and chaos engineering.

Variations of Load Tests

We run different variants of load tests continuously:

  • Release Endurance Tests: Every supported version undergoes continuous endurance testing with artificial workloads, updated with each patch release
  • Weekly Endurance Tests: Based on our main branch, these tests run for four weeks to detect newly introduced instabilities or performance regressions
  • Daily Stress Tests: Shorter tests that validate the latest changes in our main branch under high load conditions

Our workloads range from artificial load (simple process definitions with minimal logic) to typical and complex realistic processes that mimic real-world usage patterns.

Examples of such processes are:

[Image: typical process]

[Image: complex process]

Chaos Engineering

Since late 2019, we've embraced chaos engineering principles to build confidence in our system's resilience. Our approach includes:

  • Chaos Days: Regular events where we manually design and execute chaos experiments, documenting findings in our chaos engineering blog
  • Game Days: Regular events where we simulate an incident in our production SaaS environment to validate our incident response processes
  • Automated Chaos Experiments: Daily execution of 16 different chaos scenarios across all supported versions using our zbchaos tool. We drink our own champagne by using Camunda 8 to orchestrate our chaos experiments against Camunda.
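The automated experiments follow the classic chaos-engineering loop: verify a steady-state hypothesis, inject a fault, verify the hypothesis holds again, and roll back. A minimal sketch of that loop (illustrative only, not zbchaos itself):

```python
def run_chaos_experiment(steady_state, inject_fault, rollback):
    """Minimal chaos-experiment loop: verify the steady-state
    hypothesis, inject a fault, verify recovery, then roll back."""
    if not steady_state():
        return "aborted: system unhealthy before the experiment"
    try:
        inject_fault()
        # after the fault, the hypothesis must hold again (self-healing)
        return "passed" if steady_state() else "failed: steady state violated"
    finally:
        rollback()

# Toy example: a "cluster" that keeps quorum when one node is killed.
cluster = {"healthy_nodes": 3}
result = run_chaos_experiment(
    steady_state=lambda: cluster["healthy_nodes"] >= 2,    # quorum available
    inject_fault=lambda: cluster.update(healthy_nodes=2),  # kill one node
    rollback=lambda: cluster.update(healthy_nodes=3),      # restore the node
)
# result == "passed"
```

In practice, the steady-state check would query cluster metrics (e.g., partition health) and the fault injection would, for example, delete a broker pod.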

Investing in the Future

With the foundation we’ve established through years of focused reliability testing on the Zeebe engine and its distributed architecture, we’re now expanding that maturity across the entire Camunda product. Our goal is to develop an even more robust and trustworthy product overall. To achieve this, we are consolidating the reliability testing efforts that have historically existed across individual components into a centralized team. This unified approach enables us to scale our testing capabilities more efficiently, ensure consistent best practices, and share insights across teams, ultimately strengthening the reliability of every part of the product.

Some of our upcoming initiatives driven by this team include:

  • Holistic Coverage: We're extending our reliability testing to cover all components of the Camunda 8 platform via a central reliability testing framework.
  • Chaos Engineering: We're planning to introduce new chaos experiments that simulate more complex failure modes, including network partitions, data corruption, and cascading failures.
  • Performance Optimization: Beyond maintaining performance, we utilize our testing infrastructure to identify optimization opportunities and validate improvements.
  • Enhanced Observability: Building on our extensive Grafana dashboards, we continually improve our ability to detect and diagnose issues quickly.
  • Establish Reliability Practices: We're formalizing reliability testing practices and guidelines that can be adopted across all engineering teams at Camunda.
  • Enablement: With these resources, we want to enable all of our more than 15 product teams at Camunda to understand, implement, and execute reliability testing principles in their work, allowing them to build more reliable software from the start and scaling our efforts.

Building Trust Through Transparency

Our commitment to reliability testing isn't just about internal quality assurance – it's about building trust with our customers and the broader community. That's why we:

  • Publish our testing methodologies and results openly
  • Share our learnings through blog posts and conference talks
  • Provide tools like zbchaos as open source for the community

Conclusion

Reliability testing at Camunda has evolved from simple benchmarks to a comprehensive practice that combines load testing, chaos engineering, and end-to-end validation. This evolution reflects our understanding that true reliability emerges from systematic testing across multiple dimensions.

For our customers, this means confidence that Camunda will perform reliably under their most demanding workloads. For engineers interested in joining our team, it represents an opportunity to work with cutting-edge testing practices at scale.

As we continue to invest in reliability testing, we remain committed to transparency and sharing our learnings with the community. After all, the reliability of process automation platforms isn't just a technical challenge – it's fundamental to the digital transformation of businesses worldwide.


Interested in learning more about our reliability testing practices? Check out our detailed documentation, explore our chaos engineering experiments, or reach out to discuss how Camunda's reliability testing ensures your critical processes run smoothly.

Stress testing Camunda

· 12 min read

In today's chaos experiment, we focused on stress-testing the Camunda 8 orchestration cluster under high load. We simulated a large number of concurrent process instances to evaluate processing performance and system reliability.

Thanks to our recent work on supporting load tests for different versions, we were able to compare how different Camunda versions handle stress.

TL;DR: Overall, we saw that all versions of the Camunda 8 orchestration cluster (with a focus on processing) are robust and can handle high loads effectively and reliably. Considering throughput and latency with similar resource allocation among the brokers, 8.7.x outperforms the other versions. If we take our streamlined architecture into account (which now contains more components in a single application) and align the resources for 8.8.x, it can achieve similar throughput levels to 8.7.x while maintaining significantly lower latency (by a factor of 2). An overview of the results can be found in the Results section below.

Info: [Update: 28.11.2025]

After the initial analysis, we conducted further experiments with 8.8 to understand why the measured processing performance was lower compared to 8.7.x. The blog post (including TL;DR) has been updated with the new findings in the section Further Experiments below.

Resiliency against ELS unavailability

· 11 min read

Due to recent initiatives and architecture changes, we have coupled ourselves even more tightly to the secondary storage (often Elasticsearch, but it can also be OpenSearch or, in the future, an RDBMS).

We now have a single application that runs the web apps, Gateway, Broker, Exporters, etc., together, including the new Camunda Exporter, which writes all necessary data to the secondary storage. On bootstrap, we need to create the expected schema so that our components work correctly, allowing the Operate and Tasklist web apps to consume the data and the exporter to export properly. Furthermore, we have a new query API (REST API) that allows searching the data available in the secondary storage.

We have seen in previous experiments and load tests that an unavailable ELS and improperly configured replicas can cause issues such as the exporter not catching up or queries failing. See the related GitHub issue.

In today's Chaos day, we want to experiment with the replicas setting of the indices, which can be configured in the Camunda Exporter (the component in charge of writing the data to the secondary storage).

TL;DR: Without index replicas set, the Camunda Exporter is directly impacted by ELS node restarts. The query API seems to handle restarts transparently, but the resulting data changes. Setting replicas causes some performance impact, as the ELS nodes may run into CPU throttling (they have much more to do). A slowed-down ELS impacts processing as well, due to our write-throttling mechanics. This means we need to be careful with this setting: while it gives us better availability (the Camunda Exporter can continue when ELS nodes restart), it comes at a cost.
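Why an exporter falls behind while the secondary storage is unavailable can be sketched in a few lines: failed bulk writes are retried with backoff, and exporter lag grows during exactly that window (an illustrative model, not the actual Camunda Exporter code):

```python
import time

def export_with_backoff(write_batch, max_retries=5, base_delay=0.01):
    """Retry a failing bulk write with exponential backoff.
    While retries accumulate, the exporter falls behind the stream."""
    delay = base_delay
    for _ in range(max_retries):
        try:
            return write_batch()
        except ConnectionError:
            time.sleep(delay)  # exporter lag grows during this window
            delay *= 2         # exponential backoff
    raise RuntimeError("secondary storage unavailable, exporter stalled")

# Storage that fails twice (e.g. during an ELS node restart), then recovers.
failures = iter([True, True, False])
def flaky_write():
    if next(failures):
        raise ConnectionError("node restarting")
    return "acked"

result = export_with_backoff(flaky_write)  # -> "acked" after two retries
```

With index replicas configured, the restart of a single ELS node would not surface as a write failure here in the first place, which is the availability/cost trade-off discussed above.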

Follow up REST API performance

· 26 min read

Investigating REST API performance

This post collates the experiments, findings, and lessons learned during the REST API performance investigation.

There wasn't one explicit root cause identified. As is often the case with such performance issues, it was a combination of several things.

Quintessence: the REST API is more CPU-intensive than gRPC. You can read more about this in the conclusion. We discovered ~10 issues to follow up on, of which at least 2-3 might have a significant impact on performance. Details can be found in the Discovered issues section.

Performance of REST API

· 8 min read

In today's Chaos day, we wanted to experiment with the new REST API (v2) as a replacement for our previously used gRPC API.

By default, our load tests use gRPC, but as we want to make the REST API the default and fully release it with 8.8, we want to make sure to test it accordingly with regard to reliability.

TL;DR: We observed a severe performance regression when using the REST API, even when job streaming (over gRPC) is in use by the job workers. Our client also seems to have higher memory consumption, which caused some instabilities in our tests. With the new API, we lack certain observability, which makes it harder to dive into details. We should investigate this further and find potential bottlenecks and improvements.


How does Zeebe behave with NFS

· 15 min read

This week, we (Lena, Nicolas, Roman, and I) held a workshop where we looked into how Zeebe behaves with network file storage (NFS).

We ran several experiments with NFS and Zeebe, messing around with connectivity.

TL;DR: We were able to show that NFS can handle certain connectivity issues, merely causing Zeebe to process more slowly. If we completely lose the connection to the NFS server, several issues can arise, such as IOExceptions on flush (where RAFT goes into inactive mode) or SIGBUS errors on reading (e.g., during replay), causing the JVM to crash.
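The flush behavior described above can be modeled in a few lines: rather than acknowledging a write that may never have reached disk, the partition steps down into an inactive role when the flush fails (a toy model, not Zeebe's actual RAFT implementation):

```python
class PartitionSketch:
    """Toy model: an I/O error during flush (cf. IOException on NFS)
    moves the partition into an inactive role instead of acknowledging
    a write that may have been lost."""

    def __init__(self, flush):
        self.flush = flush          # callable that persists an entry
        self.role = "LEADER"

    def commit(self, entry):
        if self.role == "INACTIVE":
            raise RuntimeError("partition is inactive")
        try:
            self.flush(entry)
            return "committed"
        except OSError:
            self.role = "INACTIVE"  # step down; operator intervention needed
            return "flush failed, going inactive"

def broken_nfs(entry):
    raise OSError("lost connection to NFS server")

partition = PartitionSketch(broken_nfs)
result = partition.commit("record")  # -> "flush failed, going inactive"
```

Going inactive is the safe choice here: acknowledging an unflushed entry could violate the durability guarantees the replication protocol depends on.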

Lower memory consumption of Camunda deployment

· 10 min read

I'm back to finally do some load testing again.

In the past months, we have changed our architecture: instead of deploying all of our components as separate deployments, we now have one single StatefulSet. This StatefulSet runs our single Camunda standalone application, combining all components.

[Image: simpler deployment]

We will share more details on this change in a separate blog post. For simplicity, in our load tests (benchmark Helm charts), we combined all the resources we had previously split over multiple deployments; see related PR #213.

We are currently running our test with the following resources by default:

    Limits:
      cpu: 2
      memory: 12Gi
    Requests:
      cpu: 2
      memory: 6Gi

In today's Chaos day, I want to look into our resource consumption and whether we can reduce our used requests and limits.

TL;DR: We focused on experimenting with different memory resources and were able to show that, for our load tests, we can reduce the used memory by 75% and our previously provisioned resources by more than 80%.
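As a back-of-the-envelope check of the headline number, assuming the "more than 80%" reduction applies to the default 12Gi memory limit shown above:

```python
limit_gi = 12                     # default memory limit from the block above
reduction = 0.80                  # "more than 80%" provisioned-resource reduction
new_limit_gi = limit_gi * (1 - reduction)
# new_limit_gi ≈ 2.4, i.e. the new limit ends up at roughly 2.4Gi or less
```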

News from Camunda Exporter project

· 4 min read

In this Chaos day, we want to verify the current state of the exporter project and run benchmarks with it. Comparing against a previous version (v8.6.6) should give us a good hint about the current state and potential improvements.

TL;DR: The latency of user data availability has improved due to our architecture change, but we still need to fix some bugs before the planned release of the Camunda Exporter. This experiment allowed us to detect three new bugs; fixing these should make the system more stable.

Impact of Camunda Exporter on processing performance

· 5 min read

In our last Chaos day, we experimented with the Camunda Exporter MVP. After the MVP, we continued with Iteration 2, where we migrated the Archiver deployments and added a new Migration component (allowing us to harmonize indices).

Additionally, some fixes and improvements have been made to the realistic benchmarks, which should allow us to better compare the general performance against a realistic, well-performing benchmark.

This is what we want to explore and experiment with today:

  • Does the Camunda Exporter (since the last benchmark) impact the performance of the overall system?
    • If so, how?
  • How can we potentially mitigate this?

TL;DR: Today's results showed that enabling the Camunda Exporter causes a 25% drop in processing throughput. We identified the CPU as the bottleneck. The drop can seemingly be mitigated by either adjusting the CPU requests or removing the ES exporter. With these results, we are equipped to make further investigations and decisions.

Camunda Exporter MVP

· 7 min read

After a long pause, I am back with an interesting topic to share and experiment with. Right now, we are re-architecting Camunda 8. One important part (which I'm contributing to) is getting rid of the Webapps Importer/Archivers and moving data aggregation closer to the engine (inside a Zeebe Exporter).

Today, I want to experiment with the first increment/iteration of our so-called MVP. The MVP targets green field installations where you simply deploy Camunda (with a new Camunda Exporter enabled) without Importers.

TL;DR: All our experiments were successful. The MVP is a success, and we are looking forward to further improvements and additions. Next stop, Iteration 2: adding archiving of historic data and preparing for data migration (and polishing the MVP).

Camunda Exporter

The Camunda Exporter project deserves its own complete blog post; here is just a short summary.

Our current Camunda architecture looks something like this (simplified).

[Image: current (simplified) architecture]

It has certain challenges, like:

  • Space: duplication of data in ES
  • Maintenance: duplication of importer and archiver logic
  • Performance: Round trip (delay) of data visible to the user
  • Complexity: installation and operational complexity (we need separate pods to deploy)
  • Scalability: The Importer is not scalable in the same way that Zeebe brokers (and the workload) are.

We obviously wanted to overcome these challenges, and the plan (as mentioned earlier) is to get rid of the need for separate importers and archivers (and, in general, of separate applications; but this is a different topic).

The plan for this project looks something like this:

[Image: plan]

We plan to:

  1. Harmonize the existing indices stored in Elasticsearch/Opensearch
    • Space: Reduce the unnecessary data duplication
  2. Move importer and archiver logic into a new Camunda exporter
    • Performance: This should allow us to reduce one additional hop (as we don't need to use ES/OS as a queue)
    • Maintenance: Indices and business logic are maintained in one place
    • Scalability: With this approach, we can scale with partitions, as Camunda Exporters are executed for each partition separately (soon partition scaling will be introduced)
    • Complexity: The Camunda Exporter will be built-in and shipped with Zeebe/Camunda 8. No additional pod/application is needed.

Note: Optimize is right now out of scope (due to time), but will later be part of this as well.
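The scalability point above can be illustrated with a toy model: one exporter instance per partition, each writing independently into shared, harmonized indices (a hypothetical sketch, not the real exporter SPI; index names are made up):

```python
from collections import defaultdict

class CamundaExporterSketch:
    """Toy per-partition exporter: each partition exports its own
    records independently, so export capacity scales with the
    number of partitions."""

    def __init__(self, partition_id, storage):
        self.partition_id = partition_id
        self.storage = storage              # shared secondary storage (ES/OS)
        self.last_exported_position = -1    # progress tracked per partition

    def export(self, record):
        # write into a harmonized index keyed by record type,
        # instead of separate Operate/Tasklist index sets
        index = f"camunda-record-{record['type']}"
        self.storage[index].append(record)
        self.last_exported_position = record["position"]

storage = defaultdict(list)
exporters = {pid: CamundaExporterSketch(pid, storage) for pid in (1, 2, 3)}
for pid, exporter in exporters.items():
    exporter.export({"type": "process-instance", "position": 100, "partition": pid})
# storage["camunda-record-process-instance"] now holds one record per partition
```

Because each exporter instance only sees its own partition's stream, adding partitions adds export capacity without any shared importer bottleneck.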

MVP

Now that we know what we want to achieve, what is the minimum viable product (MVP)?

We have divided the Camunda Exporter work into 3-4 iterations. You can see and read more about this here.

The first iteration contains the MVP (the first breakthrough): providing the Camunda Exporter with the basic functionality ported from the Operate and Tasklist importers, writing into harmonized indices.

The MVP targets green field installations (clean installations) of Camunda 8 with the Camunda Exporter, without running the old Importer (no data migration yet).

[Image: mvp]