
Follow up REST API performance

· 8 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

This blog post summarizes the investigation of the REST API performance and gives some hints and a collection of things to improve.

REST API Metrics

One remark from the last experiments was that we do not have good insights into the REST API. Actually, we already expose the necessary metrics, but they were not yet available in our dashboard.

This is currently prepared with #33907. Based on this, I was able to further investigate the REST API performance.

rest-api

What we can see is that our requests take more than 50 ms on average to complete. This causes our throughput to drop; we are not even able to create 150 PI/s.

Looking at a different benchmark using gRPC, we can see that requests take 5-10 ms to complete and have a stable throughput.

grpc-latency grpc

Due to the slower workers (on completion), we can see error reports of the workers not being able to accept further job pushes. This has been mentioned in the previous blog post as well. As a consequence, the worker sends FAIL commands for such jobs to give them back. This has a cascading effect, as jobs are sent back and forth, impacting the general process instance execution latency (which grows up to 60 s, compared to 0.2 s).

Investigating Worker errors

In our previous experiments, we have seen the following exceptions:

13:25:14.684 [pool-4-thread-3] WARN  io.camunda.client.job.worker - Worker benchmark failed to handle job with key 4503599628992806 of type benchmark-task, sending fail command to broker
java.lang.IllegalStateException: Queue full
at java.base/java.util.AbstractQueue.add(AbstractQueue.java:98) ~[?:?]
at java.base/java.util.concurrent.ArrayBlockingQueue.add(ArrayBlockingQueue.java:329) ~[?:?]
at io.camunda.zeebe.Worker.lambda$handleJob$1(Worker.java:122) ~[classes/:?]
at io.camunda.client.impl.worker.JobRunnableFactoryImpl.executeJob(JobRunnableFactoryImpl.java:45) ~[camunda-client-java-8.8.0-SNAPSHOT.jar:8.8.0-SNAPSHOT]
at io.camunda.client.impl.worker.JobRunnableFactoryImpl.lambda$create$0(JobRunnableFactoryImpl.java:40) ~[camunda-client-java-8.8.0-SNAPSHOT.jar:8.8.0-SNAPSHOT]
at io.camunda.client.impl.worker.BlockingExecutor.lambda$execute$0(BlockingExecutor.java:50) ~[camunda-client-java-8.8.0-SNAPSHOT.jar:8.8.0-SNAPSHOT]
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572) ~[?:?]
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317) ~[?:?]
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?

This is actually coming from the Worker (benchmark) application, as it is collecting all the request futures in a blocking queue.

As request handling is slower, we collect more futures in the worker, which fills up the queue. This in turn causes more jobs to fail, creating even more work.

This also explains why our workers have a higher memory consumption; we had to increase the worker memory to keep the workers stable.
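
For illustration, here is a minimal sketch of that pattern (class and field names are ours, not the actual benchmark code): the job handler adds the completion future to a bounded queue, and once responses lag behind, add() throws the "Queue full" IllegalStateException seen above, which the worker then turns into a FAIL command.

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.CompletableFuture;

    // Minimal sketch, not the actual benchmark Worker: futures of in-flight
    // complete-job requests are collected in a bounded queue.
    public class WorkerSketch {
      private final BlockingQueue<CompletableFuture<?>> requestFutures =
          new ArrayBlockingQueue<>(1_000); // bounded capacity (illustrative)

      void handleJob(CompletableFuture<?> completeJobFuture) {
        // add() throws IllegalStateException("Queue full") when the queue is at
        // capacity - exactly the exception in the stack trace above.
        requestFutures.add(completeJobFuture);
      }
    }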

Profiling the System

With the previous results, we were encouraged to do some profiling. To start with, we used JFR (Java Flight Recorder) for some basic profiling.

You can do this by:

  kubectl exec -it "$1" -- jcmd 1 JFR.start duration=100s filename=/usr/local/camunda/data/flight-$(date +%d%m%y-%H%M).jfr

Once the flight recording is done, you can copy the recording (via kubectl cp) and open it with IntelliJ (JMC didn't work for me).

first-profile

We see that the Spring filter chaining dominates the profile, which is not unexpected, as every request goes through this chain. Since this is a CPU-based sampling profile, it is bound to show up prominently. Still, it was something interesting to note and investigate.

Path pattern matching

Some research showed that it might be interesting to look into other path pattern matchers, as we use the (legacy) AntPathMatcher with regex.

Resources:

Gateway - Broker request latency

As we have such a high request-response latency, we have to find out where the time is spent. Ideally, we would have some sort of tracing (which we don't have yet), or we would look at metrics that cover sub-parts of the system and the request-response cycle.

The REST API request-response latency metric can be taken as the complete round trip: accepting the request at the gateway edge, converting it to a broker request, sending it to the broker, the broker processing it, sending the response back, etc.

Luckily, we have a metric that covers the part from sending the broker request (on the other side of the gateway) to the broker and waiting for the response. See related code here.

The difference shows us that the overhead is not small, meaning that the gateway-to-broker request-response is actually slower with REST as well, which is unexpected.

This can either be because different data is sent, a different API is used, or because of some other execution mechanics, etc.

Using the same cluster and enabling the REST API later, we can see the immediate effect on performance.

rest-enabled

Request handling execution logic

A difference we have spotted between the REST API and gRPC is the usage of the BrokerClient.

While on the gRPC side we use the BrokerClient with retries and direct response handling, on the REST API side we use no retries and handle the response asynchronously on the ForkJoinPool.

As our benchmark clusters have two CPUs, meaning one thread for the common ForkJoinPool, we expected some contention on that thread.

For testing purposes, we increased the thread count by: -Djava.util.concurrent.ForkJoinPool.common.parallelism=8
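
As a quick sanity check (a small sketch, not part of the benchmark code), the default common-pool parallelism and the effect of this property can be verified like this:

    import java.util.concurrent.ForkJoinPool;

    public class CommonPoolCheck {
      public static void main(String[] args) {
        // By default the common pool uses availableProcessors() - 1 threads,
        // so a 2-CPU container ends up with a single ForkJoinPool worker.
        System.out.println("CPUs: " + Runtime.getRuntime().availableProcessors());
        System.out.println("Common pool parallelism: " + ForkJoinPool.commonPool().getParallelism());
        // Started with -Djava.util.concurrent.ForkJoinPool.common.parallelism=8,
        // the second line prints 8 instead of 1.
      }
    }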

In a profile, we can see that more threads are used, but it doesn't change the performance at all.

profile-inc-fork-join

rest-gw-metrics-after-increaese-thread-pool

The assumption was that we might not be able to handle the responses in time with one thread, and that this also causes some contention on the gateway-broker request-response cycle, but this is not the case.

We seem to spend time somewhere else or have a general resource contention issue. What we can see is that we have to deal with more CPU throttling than without REST API usage.

rest-api-cpu-throttling.png

Increasing the CPU resolves the general performance problem, hinting even more that we might have some issues with threads competing for resources, etc.

In the following screenshot, you see the test with 6 CPUs per Camunda application.

six-cpus

Compare this to the previous run with 2 CPUs per Camunda application, which had to fight with a lot of CPU throttling; the request-response latency was five times higher on average.

two-cpus

Based on this knowledge, we have to investigate this further.

Day 2: Profiling and Experimenting

Yesterday I took profiles of 100 s; today I took a longer (10-minute) profile to reduce the noise. Still, we can see that the filter chain takes ~40% of the complete profile.

jfr-10-minutes-filter-chain.png

When opening the JFR recording with JMC, we get some hints related to context switches, CPU throttling (which we already knew about), and the inverted parallelism of the GC (also mentioning high IO).

locks-and-contention-context-switch.png jfr-cpu-throttling-detection.png gc-ineffeciency-high-io.png

We have already seen in our metrics, for example, that we fight with high CPU throttling.

rest-base-cpu

To better analyze this (and to work around the missing tracing), I added some more metrics to understand where time is spent, and created a temporary dashboard to break it down.
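
To give an idea of what such a metric looks like (a sketch only; the metric name and wiring are illustrative, not the actual code change), a Micrometer timer around the gateway-to-broker round trip could look like this:

    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.Timer;
    import java.util.function.Supplier;

    // Illustrative sketch of a breakdown metric; name and placement are assumptions.
    public class BrokerRequestTimer {
      private final Timer brokerRequestLatency;

      public BrokerRequestTimer(MeterRegistry registry) {
        this.brokerRequestLatency =
            Timer.builder("gateway.broker.request.latency")
                .description("Time from sending a broker request until the response arrives")
                .publishPercentileHistogram()
                .register(registry);
      }

      public <T> T measure(Supplier<T> sendRequest) {
        // record() measures the wall-clock time of the supplied call
        return brokerRequestLatency.record(sendRequest);
      }
    }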

When we look at the baseline with gRPC (taking our weekly benchmarks), we can see that all latencies are low, mostly under 5 ms.

grpc-break-down.png

As soon as we enable the REST API, we can see the latencies go up. The most significant increase we see is in the job activations.

rest-break-down

Fascinatingly, the write-to-process latency, the time from acceptance by the CommandAPI until the processor processes the command, also increases.

Virtual threads

To rule out some thoughts about potential IO and CPU contention, I experimented with virtual threads, which we can easily enable for Spring.

I set the following system property on the statefulset.

-Dspring.threads.virtual.enabled=true
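
As a quick way to confirm that request handling actually runs on virtual threads (an illustrative snippet, not part of the gateway code), a throwaway endpoint can print the current thread:

    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RestController;

    // Illustrative only: with spring.threads.virtual.enabled=true, Spring MVC request
    // handling runs on virtual threads, which this endpoint makes visible.
    @RestController
    class ThreadCheckController {

      @GetMapping("/thread-check")
      String threadCheck() {
        final Thread current = Thread.currentThread();
        return current + " virtual=" + current.isVirtual();
      }
    }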

Taking a new profile, we can see that all the HTTP threads are gone, but the filtering is still prominent.

jfr-virtual-threads.png

Checking our metrics breakdown again, we see that there is no benefit here.

virtual-threads-break-down.png

Direct handling

Investigating the code base, we saw several usages of #handleAsync without an extra executor, which causes the common ForkJoinPool to be used (as mentioned the other day). One idea was to handle the future completions, i.e. the response handling, directly.
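
The difference between the two variants, in a nutshell (a simplified sketch; sendBrokerRequest stands in for the actual gateway call):

    import java.util.concurrent.CompletableFuture;

    public class ResponseHandlingSketch {

      // Placeholder for the actual broker request made by the gateway.
      static CompletableFuture<String> sendBrokerRequest() {
        return CompletableFuture.completedFuture("response");
      }

      public static void main(String[] args) {
        // handleAsync without an executor dispatches the continuation to the common
        // ForkJoinPool - a single thread on our 2-CPU benchmark containers.
        sendBrokerRequest().handleAsync((response, error) -> convert(response, error));

        // "Direct handling": handle runs the continuation on the thread that completes
        // the future (or the caller, if it is already complete), avoiding the hand-off.
        sendBrokerRequest().handle((response, error) -> convert(response, error));
      }

      static String convert(String response, Throwable error) {
        return error == null ? response : "error: " + error.getMessage();
      }
    }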

We didn't observe any benefits with this.

direct-handling-breakdown.png

In the JFR recording, we can see that fewer threads are used, but the Spring filter chain is still very prominent.

direct-handling-v2-profile-too-much-filtering.png

Spring PathPattern parser for MVC

At the end of the day, I finally got to try the PathPattern parser. As mentioned the other day, it is recommended over the legacy AntPathMatcher.

The migration was rather simple: we can replace spring.mvc.pathmatch.matching-strategy=ant_path_matcher with spring.mvc.pathmatch.matching-strategy=path_pattern_parser. We only had to fix some occurrences of regex combinations with **, as ** is only allowed at the end of a pattern (no regex after it).
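
To illustrate the difference (a small sketch; the paths are made up, not actual Camunda endpoints): the AntPathMatcher happily matches ** in the middle of a pattern, while the PathPatternParser rejects it at parse time.

    import org.springframework.http.server.PathContainer;
    import org.springframework.util.AntPathMatcher;
    import org.springframework.web.util.pattern.PathPatternParser;

    public class PatternComparison {
      public static void main(String[] args) {
        // Legacy matcher: string/regex based, ** allowed anywhere in the pattern.
        AntPathMatcher ant = new AntPathMatcher();
        System.out.println(ant.match("/v2/**/search", "/v2/foo/bar/search")); // true

        // PathPattern: patterns are pre-parsed once, ** only allowed as the last segment.
        PathPatternParser parser = new PathPatternParser();
        boolean matches = parser.parse("/v2/jobs/**")
            .matches(PathContainer.parsePath("/v2/jobs/123/activation"));
        System.out.println(matches); // true

        // parser.parse("/v2/**/search") would throw a PatternParseException,
        // which is why such patterns had to be rewritten during the migration.
      }
    }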

See related branch ck-pattern-path-parse.

path-pattern-breakdown

We were able to reduce the latencies by half, which also allowed us to bring back our throughput.

path-pattern-general.png

I did a cross-check with the current SNAPSHOT, and weirdly the SNAPSHOT now behaved the same. I will run this for a while to see the results, as it might fail after a certain period of time. This might also be related to where the pods are scheduled (noisy neighbours, etc.).

rest-base-v2-breakdown.png rest-base-v2-general.png

Combination of direct handling and PathPattern

On top of the above, I combined the direct handling and PathPattern usage, and this gave us the best results.

The latencies are now only two times higher than with gRPC, compared to five times (and more) before.

combination-of-all-breakdown.png

The throttling of the CPU was reduced by half as well.

combination-of-all-cpu.png

This gives us a great, stable throughput again.

combination-of-all-general.png

Performance of REST API

· 7 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

In today's Chaos day, we wanted to experiment with the new REST API (v2) as a replacement for our previously used gRPC API.

By default, our load tests make use of gRPC, but as we want to make the REST API the default and fully release it with 8.8, we want to make sure to test it accordingly with regard to reliability.

TL;DR; We observed a severe performance regression when using the REST API, even when job streaming is in use by the job workers (over gRPC). Our client seems to have a higher memory consumption, which caused some instabilities in our tests as well. With the new API, we lack certain observability, which makes it harder to dive into certain details. We should investigate this further and find potential bottlenecks and improvements.

general

How does Zeebe behave with NFS

· 13 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

This week, we (Lena, Nicolas, Roman, and I) held a workshop where we looked into how Zeebe behaves with network file storage (NFS).

We ran several experiments with NFS and Zeebe, and messing around with connectivity.

TL;DR; We were able to show that NFS can handle certain connectivity issues, just causing Zeebe to process more slowly. If we completely lose the connection to the NFS server, several issues can arise, like IOExceptions on flush (where RAFT goes into inactive mode) or SIGBUS errors on reads (like during replay), causing the JVM to crash.

Lower memory consumption of Camunda deployment

· 9 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

I'm back to finally do some load testing again.

In the past months, we have changed our architecture. Instead of deploying all of our components as separate deployments, we now have one single statefulset. This statefulset runs our single Camunda standalone application, combining all components.

simpler deployment

We will share more details on this change in a separate blog post. For simplicity, in our load tests (benchmark helm charts), we combined all the resources we had split over multiple deployments, see related PR #213.

We are currently running our test with the following resources by default:

    Limits:
      cpu: 2
      memory: 12Gi
    Requests:
      cpu: 2
      memory: 6Gi

In today's Chaos day, I want to look into our resource consumption and whether we can reduce our used requests and limits.

TL;DR; We have focused on experimenting with different memory resources, and were able to show that we can reduce the used memory by 75%, and our previous provisioned resources by more than 80% for our load tests.

News from Camunda Exporter project

· 4 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

In this Chaos day, we want to verify the current state of the exporter project and run benchmarks with it. Comparing with a previous version (v8.6.6) should give us a good hint on the current state and potential improvements.

TL;DR; The latency of user data availability has improved due to our architecture change, but we still need to fix some bugs before our planned release of the Camunda Exporter. This experiment allowed us to detect three new bugs; fixing these should make the system more stable.

Impact of Camunda Exporter on processing performance

· 5 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

In our last Chaos day we experimented with the Camunda Exporter MVP. After our MVP we continued with Iteration 2, where we migrated the Archiver deployments and added a new Migration component (allows us to harmonize indices).

Additionally, some fixes and improvements have been made to the realistic benchmarks, which should allow us to better compare the general performance against a realistic, well-performing benchmark.

Actually, this is what we want to explore and experiment with today.

  • Does the Camunda Exporter (since the last benchmark) impact performance of the overall system?
    • If so how?
  • How can we potentially mitigate this?

TL;DR; Today's results showed that enabling the Camunda Exporter causes a 25% processing throughput drop. We identified the CPU as a bottleneck. This seems to be mitigated by either adjusting the CPU requests or removing the ES exporter. With these results, we are equipped to make further investigations and decisions.

Camunda Exporter MVP

· 7 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

After a long pause, I come back with an interesting topic to share and experiment with. Right now we are re-architecting Camunda 8. One important part (which I'm contributing to) is to get rid of the Webapps Importer/Archivers and move data aggregation closer to the engine (inside a Zeebe Exporter).

Today, I want to experiment with the first increment/iteration of our so-called MVP. The MVP targets green field installations where you simply deploy Camunda (with a new Camunda Exporter enabled) without Importers.

TL;DR; All our experiments were successful. The MVP is a success, and we are looking forward to further improvements and additions. Next stop, Iteration 2: adding archiving of historic data and preparing for data migration (and polishing the MVP).

Camunda Exporter

The Camunda Exporter project deserves its own complete blog post; here is just a short summary.

Our current Camunda architecture looks something like this (simplified).

current

It has certain challenges, like:

  • Space: duplication of data in ES
  • Maintenance: duplication of importer and archiver logic
  • Performance: Round trip (delay) of data visible to the user
  • Complexity: installation and operational complexity (we need separate pods to deploy)
  • Scalability: The Importer is not scalable in the same way as Zeebe or brokers (and workload) are.

We obviously wanted to overcome these challenges, and the plan (as mentioned earlier) is to get rid of the need for separate importers and archivers (and, in general, of separate applications; but this is a different topic).

The plan for this project looks something like this:

plan

We plan to:

  1. Harmonize the existing indices stored in Elasticsearch/Opensearch
    • Space: Reduce the unnecessary data duplication
  2. Move importer and archiver logic into a new Camunda exporter
    • Performance: This should allow us to reduce one additional hop (as we don't need to use ES/OS as a queue)
    • Maintenance: Indices and business logic are maintained in one place
    • Scalability: With this approach, we can scale with partitions, as Camunda Exporters are executed for each partition separately (soon partition scaling will be introduced)
    • Complexity: The Camunda Exporter will be built-in and shipped with Zeebe/Camunda 8. No additional pod/application is needed.

Note: Optimize is right now out of scope (due to time), but will later be part of this as well.

MVP

Now that we know what we want to achieve, what is the minimum viable product (MVP)?

We have divided the Camunda Exporter work into 3-4 iterations. You can see and read more about this here.

The first iteration contains the MVP (the first breakthrough): providing the Camunda Exporter with the basic functionality ported from the Operate and Tasklist importers, writing into harmonized indices.

The MVP targets green field installations (clean installations) of Camunda 8 with the Camunda Exporter, without running the old Importer (no data migration yet).

mvp

Optimizing cluster sizing using a real world benchmark

· 7 min read
Rodrigo Lopes
Associate Software Engineer @ Zeebe

The first goal of this experiment is to use a benchmark to derive a new optimized cluster configuration that can handle at least 100 tasks per second, while maintaining low backpressure and low latency.

For our experiment, we use a newly defined realistic benchmark (with a more complex process model). More about this in a separate blog post.

The second goal is to scale out the optimized cluster configuration's resources linearly and see if the performance scales accordingly.

TL;DR;

We used a realistic benchmark to derive a new cluster configuration based on previous requirements.

When we scale this base configuration linearly we see that the performance increases almost linearly as well, while maintaining low backpressure and low latency.

Chaos Experiment

Expected

We expect to find a cluster configuration that can handle 100 tasks per second with significantly reduced resources compared to our smaller clusters (G3-S HA Plan), since these can process significantly more than our initial target.

We also expect that we can scale this base configuration linearly, that the task processing rate will initially grow a bit faster than linearly due to the lower relative overhead, and that it will flatten if we keep expanding further, as the partition count becomes a bottleneck.

Actual

Minimal Requirements for our Cluster

Based on known customer usage, and our own previous experiments, we determined that the new cluster would need to create and complete a baseline of 100 tasks per second, or about 8.6 million tasks per day.

Other metrics that we want to preserve and keep track of are: backpressure (to preserve the user experience), the exporting speed (to guarantee it can keep up with the processing speed), and the write-to-import latency, which tells us how long it takes for a record to go from being written to being imported by our other apps, such as Operate.

Reverse Engineering the Cluster Configuration

For our new configuration, the only resources that we are going to change are the ones relevant to the factors described above. These are the resources allocated to our Zeebe brokers, the gateway, and Elasticsearch.

Our starting point in resources was the configuration for our G3-S HA Plan as this already had the capability to significantly outperform the current goal of 100 tasks per second.

The next step was to deploy our realistic benchmark, with a payload of 5 customer disputes per instance, and start 7 instances per second. This generated approximately 120 tasks per second (some buffer was added to guarantee performance).

After this, we reduced the resources iteratively until we saw any increase in backpressure, given that there was no backlog of records and no significant increase in the write-to-import latency.

The results for our new cluster are specified below in the tables, where our starting cluster configuration is the G3-S HA Plan and the new cluster configuration is the G3 - BasePackage HA.

G3-S HA                  CPU Limit    Memory Limit in GB
operate                  2            2
operate.elasticsearch    6            6
optimize                 2            2
tasklist                 2            2
zeebe.broker             2.88         12
zeebe.gateway            0.9          0.8
TOTAL                    15.78        24.8
G3 - BasePackage HA      CPU Limit    Memory Limit in GB
operate                  1            1
operate.elasticsearch    3            4.5
optimize                 1            1.6
tasklist                 1            1
zeebe.broker             1.5          4.5
zeebe.gateway            0.6          1
TOTAL                    8.1          13.6
Reduction in Resources for our Optimized Cluster
                         CPU Reduction (%)    Memory Reduction (%)
zeebe.broker             47.92                62.5
zeebe.gateway            33.33                -25.0
operate.elasticsearch    50.00                25.0
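
As an example of how these percentages are derived: the zeebe.broker CPU limit goes from 2.88 to 1.5, a reduction of (2.88 − 1.5) / 2.88 ≈ 47.9%, and its memory limit goes from 12 GB to 4.5 GB, a reduction of (12 − 4.5) / 12 = 62.5%.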

Total cluster reduction:

                 G3-S HA    G3 - BasePackage HA    Reduction (%)
CPU Limits       15.78      8.1                    49
Memory Limits    24.8       13.6                   45

The process of reducing the hardware requirements was done initially by scaling down the resources of the Zeebe broker, gateway, and Elasticsearch. The other components were left untouched, as they had no impact on our key metrics, and were scaled down later in separate experiments to maintain user experience.

Scaling out the Cluster

Now, for the scaling procedure, we intend to see if we can linearly increase the allocated resources and get a corresponding performance increase, while keeping backpressure low, latency low, and the user experience good.

For this, we started with the G3 - BasePackage HA configuration and increased the load until we saw any increase in backpressure, captured our key metrics, and repeated the process with the cluster configuration resources multiplied by 2x, 3x, and 4x respectively.

This means that the resources allocated for our clusters were:

                 Base 1x    Base 2x    Base 3x    Base 4x
CPU Limits       8.7        17.4       26.1       34.8
Memory Limits    14.9       29.8       44.7       59.6

The results in the table below show the performance of our several cluster configurations:

                            Base 1x    Base 2x    Base 3x    Base 4x
Process Instances/s         7          12         23         27
Tasks/s                     125        217        414        486
Average Backpressure        2%         2%         3%         6%
Write-to-Import Latency     90s        120s       150s       390s
Write-to-Process Latency    140ms      89ms       200ms      160ms
Records Processed Rate      2500       4700       7800       11400
Records Exported Rate       2100       3900       6500       9200

The first observation is that the performance scales particularly well by just adding more resources to the cluster: for a linear increase of the resources, the performance as measured by tasks completed increases slightly less than linearly (comparing the 1x and 4x tasks/s, we get 388% of the initial rate).

This is a very good result, as it means that we can scale our system linearly (at least initially) to handle the expected increase in load.

Importantly, the backpressure is kept low, and the write-to-import latency only increases significantly if we leave the cluster running at max rate for long periods of time. For slightly lower rates, the write-to-import latency stays in the single digits of seconds, or the lower tens. This might imply that at these sustained max rates, the amount of records generated starts to be too much to handle for either Elasticsearch or the web apps that import these records. Some further investigation could be done here to find the bottleneck.

Another relevant metric, not shown in this table, is the backlog of records not yet exported, which stayed at almost zero throughout all the experiments conducted.

Bugs found

During the initial tests, we had several OOM errors in the gateway pods. After some investigation, we found that this was exclusive to the Camunda 8.6.0 version, which consumes more memory in the gateway than previous versions. This explains why the gateway memory limit was the only resource that was increased in the new, reduced cluster configuration.

Improve Operate import latency

· 9 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

In our last Chaos Day we experimented with Operate and different load (Zeebe throughput). We observed that a higher load caused a lower import latency in Operate. The conclusion was that it might be related to Zeebe's exporting configuration, which is affected by a higher load.

In today's chaos day we want to verify how different export and import configurations can affect the importing latency.

TL;DR; We were able to decrease the import latency by ~35% (from 5.7 to 3.7 seconds), by simply reducing the bulk.delay configuration. This worked on low load and even higher load, without significant issues.

Operate load handling

· 8 min read
Christopher Kujawa
Chaos Engineer @ Zeebe

🎉 Happy to announce that we are broadening the scope of our Chaos days, to look holistically at the whole Camunda Platform, starting today. In the past Chaos days we often had a close look (or concentrated mostly) at Zeebe performance and stability.

Today, we will look at the Operate import performance and how Zeebe processing throughput might affect (or not?) the throughput and latency of the Operate import. Is it decoupled as we thought?

The import time is an important metric, representing the time until data from Zeebe processing is visible to the user (excluding Elasticsearch's indexing). It is measured from when the record is written to the log by the Zeebe processor until Operate reads/imports it from Elasticsearch and converts it into its data model. We got much feedback (and experienced this ourselves) that Operate is often lagging behind or is too slow, and of course we want to tackle and investigate this further.

The results from this Chaos day and related benchmarks should allow us to better understand how the current importing of Operate performs, and what affects it. Likely it will be a series of posts to investigate this further. In general, the data will give us some guidance and comparable numbers for the future to improve the importing time. See also related GitHub issue #16912, which targets improving this.

TL;DR; We were not able to show that Zeebe throughput doesn't affect Operate importing time. We have seen that Operate can be positively affected by the throughput of Zeebe. Surprisingly, Operate was faster to import if Zeebe produced more data (with a higher throughput). One explanation of this might be that Operate was then less idle.