Improve Operate import latency
In our last Chaos Day we experimented with Operate and different loads (Zeebe throughput). We observed that a higher load caused a lower import latency in Operate. Our conclusion was that this might be related to Zeebe's exporting configuration, whose behavior changes under higher load.
In today's chaos day we want to verify how different export and import configurations can affect the importing latency.
TL;DR: We were able to decrease the import latency by ~35% (from 5.7 to 3.7 seconds) simply by reducing the bulk.delay configuration. This worked on low load and even on higher load, without significant issues.
Background
In the following, I want to briefly explain the background of how exporting and importing play together. If you are already familiar with this, feel free to jump to the next section.
To understand how the importing of Operate is affected and works, we first have to take a look at Zeebe.
Zeebe exports data to Elasticsearch via its Elasticsearch Exporter. The exporter collects records and sends them to Elasticsearch in bulk requests. The amount of data collected per bulk is configurable and by default set to 1000 records per batch/bulk. Additionally, there is a memory limit of 10 MB; when a bulk request reaches that size, it is sent as well. To cover cases of low load, there is a delay option, which by default is set to 5 seconds. This means that every 5 seconds the bulk request is sent, even if it is not full.
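For reference, here is roughly how these defaults map to the broker's exporter configuration. This is only a sketch from memory of the Zeebe configuration template; the className and exact key names are assumptions, while the bulk values are the defaults described above:

exporters:
  elasticsearch:
    className: io.camunda.zeebe.exporter.ElasticsearchExporter  # assumed default class name
    args:
      bulk:
        delay: 5               # seconds to wait before flushing a non-full bulk
        size: 1000             # records per bulk request
        memoryLimit: 10485760  # 10 MB; the bulk is flushed once it reaches this size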
This also explains the results from our last Chaos Day, where the import latency was around 5 seconds on a lower load.
In the following, we have written down the sequence of steps a command and its resulting events have to take until they are visible to the user in Operate. This should make it easier to understand how and by what the import latency is affected, and what we might want to tune and experiment with further.
User Command is sent to Gateway
-->Gateway sends Command to the right Broker
---->Broker processes command and produces events
------>Events are exported by Broker to ES (worst case: 5s flush)
-------->ES refreshes after one second
---------->Operate import processing/rewriting data
------------>ES refreshes after one second
-------------->Operate can query the data -> User can see the data
You can read more about Elasticsearch and its default refresh configuration here.
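To illustrate where the two one-second delays in the sequence above come from: Elasticsearch only makes newly indexed documents visible to searches after a refresh, which by default happens every second. As a hedged sketch (the host and the zeebe-record* index pattern are only illustrative assumptions; we did not change this setting today), the refresh interval of the Zeebe record indices could be inspected or set like this:

# show the effective refresh interval (falls back to the default of 1s)
curl -X GET "http://elasticsearch:9200/zeebe-record*/_settings?include_defaults=true&filter_path=**.refresh_interval"

# example: explicitly set the refresh interval (here simply the default again)
curl -X PUT "http://elasticsearch:9200/zeebe-record*/_settings" \
  -H 'Content-Type: application/json' \
  -d '{ "index": { "refresh_interval": "1s" } }'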
Based on this, we know we have the following minimum delay:
delay = 2 seconds (due to ES refresh)
+ (5 seconds from exporter on low load)
+ network delay
+ processing delay
+ Exporter and Operate data un/marshaling/processing
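As a rough back-of-the-envelope check (assuming a record arrives at a random point within the flush interval, so on low load it waits on average half of the 5 seconds):

expected low-load latency ≈ 2.5 seconds (average wait for the 5 second exporter flush)
+ 2 seconds (two Elasticsearch refreshes)
+ network, processing and un/marshaling overhead

which lands somewhere around 5 seconds and is in line with what we observed in the last Chaos Day.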
Today, we will experiment with the Elasticsearch exporter configurations to improve the import latency.
Chaos Experiment
As we have seen in a previous Chaos Day, high load affects the import latency positively. Our thesis is that this is due to the exporter flush delay, which mostly affects exporting on lower load.
Today we want to prove the following:
Hypothesis
When we set the exporting/flush delay to a lower value (ex. 1 second), we are improving the import latency for lower load scenarios without affecting the system negatively.
We can define the following unknowns that we want to explore further as well:
- It is not clear how a lower flush delay affects the system on higher loads.
- It is not clear how smaller values (under 1 second) for the flush delay affect the system, regardless of high or low load.
Expected
- When we set the exporting/flush delay to a lower value (ex. 1 second), we are improving the import latency for lower load scenarios without affecting the system negatively.
- When we set the exporting/flush delay to a lower value (ex. 1 second), we are improving the import latency for higher load scenarios, but decreasing the import throughput.
- When we set the exporting/flush delay to a small value (under 1 second), we are affecting the import throughput negatively.
Actual
As always, we set a base installation up to compare against. The load is moderate-to-low (15 PI/s). We can compare the data from the last chaos day here as well.
Base: Helm install command
helm install $(releaseName) $(chartPath) --render-subchart-notes \
--set global.image.tag=ck-operate-benchmark-1ad8f375 \
--set camunda-platform.zeebe.image.repository=gcr.io/zeebe-io/zeebe \
--set camunda-platform.zeebe.image.tag=ck-operate-benchmark-1ad8f375 \
--set camunda-platform.zeebeGateway.image.repository=gcr.io/zeebe-io/zeebe \
--set camunda-platform.zeebeGateway.image.tag=ck-operate-benchmark-1ad8f375 \
--set starter.rate=5 \
--set worker.replicas=1 \
--set timer.replicas=1 \
--set timer.rate=5 \
--set publisher.replicas=1 \
--set publisher.rate=5 \
--set camunda-platform.operate.enabled=true \
--set camunda-platform.operate.image.repository=gcr.io/zeebe-io/operate \
--set camunda-platform.operate.image.tag=ck-operate-benchmark \
--set camunda-platform.elasticsearch.master.persistence.size=128Gi \
--set camunda-platform.zeebe.retention.minimumAge=1d
We see similar results as on the last Chaos Day.
We are able to import around 360 records per second, while Zeebe exports 413. Be aware that some records are ignored by Operate. A record has on average a delay of 5.69 seconds from being written by Zeebe to being imported by Operate (and written into the final Elasticsearch index).
First experiment: Lower flush delay
When we set the exporting/flush delay to a lower value (ex. 1 second), we are improving the import latency for lower load scenarios without affecting the system negatively.
To reduce the exporter flush delay we use the following configuration:
exporters:
  elasticsearch:
    args:
      bulk:
        delay: 1
This can be set in our benchmark-helm directly via: --set zeebe.config.zeebe.broker.exporters.elasticsearch.args.bulk.delay=1
Lower flush delay: Helm install command
helm install $(releaseName) $(chartPath) --render-subchart-notes \
--set global.image.tag=ck-operate-benchmark-1ad8f375 \
--set camunda-platform.zeebe.image.repository=gcr.io/zeebe-io/zeebe \
--set camunda-platform.zeebe.image.tag=ck-operate-benchmark-1ad8f375 \
--set camunda-platform.zeebeGateway.image.repository=gcr.io/zeebe-io/zeebe \
--set camunda-platform.zeebeGateway.image.tag=ck-operate-benchmark-1ad8f375 \
--set starter.rate=5 \
--set worker.replicas=1 \
--set timer.replicas=1 \
--set timer.rate=5 \
--set publisher.replicas=1 \
--set publisher.rate=5 \
--set camunda-platform.operate.enabled=true \
--set camunda-platform.operate.image.repository=gcr.io/zeebe-io/operate \
--set camunda-platform.operate.image.tag=ck-operate-benchmark \
--set camunda-platform.elasticsearch.master.persistence.size=128Gi \
--set camunda-platform.zeebe.retention.minimumAge=1d \
--set zeebe.config.zeebe.broker.exporters.elasticsearch.args.bulk.delay=1
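Before looking at the results, we can make sure the override actually ended up in the release. As a small sanity-check sketch (using the same $(releaseName) placeholder as above), the user-supplied values can be inspected via Helm:

# prints the user-supplied values, including the bulk delay override
helm get values $(releaseName)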
By setting the bulk.delay to one second, we were able to reduce the import latency by ~2 seconds, from 5.69 to 3.68 seconds.
That is a ~35% decrease, while other factors stay the same. We can observe that the throughput stays the same (though, of course, the load is rather moderate-to-low).
This proved our first hypothesis from above.