Extend styx metrics + grafana dashboard & implement additional client load patterns#36

Open
Erhan1706 wants to merge 3 commits intodelftdata:mainfrom
Erhan1706:extensive-metrics
Conversation

Contributor

@Erhan1706 Erhan1706 commented Apr 14, 2026

In my previous meeting, Oto and Asterios mentioned that some bachelor students will be starting their theses on Styx, and that the tooling I developed in my fork might be useful for them. This PR contributes a portion of those changes, focused on metrics and load generation.

Summary

  • Extended metric collection & Grafana dashboard — added per-phase and per-operator metrics, backlog, input rate, and visual annotations for migration start/end events.
(screenshot: Grafana dashboard, captured 2026-04-14)
  • Configurable client load patterns — clients previously only supported a constant TPS for the entire experiment duration. This adds support for dynamic patterns: increasing, decreasing, cosine, random, and step.

Changes:

  • Implemented configurable load patterns; the code was mostly adapted from https://github.com/delftdata/espa-autoscaling
    • Increasing (includes some randomness, so the rate occasionally dips, but it is biased toward an upward trend over time)
    • Decreasing (the mirror of increasing)
    • Cosine
    • Random
    • Step (4000 TPS for the first 30 seconds, 5000 for the next 30, and so on)

Usage: `./scripts/run_experiment.sh dhr 1000 100000 2 0 1 200 results 10 1000 cosine`
The pattern argument can also be left blank (for backwards compatibility), in which case the constant input generator is used.
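To make the patterns concrete, here is a minimal sketch of what rate generators like these could look like. This is an illustration only, not the PR's implementation (which is adapted from espa-autoscaling); the function names and default parameters are my own assumptions.

```python
import math
import random

def cosine_rate(t, base=3000.0, amplitude=1000.0, period=60.0):
    """Cosine pattern: TPS oscillates around `base` with the given period (s)."""
    return base + amplitude * math.cos(2 * math.pi * t / period)

def step_rate(t, start=4000.0, step=1000.0, interval=30.0):
    """Step pattern: 4000 TPS for the first 30 s, 5000 for the next 30 s, ..."""
    return start + step * int(t // interval)

def increasing_rate(t, base=1000.0, slope=50.0, jitter=200.0, rng=None):
    """Increasing pattern: noisy (may dip briefly) but biased upward over time."""
    rng = rng or random.Random(42)  # fixed seed for reproducible experiments
    return max(0.0, base + slope * t + rng.uniform(-jitter, jitter))
```

A decreasing pattern would simply negate the slope, and a random pattern would sample uniformly between configured bounds each interval.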

  • Implemented the following new metrics:
    • CPU and memory usage for each of the Aria phases (first run, sync, conflict resolution, etc.)
    • Per-operator metrics (how many times each operator is called per second, and the latency of each operator)
    • Consumption rate of the workers (to approximate the input rate each worker sees, now that there are different input patterns)
    • Backlog (lag)
    • Number of live workers
    • Number of transactions committed in the lock-free and fallback phases, plus the abort count for each phase
    • Percentage of each epoch spent on CPU tasks versus I/O tasks
    • Annotations that mark the start and end of migrations
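As a rough illustration of the per-phase idea, the core of such a tracker can be sketched as a context manager that accumulates time per phase and reports each phase's share of the epoch. This is a hypothetical sketch, not the code in `worker/util/phase_resource_tracker.py`, which also tracks CPU and memory:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class PhaseTracker:
    """Accumulate wall-clock time per Aria phase within an epoch (sketch)."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        # Time everything executed inside `with tracker.phase("sync"): ...`
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def epoch_breakdown(self):
        # Fraction of the measured epoch time spent in each phase.
        total = sum(self.totals.values())
        return {name: t / total for name, t in self.totals.items()} if total else {}
```

The same pattern extends naturally to CPU time (via `time.process_time`) or memory snapshots taken at phase boundaries.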

Metric overhead

To measure the overhead of the extra metrics, I ran a small benchmark: the same workloads with the extra metrics enabled and with them disabled.

  • In terms of CPU%, the worst cases showed increases of 1-4%, but overall the differences were not statistically significant enough to attribute to the metrics.
  • In terms of transaction latency there is substantial overhead: an increase of roughly 70-100%, close to doubling the transaction latency in the worst case.
    For example, these are the latency results for 4 runs of ycsbt with 2 workers, input rate = 3000, n_keys = 100000, over 4 minutes:

[With metrics]
run1.csv (16 samples, mean=8.00)
run2.csv (16 samples, mean=7.92)
run3.csv (15 samples, mean=7.29)
run4.csv (15 samples, mean=8.38)

[Without metrics]
run1.csv (16 samples, mean=4.14)
run2.csv (16 samples, mean=3.68)
run3.csv (16 samples, mean=3.29)
run4.csv (15 samples, mean=4.46)
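Aggregating the per-run means above gives a quick estimate of the relative latency overhead (units as reported in the CSVs). A tiny sketch, using only the numbers listed:

```python
# Per-run mean latencies copied from the benchmark output above.
with_metrics = [8.00, 7.92, 7.29, 8.38]
without_metrics = [4.14, 3.68, 3.29, 4.46]

def mean(xs):
    return sum(xs) / len(xs)

mean_with = mean(with_metrics)
mean_without = mean(without_metrics)
# Relative increase of the pooled means (1.0 == +100%).
relative_increase = mean_with / mean_without - 1

print(f"with metrics: {mean_with:.2f}, without: {mean_without:.2f}, "
      f"increase: {relative_increase:.0%}")
```

With only four runs per configuration a proper significance test (e.g. Welch's t-test) would be worthwhile before drawing firm conclusions.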

The metrics are quite extensive; some of them (such as the per-phase metrics) I am not currently using for my autoscaling policy in my fork, but I decided to include them for completeness. If the overhead is too high, I can remove some of them to reduce the cost.


codecov bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 13.04348% with 120 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.38%. Comparing base (0d9fcab) to head (bced06e).

Files with missing lines               Patch %   Lines
worker/util/phase_resource_tracker.py    0.00%   63 Missing ⚠️
worker/util/epoch_metrics_builder.py     0.00%   48 Missing ⚠️
coordinator/worker_pool.py              14.28%    6 Missing ⚠️
styx-package/styx/common/metrics.py      0.00%    3 Missing ⚠️
Additional details and impacted files

@@            Coverage Diff             @@
##             main      #36      +/-   ##
==========================================
- Coverage   88.14%   84.38%   -3.77%     
==========================================
  Files          45       48       +3     
  Lines        2616     2754     +138     
==========================================
+ Hits         2306     2324      +18     
- Misses        310      430     +120     
Flag           Coverage Δ
coordinator    92.30% <14.28%> (-1.10%) ⬇️
integration     9.04% <5.79%> (-0.18%) ⬇️
styx-package   84.85% <66.66%> (-0.12%) ⬇️
worker         72.15% <9.01%> (-11.35%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines                 Coverage Δ
styx-package/styx/common/operator.py     100.00% <100.00%> (ø)
worker/ingress/styx_kafka_ingress.py      93.81% <100.00%> (+0.79%) ⬆️
styx-package/styx/common/metrics.py        0.00% <0.00%> (ø)
coordinator/worker_pool.py                89.84% <14.28%> (-4.38%) ⬇️
worker/util/epoch_metrics_builder.py       0.00% <0.00%> (ø)
worker/util/phase_resource_tracker.py      0.00% <0.00%> (ø)
@Erhan1706 Erhan1706 changed the title Extensive metrics Extend styx metrics + grafana dashboard & implement additional client load patterns Apr 14, 2026
@kPsarakis
Member

Hey @Erhan1706! Thanks a lot for the contribution! I will need some time to review this. In the meantime, could you check why the e2e tests are failing?
