Extend styx metrics + grafana dashboard & implement additional client load patterns#36

Open
Erhan1706 wants to merge 3 commits intodelftdata:mainfrom
Erhan1706:extensive-metrics
Conversation

Contributor

@Erhan1706 Erhan1706 commented Apr 14, 2026

In my previous meeting, Oto and Asterios mentioned that some bachelor students will be starting their theses on Styx, and that the tooling I developed in my fork might be useful for them. This PR contributes a portion of those changes, focused on metrics and load generation.

Summary

  • Extended metric collection & Grafana dashboard — added per-phase and per-operator metrics, backlog, input rate, and visual annotations for migration start/end events.
(screenshot: Grafana dashboard, captured 2026-04-14)
  • Configurable client load patterns — clients previously only supported a constant TPS for the entire experiment duration. This adds support for dynamic patterns: increasing, decreasing, cosine, random, and step.

Changes:

  • Implemented configurable load patterns; the code was mostly adapted from https://github.com/delftdata/espa-autoscaling
    • Increasing (includes some randomness, so the rate occasionally dips, but it is biased toward an upward trend over time)
    • Decreasing (the mirror of increasing)
    • Cosine
    • Random
    • Step (4000 TPS for the first 30 seconds, 5000 for the next 30, and so on)

Usage: `./scripts/run_experiment.sh dhr 1000 100000 2 0 1 200 results 10 1000 cosine`
The pattern argument can also be left blank (for backwards compatibility), in which case the constant input generator is used.
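To make the patterns concrete, here is a minimal sketch of what rate generators like these could look like. This is an illustration only, not the PR's implementation (which is adapted from espa-autoscaling); the function names and default parameters are my own assumptions.

```python
import math
import random

def cosine_rate(t, base=3000.0, amplitude=1000.0, period=60.0):
    """Cosine pattern: TPS oscillates around `base` with the given period (s)."""
    return base + amplitude * math.cos(2 * math.pi * t / period)

def step_rate(t, start=4000.0, step=1000.0, interval=30.0):
    """Step pattern: 4000 TPS for the first 30 s, 5000 for the next 30 s, ..."""
    return start + step * int(t // interval)

def increasing_rate(t, base=1000.0, slope=50.0, jitter=200.0, rng=None):
    """Increasing pattern: noisy (may dip briefly) but biased upward over time."""
    rng = rng or random.Random(42)  # fixed seed for reproducible experiments
    return max(0.0, base + slope * t + rng.uniform(-jitter, jitter))
```

A decreasing pattern would simply negate the slope, and a random pattern would sample uniformly between configured bounds each interval.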

  • Implemented the following new metrics:
    • CPU and memory usage for each of the Aria phases (first run, sync, conflict resolution, etc.)
    • Per-operator metrics (how many times each operator is called per second, and the latency of each operator)
    • Consumption rate of the workers (to approximate the input rate each worker sees, now that there are different input patterns)
    • Backlog (lag)
    • Number of live workers
    • Number of transactions committed in the lock-free and fallback phases, plus the abort count for each phase
    • Percentage of each epoch spent on CPU tasks versus I/O tasks
    • Annotations that mark the start and end of migrations
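As a rough illustration of the per-phase idea, the core of such a tracker can be sketched as a context manager that accumulates time per phase and reports each phase's share of the epoch. This is a hypothetical sketch, not the code in `worker/util/phase_resource_tracker.py`, which also tracks CPU and memory:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class PhaseTracker:
    """Accumulate wall-clock time per Aria phase within an epoch (sketch)."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        # Time everything executed inside `with tracker.phase("sync"): ...`
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def epoch_breakdown(self):
        # Fraction of the measured epoch time spent in each phase.
        total = sum(self.totals.values())
        return {name: t / total for name, t in self.totals.items()} if total else {}
```

The same pattern extends naturally to CPU time (via `time.process_time`) or memory snapshots taken at phase boundaries.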

Metric overhead

To measure the overhead of the extra metrics, I ran a small benchmark: the same workloads with the extra metrics enabled and with them disabled.

  • In terms of CPU%, the worst cases showed increases of 1-4%, but overall the differences were not statistically significant enough to attribute to the metrics.
  • In terms of transaction latency there is substantial overhead: an increase of roughly 70-100%, close to doubling the transaction latency in the worst case.
    For example, these are the latency results for 4 runs of ycsbt with 2 workers, input rate = 3000, n_keys = 100000, over 4 minutes:

[With metrics]
run1.csv (16 samples, mean=8.00)
run2.csv (16 samples, mean=7.92)
run3.csv (15 samples, mean=7.29)
run4.csv (15 samples, mean=8.38)

[Without metrics]
run1.csv (16 samples, mean=4.14)
run2.csv (16 samples, mean=3.68)
run3.csv (16 samples, mean=3.29)
run4.csv (15 samples, mean=4.46)
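Aggregating the per-run means above gives a quick estimate of the relative latency overhead (units as reported in the CSVs). A tiny sketch, using only the numbers listed:

```python
# Per-run mean latencies copied from the benchmark output above.
with_metrics = [8.00, 7.92, 7.29, 8.38]
without_metrics = [4.14, 3.68, 3.29, 4.46]

def mean(xs):
    return sum(xs) / len(xs)

mean_with = mean(with_metrics)
mean_without = mean(without_metrics)
# Relative increase of the pooled means (1.0 == +100%).
relative_increase = mean_with / mean_without - 1

print(f"with metrics: {mean_with:.2f}, without: {mean_without:.2f}, "
      f"increase: {relative_increase:.0%}")
```

With only four runs per configuration a proper significance test (e.g. Welch's t-test) would be worthwhile before drawing firm conclusions.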

The metrics are quite extensive; some of them (such as the per-phase metrics) I am not currently using for my autoscaling policy in my fork, but I decided to include them for completeness. If the overhead is too high, I can remove some of them to reduce the cost.


codecov bot commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 13.04348% with 120 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.38%. Comparing base (0d9fcab) to head (bced06e).

Files with missing lines               Patch %   Lines
worker/util/phase_resource_tracker.py    0.00%   63 Missing ⚠️
worker/util/epoch_metrics_builder.py     0.00%   48 Missing ⚠️
coordinator/worker_pool.py              14.28%    6 Missing ⚠️
styx-package/styx/common/metrics.py      0.00%    3 Missing ⚠️
Additional details and impacted files

@@            Coverage Diff             @@
##             main      #36      +/-   ##
==========================================
- Coverage   88.14%   84.38%   -3.77%     
==========================================
  Files          45       48       +3     
  Lines        2616     2754     +138     
==========================================
+ Hits         2306     2324      +18     
- Misses        310      430     +120     
Flag           Coverage Δ
coordinator    92.30% <14.28%> (-1.10%) ⬇️
integration     9.04% <5.79%> (-0.18%) ⬇️
styx-package   84.85% <66.66%> (-0.12%) ⬇️
worker         72.15% <9.01%> (-11.35%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines                 Coverage Δ
styx-package/styx/common/operator.py     100.00% <100.00%> (ø)
worker/ingress/styx_kafka_ingress.py      93.81% <100.00%> (+0.79%) ⬆️
styx-package/styx/common/metrics.py        0.00% <0.00%> (ø)
coordinator/worker_pool.py                89.84% <14.28%> (-4.38%) ⬇️
worker/util/epoch_metrics_builder.py       0.00% <0.00%> (ø)
worker/util/phase_resource_tracker.py      0.00% <0.00%> (ø)
@Erhan1706 Erhan1706 changed the title Extensive metrics Extend styx metrics + grafana dashboard & implement additional client load patterns Apr 14, 2026
@kPsarakis
Member

Hey @Erhan1706! Thanks a lot for the contribution! I will need some time to review this. In the meantime, could you check why the e2e tests are failing?
