Skip to content

[VQueues] Track per-entry wait stats and error counts#4894

Open
AhmedSoliman wants to merge 1 commit into
mainfrom
pr4894
Open

[VQueues] Track per-entry wait stats and error counts#4894
AhmedSoliman wants to merge 1 commit into
mainfrom
pr4894

Conversation

@AhmedSoliman
Copy link
Copy Markdown
Member

Reworks how scheduler wait-time accounting is persisted, replacing the two
ad-hoc EMA fields with the full WaitStats breakdown:

  • VQueueStatistics now keeps an EMA over all seven WaitStats buckets
    (avg_wait_stats), applied only on scheduler-driven moves so zero samples
    from non-scheduler transitions no longer dilute the averages.
  • EntryStatistics records latest_attempt_wait_stats and a saturating
    total_wait_stats across all attempts, plus a new num_errors counter:
    yields caused by transient errors now count as errors instead of yields.
  • MoveMetrics carries Option<WaitStats> instead of two loose u32s.
  • WaitStats switches to fixed encoding and gains ema_apply / saturating
    Add.

Surfaced in SQL: sys_vqueues.num_errors, and
sys_vqueue_meta.avg_blocked_on_lock / avg_blocked_on_invoker_concurrency.
Entry Status display is now kebab-case (backing-off) to match
sys_invocation conventions.

Note: breaks the persisted codec for vqueue entries, meta, and pending merge
operands (tag renumbering/type changes). Acceptable while vqueues remain
experimental and fresh-cluster only.

Reworks how scheduler wait-time accounting is persisted, replacing the two
ad-hoc EMA fields with the full `WaitStats` breakdown:

- `VQueueStatistics` now keeps an EMA over all seven `WaitStats` buckets
  (`avg_wait_stats`), applied only on scheduler-driven moves so zero samples
  from non-scheduler transitions no longer dilute the averages.
- `EntryStatistics` records `latest_attempt_wait_stats` and a saturating
  `total_wait_stats` across all attempts, plus a new `num_errors` counter:
  yields caused by transient errors now count as errors instead of yields.
- `MoveMetrics` carries `Option<WaitStats>` instead of two loose `u32`s.
- `WaitStats` switches to fixed encoding and gains `ema_apply` / saturating
  `Add`.

Surfaced in SQL: `sys_vqueues.num_errors`, and
`sys_vqueue_meta.avg_blocked_on_lock` / `avg_blocked_on_invoker_concurrency`.
Entry `Status` display is now kebab-case (`backing-off`) to match
`sys_invocation` conventions.

Note: breaks the persisted codec for vqueue entries, meta, and pending merge
operands (tag renumbering/type changes). Acceptable while vqueues remain
experimental and fresh-cluster only.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6ba0bafc23

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +50 to +51
$this.$attr = if $this.$attr == 0 {
$other.$attr
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Include zero samples in per-bucket EMAs

When a wait bucket has only seen zero-valued scheduler samples, this branch keeps treating the bucket as uninitialized and replaces it with the first nonzero sample instead of blending it with the earlier zero samples. For example, after many successful runs with blocked_on_lock_ms == 0, the first locked run will make sys_vqueue_meta.avg_blocked_on_lock jump to the full wait time; VQueueMeta::apply_update calls ema_apply for each scheduler-provided sample on Running transitions, so those zero attempts should affect the average rather than being ignored.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, this is intentional to be reflect positive changes from zero quicker than the decay, otherwise it'll take long until users start observing.

@AhmedSoliman AhmedSoliman requested a review from tillrohrmann June 5, 2026 21:03
@AhmedSoliman AhmedSoliman added this to the 1.7 milestone Jun 5, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 5, 2026

Test Results

  8 files  ±0    8 suites  ±0   4m 44s ⏱️ -1s
 60 tests ±0   60 ✅ ±0  0 💤 ±0  0 ❌ ±0 
267 runs  ±0  267 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit 6ba0baf. ± Comparison against base commit cb8c35a.

♻️ This comment has been updated with latest results.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant