pal_uring: quiesce minircu before blocking in io_uring worker #3490

Open
smalis-msft wants to merge 1 commit into microsoft:main from smalis-msft:pal_uring-quiesce-rcu

smalis-msft (Contributor) commented May 14, 2026

Summary

Fix recurring rcu_preempt self-detected stall failures in OpenHCL on large isolated VMs (most visible on the x64-windows-amd-snp CI runner, e.g. the memory_validation_debug_very_heavy test) by having io-uring worker threads quiesce the global minircu domain immediately before blocking in io_uring_enter.

Root cause

Every pal_uring threadpool worker (the threads named tp) registers itself as a non-quiesced reader in the global minircu domain the first time it polls a future that enters an RCU read-side critical section — which happens as soon as a worker touches anything in guestmem. After that point the worker stays registered for the lifetime of the thread.

The only production caller of minircu::global().quiesce() is the VTL2 VP loop in openhcl/virt_mshv_vtl/src/processor/mod.rs. Threadpool workers never quiesced, so every guestmem::rcu().synchronize_blocking() writer in openhcl/underhill_mem/src/lib.rs (five hot call sites — page visibility / VTL protection updates) was forced to broadcast membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED) to every CPU.
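The effect described above can be illustrated with a deliberately simplified model. This is a sketch, not minircu's implementation: `ToyDomain`, `ReaderState`, and `writer_needs_broadcast` are hypothetical names, and the real writer path issues `membarrier` rather than returning a flag. It only shows why a single registered reader that never quiesces keeps every writer on the expensive path.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};

/// Toy stand-in for one thread's reader state (hypothetical, not minircu's type).
pub struct ReaderState {
    /// `true` once the thread has declared it holds no RCU-protected references.
    quiesced: AtomicBool,
}

impl ReaderState {
    /// Mark the thread as quiesced, e.g. right before it blocks.
    pub fn quiesce(&self) {
        self.quiesced.store(true, Ordering::SeqCst);
    }

    /// Mark the thread as a live reader again when it re-enters a critical section.
    pub fn enter(&self) {
        self.quiesced.store(false, Ordering::SeqCst);
    }
}

/// Toy stand-in for an RCU domain that tracks every registered reader.
#[derive(Default)]
pub struct ToyDomain {
    readers: Mutex<Vec<Arc<ReaderState>>>,
}

impl ToyDomain {
    /// Register a thread on its first read-side critical section.
    /// The thread starts out non-quiesced.
    pub fn register(&self) -> Arc<ReaderState> {
        let state = Arc::new(ReaderState {
            quiesced: AtomicBool::new(false),
        });
        self.readers.lock().unwrap().push(state.clone());
        state
    }

    /// Writer side: in this model the expensive process-wide barrier is
    /// needed whenever any registered reader has not quiesced.
    pub fn writer_needs_broadcast(&self) -> bool {
        self.readers
            .lock()
            .unwrap()
            .iter()
            .any(|r| !r.quiesced.load(Ordering::SeqCst))
    }
}
```

In this model a worker that registers once and then idles forever leaves `writer_needs_broadcast()` returning `true` until it calls `quiesce()`, which is the situation the threadpool workers were in.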

On a 64-VP SNP CVM with most VPs running VTL0 guest code, the IPI broadcast can stall for tens of seconds waiting for VPs to context-switch, long enough to trigger:

rcu: INFO: rcu_preempt self-detected stall on CPU
...
smp_call_function_many_cond+0x113/0x300
__x64_sys_membarrier+0x27c/0x360

inside one of the tp workers, after which dependent tests eventually hit their nextest timeout. This was observed e.g. on PR #3487 CI in run 25867445619 on x64-windows-amd-snp.

This is essentially the same class of symptom as #2334 (TDX AP-start stall) but with a different trigger.

Fix

minircu's own docs (support/minircu/src/lib.rs) prescribe the fix:

"For best performance, ensure all threads in your process call quiesce when a thread is going to sleep or block."

Call minircu::global().quiesce() in the io-uring worker loop just before submit_and_wait. Once quiesced, the worker is invisible to membarrier broadcasts until it next enters a critical section, at which point ThreadData::enter_slow already emits a fence(SeqCst) to publish the transition — so correctness is preserved. This mirrors the existing quiesce call in the VTL2 VP loop.
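As a rough sketch of where the call sits, the loop below uses stand-ins throughout: the `Quiesce` trait and `CountingDomain` are hypothetical substitutes for `minircu::global()`, and a channel `recv()` takes the place of the blocking `submit_and_wait`. It is not the actual `threadpool.rs` change, only the shape of "quiesce, then block".

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::Receiver;

/// Hypothetical hook standing in for `minircu::global().quiesce()`.
pub trait Quiesce {
    fn quiesce(&self);
}

/// Test double that only counts how often the worker quiesced.
#[derive(Default)]
pub struct CountingDomain {
    pub calls: AtomicUsize,
}

impl Quiesce for CountingDomain {
    fn quiesce(&self) {
        self.calls.fetch_add(1, Ordering::Relaxed);
    }
}

/// Toy worker loop. `rx.recv()` stands in for the blocking wait on the ring.
/// Returns the number of items processed once the sender hangs up.
pub fn worker_loop<D: Quiesce>(domain: &D, rx: Receiver<u32>) -> u32 {
    let mut processed = 0;
    loop {
        // Declare that this thread holds no RCU-protected references
        // before it may sleep for an unbounded time.
        domain.quiesce();
        match rx.recv() {
            // Real work (which may re-enter a critical section) would go here.
            Ok(_item) => processed += 1,
            // All senders dropped: the worker shuts down.
            Err(_) => return processed,
        }
    }
}
```

The quiesce call runs once per trip around the loop, including the final one that observes shutdown, so an idle worker is always quiesced while it sleeps.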

Cost: a single TLS load + relaxed atomic store per idle cycle on the worker. No-op for threads that have never entered a critical section.
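The cost claim can be pictured with a small thread-local model. This is an assumption-laden sketch rather than minircu's code: `ACTIVE`, `toy_enter`, `toy_quiesce`, and `toy_is_active` are hypothetical, and in a real implementation the flag would be shared with writers instead of living only in thread-local storage.

```rust
use std::cell::RefCell;
use std::sync::atomic::{AtomicBool, Ordering};

thread_local! {
    // `None` until this thread first enters a read-side critical section.
    static ACTIVE: RefCell<Option<AtomicBool>> = RefCell::new(None);
}

/// Toy `enter`: lazily registers the thread and marks it as a live reader.
pub fn toy_enter() {
    ACTIVE.with(|slot| {
        slot.borrow_mut()
            .get_or_insert_with(|| AtomicBool::new(false))
            .store(true, Ordering::SeqCst);
    });
}

/// Toy `quiesce`: one thread-local lookup plus one relaxed store.
/// Returns `false` (nothing to do) if the thread never registered.
pub fn toy_quiesce() -> bool {
    ACTIVE.with(|slot| match slot.borrow().as_ref() {
        Some(active) => {
            active.store(false, Ordering::Relaxed);
            true
        }
        None => false,
    })
}

/// Reports whether this thread currently counts as a live reader.
pub fn toy_is_active() -> bool {
    ACTIVE.with(|slot| {
        slot.borrow()
            .as_ref()
            .map_or(false, |a| a.load(Ordering::Relaxed))
    })
}
```

A thread that never called `toy_enter` takes the `None` branch and does no store at all, which corresponds to the no-op case mentioned above.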

Every io-uring worker thread registers itself as a non-quiesced minircu
reader the first time it polls a future that enters an RCU read-side
critical section (e.g. anything touching guestmem). Workers then stay
registered for the lifetime of the thread, so every
guestmem::rcu().synchronize_blocking() writer in underhill_mem (page
visibility / VTL protection changes) is forced to broadcast
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED) to every CPU.

On large isolated VMs (observed on 64-VP SNP) that broadcast IPI can
stall long enough waiting for VPs running VTL0 to context-switch that
the openhcl kernel logs

  rcu: INFO: rcu_preempt self-detected stall on CPU
  ...
  smp_call_function_many_cond+0x113/0x300
  __x64_sys_membarrier+0x27c/0x360

from inside one of the 'tp' worker threads, and dependent tests
eventually hit their nextest timeout.

minircu's own docs prescribe the fix:

  // For best performance, ensure all threads in your process call
  // 'quiesce' when a thread is going to sleep or block.

Call minircu::global().quiesce() in the io-uring worker loop immediately
before submit_and_wait. Once quiesced, the worker is invisible to
membarrier broadcasts until it next enters a critical section, at which
point ThreadData::enter_slow already emits a fence(SeqCst) to publish
the transition, so correctness is preserved. This mirrors the existing
quiesce call in the VTL2 VP loop in virt_mshv_vtl.
smalis-msft requested a review from a team as a code owner May 14, 2026 16:51
Copilot AI review requested due to automatic review settings May 14, 2026 16:51
github-actions bot added the "unsafe" (Related to unsafe code) label May 14, 2026
@github-actions

⚠️ Unsafe Code Detected

This PR modifies files containing unsafe Rust code. Extra scrutiny is required during review.

For more on why we check whole files, instead of just diffs, check out the Rustonomicon

Copilot AI left a comment

Pull request overview

This PR addresses recurring Linux kernel rcu_preempt self-detected stall warnings seen in OpenHCL on large isolated VMs by ensuring pal_uring io_uring worker threads explicitly quiesce the global minircu domain immediately before blocking in io_uring_enter (via submit_and_wait). This reduces unnecessary process-wide membarrier(PRIVATE_EXPEDITED) broadcasts triggered by synchronize_blocking() writers when idle worker threads remain registered as non-quiesced readers.

Changes:

  • Quiesce the global minircu domain in the io_uring worker loop just before blocking on submit_and_wait.
  • Add minircu as a Linux-only dependency of pal_uring (and update Cargo.lock accordingly).

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| support/pal/pal_uring/src/threadpool.rs | Quiesces the global RCU domain before blocking in the worker loop to avoid unnecessary membarrier broadcasts from writers. |
| support/pal/pal_uring/Cargo.toml | Adds minircu as a target_os = "linux" dependency needed for the new quiesce call. |
| Cargo.lock | Records the new pal_uring -> minircu dependency edge. |

Labels

unsafe Related to unsafe code

3 participants