pal_uring: quiesce minircu before blocking in io_uring worker #3490
smalis-msft wants to merge 1 commit into
Conversation
Every io-uring worker thread registers itself as a non-quiesced minircu reader the first time it polls a future that enters an RCU read-side critical section (e.g. anything touching guestmem). Workers then stay registered for the lifetime of the thread, so every guestmem::rcu().synchronize_blocking() writer in underhill_mem (page visibility / VTL protection changes) is forced to broadcast membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED) to every CPU.

On large isolated VMs (observed on 64-VP SNP) that broadcast IPI can stall long enough waiting for VPs running VTL0 to context-switch that the openhcl kernel logs

    rcu: INFO: rcu_preempt self-detected stall on CPU
    ...
    smp_call_function_many_cond+0x113/0x300
    __x64_sys_membarrier+0x27c/0x360

from inside one of the 'tp' worker threads, and dependent tests eventually hit their nextest timeout.

minircu's own docs prescribe the fix:

    // For best performance, ensure all threads in your process call
    // 'quiesce' when a thread is going to sleep or block.

Call minircu::global().quiesce() in the io-uring worker loop immediately before submit_and_wait. Once quiesced, the worker is invisible to membarrier broadcasts until it next enters a critical section, at which point ThreadData::enter_slow already emits a fence(SeqCst) to publish the transition, so correctness is preserved. This mirrors the existing quiesce call in the VTL2 VP loop in virt_mshv_vtl.
This PR modifies files containing `unsafe` Rust code. For more on why we check whole files, instead of just diffs, check out the Rustonomicon.
Pull request overview
This PR addresses recurring Linux kernel rcu_preempt self-detected stall warnings seen in OpenHCL on large isolated VMs by ensuring pal_uring io_uring worker threads explicitly quiesce the global minircu domain immediately before blocking in io_uring_enter (via submit_and_wait). This reduces unnecessary process-wide membarrier(PRIVATE_EXPEDITED) broadcasts triggered by synchronize_blocking() writers when idle worker threads remain registered as non-quiesced readers.
Changes:
- Quiesce the global `minircu` domain in the io_uring worker loop just before blocking on `submit_and_wait`.
- Add `minircu` as a Linux-only dependency of `pal_uring` (and update `Cargo.lock` accordingly).
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| support/pal/pal_uring/src/threadpool.rs | Quiesces the global RCU domain before blocking in the worker loop to avoid unnecessary membarrier broadcasts from writers. |
| support/pal/pal_uring/Cargo.toml | Adds minircu as a target_os = "linux" dependency needed for the new quiesce call. |
| Cargo.lock | Records the new pal_uring -> minircu dependency edge. |
Summary
Fix recurring `rcu_preempt self-detected stall` failures in OpenHCL on large isolated VMs (most visible on the `x64-windows-amd-snp` CI runner, e.g. the `memory_validation_debug_very_heavy` test) by having io-uring worker threads quiesce the global minircu domain immediately before blocking in `io_uring_enter`.
Root cause
Every `pal_uring` threadpool worker (the threads named `tp`) registers itself as a non-quiesced reader in the global minircu domain the first time it polls a future that enters an RCU read-side critical section, which happens as soon as a worker touches anything in `guestmem`. After that point the worker stays registered for the lifetime of the thread.
The only production caller of `minircu::global().quiesce()` is the VTL2 VP loop in `openhcl/virt_mshv_vtl/src/processor/mod.rs`. Threadpool workers never quiesced, so every `guestmem::rcu().synchronize_blocking()` writer in `openhcl/underhill_mem/src/lib.rs` (five hot call sites: page visibility / VTL protection updates) was forced to broadcast `membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED)` to every CPU.
On a 64-VP SNP CVM with most VPs running VTL0 guest code, the IPI broadcast can stall for tens of seconds waiting for VPs to context-switch, long enough to trigger:

    rcu: INFO: rcu_preempt self-detected stall on CPU
    ...
    smp_call_function_many_cond+0x113/0x300
    __x64_sys_membarrier+0x27c/0x360

inside one of the `tp` workers, after which dependent tests eventually hit their nextest timeout. This was observed e.g. on PR #3487 CI in run `25867445619` on `x64-windows-amd-snp`.
This is essentially the same class of symptom as #2334 (TDX AP-start stall) but with a different trigger.
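To make the root cause concrete, here is a minimal single-threaded model of the writer's view. It is a sketch only and not minircu's implementation: `ToyDomain`, `ReaderState`, and every method name are hypothetical, and the "broadcast" is just a counter standing in for the membarrier call. It shows that once a worker has entered a critical section and never quiesces, every later writer has to take the broadcast path.

```rust
/// Hypothetical model of a reader registry. All names are invented for
/// illustration; this does not mirror minircu's real data structures.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum ReaderState {
    /// Never entered a read-side critical section, or has quiesced since.
    Quiesced,
    /// Entered a critical section at some point and has not quiesced since.
    Active,
}

#[derive(Default)]
pub struct ToyDomain {
    readers: Vec<ReaderState>,
    /// How many times a writer had to fall back to a broadcast barrier.
    pub broadcasts: usize,
}

impl ToyDomain {
    /// Registers a worker thread and returns its index.
    pub fn register(&mut self) -> usize {
        self.readers.push(ReaderState::Quiesced);
        self.readers.len() - 1
    }

    /// Models a worker polling a future that touches RCU-protected data.
    pub fn enter_critical_section(&mut self, id: usize) {
        self.readers[id] = ReaderState::Active;
    }

    /// Models the worker announcing that it holds no RCU references.
    pub fn quiesce(&mut self, id: usize) {
        self.readers[id] = ReaderState::Quiesced;
    }

    /// Models a writer synchronizing with readers. Returns true if it had
    /// to broadcast because at least one reader was not quiesced.
    pub fn synchronize(&mut self) -> bool {
        let need_broadcast = self
            .readers
            .iter()
            .any(|state| *state == ReaderState::Active);
        if need_broadcast {
            self.broadcasts += 1;
        }
        need_broadcast
    }
}
```

In this model a worker that called `enter_critical_section` once and then went idle makes every subsequent `synchronize` return true, which is the situation described above for the `tp` threads.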
Fix
minircu's own docs (`support/minircu/src/lib.rs`) prescribe the fix:

    // For best performance, ensure all threads in your process call
    // 'quiesce' when a thread is going to sleep or block.

Call `minircu::global().quiesce()` in the io-uring worker loop just before `submit_and_wait`. Once quiesced, the worker is invisible to membarrier broadcasts until it next enters a critical section, at which point `ThreadData::enter_slow` already emits a `fence(SeqCst)` to publish the transition, so correctness is preserved. This mirrors the existing quiesce call in the VTL2 VP loop.
Cost: a single TLS load + relaxed atomic store per idle cycle on the worker. No-op for threads that have never entered a critical section.
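The shape of the fix can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in `threadpool.rs`: `ACTIVE`, `enter`, `quiesce`, `is_active`, and `worker_loop` are hypothetical names, the flag is purely thread-local (a real domain would let writers observe it), and the `block` callback stands in for the blocking `submit_and_wait` call.

```rust
use std::sync::atomic::{fence, AtomicBool, Ordering};

thread_local! {
    // Hypothetical per-thread "may hold RCU references" flag. A real
    // implementation would make this visible to writers; here it is
    // thread-local only, to keep the sketch small.
    static ACTIVE: AtomicBool = AtomicBool::new(false);
}

/// Toy version of entering a read-side critical section.
pub fn enter() {
    ACTIVE.with(|active| {
        if !active.load(Ordering::Relaxed) {
            // Slow path: publish the quiesced -> active transition.
            active.store(true, Ordering::Relaxed);
            fence(Ordering::SeqCst);
        }
    });
}

/// Toy quiesce: one thread-local access plus one relaxed store.
pub fn quiesce() {
    ACTIVE.with(|active| active.store(false, Ordering::Relaxed));
}

/// Reports whether this thread currently counts as a non-quiesced reader.
pub fn is_active() -> bool {
    ACTIVE.with(|active| active.load(Ordering::Relaxed))
}

/// Toy worker loop. Each entry in `batches` is the number of tasks polled
/// after one wakeup; `block` stands in for the blocking wait and receives
/// the thread's reader state at the moment it would block.
pub fn worker_loop(batches: &[u32], mut block: impl FnMut(bool)) {
    for &tasks in batches {
        for _ in 0..tasks {
            // Polling a future that touches RCU-protected data.
            enter();
        }
        // The fix: quiesce immediately before blocking.
        quiesce();
        block(is_active());
    }
}
```

Calling `worker_loop(&[2, 0, 1], |active| seen.push(active))` with a `Vec<bool>` named `seen` records `false` at every blocking point, i.e. the worker is never a non-quiesced reader while it sleeps. The ordering is the design point: quiescing after the wait instead would leave the thread marked active for the whole time it is blocked.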