From 771892311aec042d98031eaa937cfbb7320c1c63 Mon Sep 17 00:00:00 2001
From: Steven Malis
Date: Thu, 14 May 2026 12:50:54 -0400
Subject: [PATCH] pal_uring: quiesce minircu before blocking in io_uring worker

Every io-uring worker thread registers itself as a non-quiesced minircu
reader the first time it polls a future that enters an RCU read-side
critical section (e.g. anything touching guestmem). Workers then stay
registered for the lifetime of the thread, so every
guestmem::rcu().synchronize_blocking() writer in underhill_mem (page
visibility / VTL protection changes) is forced to broadcast
membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED) to every CPU.

On large isolated VMs (observed on 64-VP SNP) that broadcast IPI can
stall long enough waiting for VPs running VTL0 to context-switch that
the openhcl kernel logs

    rcu: INFO: rcu_preempt self-detected stall on CPU
    ...
      smp_call_function_many_cond+0x113/0x300
      __x64_sys_membarrier+0x27c/0x360

from inside one of the 'tp' worker threads, and dependent tests
eventually hit their nextest timeout.

minircu's own docs prescribe the fix:

    // For best performance, ensure all threads in your process call
    // 'quiesce' when a thread is going to sleep or block.

Call minircu::global().quiesce() in the io-uring worker loop immediately
before submit_and_wait. Once quiesced, the worker is invisible to
membarrier broadcasts until it next enters a critical section, at which
point ThreadData::enter_slow already emits a fence(SeqCst) to publish
the transition, so correctness is preserved.

This mirrors the existing quiesce call in the VTL2 VP loop in
virt_mshv_vtl.
---
 Cargo.lock                              |  1 +
 support/pal/pal_uring/Cargo.toml        |  1 +
 support/pal/pal_uring/src/threadpool.rs | 21 +++++++++++++++++++++
 3 files changed, 23 insertions(+)

diff --git a/Cargo.lock b/Cargo.lock
index 9dedcf4ba4..112af73b14 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -5782,6 +5782,7 @@ dependencies = [
  "io-uring",
  "libc",
  "loan_cell",
+ "minircu",
  "once_cell",
  "pal",
  "pal_async",
diff --git a/support/pal/pal_uring/Cargo.toml b/support/pal/pal_uring/Cargo.toml
index 6c5a98694f..be3fb6af02 100644
--- a/support/pal/pal_uring/Cargo.toml
+++ b/support/pal/pal_uring/Cargo.toml
@@ -13,6 +13,7 @@ ci = []
 
 [target.'cfg(target_os = "linux")'.dependencies]
 inspect.workspace = true
 loan_cell.workspace = true
+minircu.workspace = true
 pal.workspace = true
 pal_async.workspace = true
diff --git a/support/pal/pal_uring/src/threadpool.rs b/support/pal/pal_uring/src/threadpool.rs
index 47bae717c9..cfb29f8f8e 100644
--- a/support/pal/pal_uring/src/threadpool.rs
+++ b/support/pal/pal_uring/src/threadpool.rs
@@ -279,6 +279,27 @@ impl Worker {
                             }
                         }
 
+                        // About to block in io_uring_enter. Mark this thread
+                        // as quiesced for the global RCU domain so that any
+                        // concurrent `synchronize_blocking()` writer (e.g.
+                        // `guestmem::rcu()` page-protection updates in
+                        // OpenHCL's `underhill_mem`) can complete without
+                        // issuing a process-wide `membarrier()` on our
+                        // behalf. Without this, every worker that has ever
+                        // polled a future containing a `guestmem` critical
+                        // section stays registered as a non-quiesced RCU
+                        // reader for the lifetime of the thread, forcing
+                        // each writer to broadcast `membarrier(PRIVATE_
+                        // EXPEDITED)` to every CPU. On large isolated VMs
+                        // (e.g. 64-VP) that broadcast can stall long
+                        // enough to trigger kernel `rcu_preempt self-
+                        // detected stall` warnings in
+                        // `smp_call_function_many_cond`. Re-entering a
+                        // critical section after this issues a local
+                        // memory barrier via `ThreadData::enter_slow`, so
+                        // correctness is preserved.
+                        minircu::global().quiesce();
+
                         state.worker.io_ring.submit_and_wait();
                     }
                 }
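
Reviewer note: the quiesce-before-block pattern this patch applies can be sketched in isolation. The model below is a toy written for illustration only, not minircu's implementation; `ReaderState`, `run`, `readers_needing_barrier`, and `simulate` are hypothetical names. It assumes the behaviour described in the commit message: a reader becomes visible to writers the first time it enters a critical section, stays visible until it calls `quiesce`, and pays a `fence(SeqCst)` only when it re-enters from the quiesced state.

```rust
use std::sync::atomic::{fence, AtomicBool, Ordering};

/// Toy per-thread reader state (hypothetical; not minircu's real type).
struct ReaderState {
    /// True once the thread has entered a read-side section and has not
    /// quiesced since. A writer must assume such a thread may be reading.
    active: AtomicBool,
}

impl ReaderState {
    fn new() -> Self {
        Self { active: AtomicBool::new(false) }
    }

    /// Run `f` as a read-side critical section.
    fn run<R>(&self, f: impl FnOnce() -> R) -> R {
        if !self.active.load(Ordering::Relaxed) {
            // Slow path: publish the quiesced -> active transition with a
            // full fence so a concurrent writer cannot miss it.
            self.active.store(true, Ordering::Relaxed);
            fence(Ordering::SeqCst);
        }
        f()
    }

    /// Declare that this thread holds no RCU-protected references and is
    /// about to sleep or block.
    fn quiesce(&self) {
        self.active.store(false, Ordering::Release);
    }
}

/// Toy writer-side check: how many readers would force the expensive,
/// process-wide barrier before the writer can proceed?
fn readers_needing_barrier(readers: &[ReaderState]) -> usize {
    readers
        .iter()
        .filter(|r| r.active.load(Ordering::SeqCst))
        .count()
}

/// Each worker polls one task (entering a critical section), then goes
/// idle. Returns how many workers a writer must still account for.
fn simulate(workers: usize, quiesce_before_block: bool) -> usize {
    let readers: Vec<ReaderState> = (0..workers).map(|_| ReaderState::new()).collect();
    for r in &readers {
        r.run(|| ()); // stands in for polling a future that touches guest memory
        if quiesce_before_block {
            r.quiesce(); // stands in for the call added before submit_and_wait
        }
    }
    readers_needing_barrier(&readers)
}

fn main() {
    // Without the quiesce call every idle worker still looks like a reader.
    assert_eq!(simulate(4, false), 4);
    // With it, idle workers drop out and the writer has nobody to wait for.
    assert_eq!(simulate(4, true), 0);
}
```

The sketch is single-threaded on purpose so that it is deterministic; it shows only the bookkeeping, not the memory-ordering argument, which in the real patch rests on the fence in `ThreadData::enter_slow` named in the commit message.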