[wip] emulated SMMU #3458
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR adds support for an emulated SMMUv3 on aarch64 and updates PCIe MSI routing to support GICv3 ITS (device-id based routing) in addition to GICv2m.
Changes:
- Introduces SMMUv3 emulation (spec types + translation logic) and plumbs per-device bus-range identity to support ITS/SMMU requester/device ID composition.
- Adds ACPI IORT generation (and DT `iommu-map`) for PCIe interrupt/DMA remapping; adds MADT ITS entry and backend ITS capability detection (KVM).
- Updates MSI/irqfd plumbing to carry an optional device identity (`devid`) end-to-end.
Reviewed changes
Copilot reviewed 70 out of 71 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| vmm_core/vmotherboard/src/lib.rs | Re-exports PCIe bus-range identity type for consumers. |
| vmm_core/vmotherboard/src/chipset/builder/mod.rs | Threads optional PCIe device identity through builder registrations. |
| vmm_core/vmotherboard/src/chipset/backing/arc_mutex/services.rs | Extends PCIe registration API to accept optional device identity. |
| vmm_core/vmotherboard/src/chipset/backing/arc_mutex/pci.rs | Stores and forwards optional device identity during PCIe bus resolution. |
| vmm_core/vmotherboard/src/chipset/backing/arc_mutex/device.rs | Adds builder hook to attach PCIe bus-range identity to devices. |
| vmm_core/vmotherboard/src/base_chipset.rs | Forwards optional device identity into PCIe enumerator device attach. |
| vmm_core/virt_whp/src/synic.rs | Adapts SignalMsi signature to optional device identity. |
| vmm_core/virt_whp/src/lib.rs | Switches to topology-provided MSI controller config; advertises ITS support=false. |
| vmm_core/virt_whp/src/device.rs | Adapts SignalMsi signature to optional device identity. |
| vmm_core/virt_mshv/src/x86_64/mod.rs | Adapts SignalMsi signature to optional device identity. |
| vmm_core/virt_mshv/src/irqfd.rs | Adapts IrqFdRoute::enable signature to accept optional device identity. |
| vmm_core/virt_mshv/src/aarch64/mod.rs | Adapts MSI signaling for new SignalMsi API; advertises ITS support=false. |
| vmm_core/virt_kvm/src/lib.rs | Stores MSI controller config and ITS device FD; prepares KVM backend for ITS. |
| vmm_core/virt_kvm/src/gsi.rs | Plumbs optional devid into KVM irq routing builder path. |
| vmm_core/virt_kvm/src/arch/x86_64/mod.rs | Sets devid=None for x86 MSI routes; adapts SignalMsi signature. |
| vmm_core/virt_kvm/src/arch/aarch64/mod.rs | Probes ITS support, creates in-kernel ITS, adds ITS irqfd/MSI routing support. |
| vmm_core/virt_hvf/src/lib.rs | Advertises ITS support=false. |
| vmm_core/virt/src/x86/apic_software_device.rs | Adapts MSI forwarding to new SignalMsi API. |
| vmm_core/virt/src/generic.rs | Extends PlatformInfo with ITS capability and adapts SignalMsi signature. |
| vmm_core/virt/src/aarch64/gic_v2m.rs | Adapts SignalMsi signature to optional device identity. |
| vmm_core/virt/src/aarch64/gic_software_device.rs | Adapts SignalMsi signature to optional device identity. |
| vmm_core/src/device_builder.rs | Accepts per-device bus-range identity and passes into PCIe device builder. |
| vmm_core/src/acpi_builder.rs | Adds IORT construction + SMMU config, MADT ITS entries, and extensive tests. |
| vm/vmcore/vm_topology/src/processor/aarch64.rs | Replaces gic_v2m with gic_msi controller enum (None/V2m/Its). |
| vm/vmcore/src/irqfd.rs | Extends irqfd route enable API with optional device identity. |
| vm/kvm/src/lib.rs | Adds MSI route devid support and propagates flags into KVM irq routing. |
| vm/devices/virtio/virtio/src/transport/core.rs | Forces access_platform feature bit for virtio devices behind an IOMMU. |
| vm/devices/user_driver_emulated_mock/src/lib.rs | Updates MSI controller mock to ignore device identity. |
| vm/devices/storage/nvme_test/src/tests/test_helpers.rs | Updates MSI test helper to new SignalMsi signature. |
| vm/devices/storage/nvme/src/tests/test_helpers.rs | Updates MSI test helper to new SignalMsi signature. |
| vm/devices/pci/vpci/src/test_helpers/mod.rs | Updates MSI test helper to new SignalMsi signature. |
| vm/devices/pci/pcie/src/switch.rs | Uses port-side-effecting cfg write path; plumbs optional bus-range identity. |
| vm/devices/pci/pcie/src/root.rs | Plumbs optional bus-range identity, ensures port tracks bus-range on cfg writes, adds tests. |
| vm/devices/pci/pcie/src/port.rs | Adds shared assigned-bus-range tracking and cfg-write side effects. |
| vm/devices/pci/pcie/src/lib.rs | Exposes new bus_range + its modules. |
| vm/devices/pci/pcie/src/its.rs | Adds ITS wrappers for SignalMsi and IrqFd that inject device IDs. |
| vm/devices/pci/pcie/src/bus_range.rs | Adds shared atomic bus-range tracking and device/stream ID composition helpers. |
| vm/devices/pci/pcie/fuzz/fuzz_pcie.rs | Updates fuzz harness for new PCIe add-device signature. |
| vm/devices/pci/pcie/Cargo.toml | Adds pal_event dependency for irqfd route wrapper event access. |
| vm/devices/pci/pci_core/src/test_helpers/mod.rs | Updates MSI test helper to new SignalMsi signature. |
| vm/devices/pci/pci_core/src/msi.rs | Updates SignalMsi API; adds route/target helpers to pass optional device identity. |
| vm/devices/pci/pci_core/src/capabilities/msix.rs | Updates MSI-X delivery to new MsiTarget API. |
| vm/devices/iommu/smmu/src/translate.rs | Adds SMMUv3 STE/CD lookup and stage-1 page table walker + tests. |
| vm/devices/iommu/smmu/src/spec/ste.rs | Adds SMMUv3 STE layout/types + tests. |
| vm/devices/iommu/smmu/src/spec/registers.rs | Adds SMMUv3 register offsets/bitfields + tests. |
| vm/devices/iommu/smmu/src/spec/pt.rs | Adds AArch64 stage-1 page table descriptor helpers + tests. |
| vm/devices/iommu/smmu/src/spec/mod.rs | Exposes SMMU spec modules. |
| vm/devices/iommu/smmu/src/spec/events.rs | Adds SMMU event queue entry types + constructors + tests. |
| vm/devices/iommu/smmu/src/spec/commands.rs | Adds SMMU command queue entry types + helpers + tests. |
| vm/devices/iommu/smmu/src/spec/cd.rs | Adds SMMU context descriptor layout/types + tests. |
| vm/devices/iommu/smmu/src/lib.rs | Introduces new smmu crate module surface. |
| vm/devices/iommu/smmu/Cargo.toml | Adds new smmu crate definition + dependencies. |
| vm/acpi_spec/src/madt.rs | Adds MADT GIC ITS structure support. |
| vm/acpi_spec/src/lib.rs | Exposes new ACPI IORT module. |
| vm/acpi_spec/src/iort.rs | Adds IORT node/mapping structures used by ACPI builder. |
| tmk/tmk_vmm/src/run.rs | Updates aarch64 platform config to use gic_msi. |
| openvmm/openvmm_entry/src/lib.rs | Adds CLI/config wiring for GIC MSI controller selection and SMMU instances. |
| openvmm/openvmm_entry/src/cli_args.rs | Adds --gic-msi and --smmu CLI flags for aarch64. |
| openvmm/openvmm_defs/src/config.rs | Adds defaults for ITS/SMMU MMIO layout and SMMU/GIC MSI config structs. |
| openvmm/openvmm_core/src/worker/vm_loaders/linux.rs | Builds DT with ITS and SMMU nodes + iommu-map; passes SMMU configs. |
| openvmm/openvmm_core/src/worker/dispatch.rs | Selects ITS vs v2m, instantiates SMMU devices, wraps per-device MSI/irqfd/memory. |
| openvmm/openvmm_core/Cargo.toml | Adds smmu dependency to OpenVMM core. |
| openhcl/virt_mshv_vtl/src/lib.rs | Updates SignalMsi implementation signature. |
| openhcl/underhill_core/src/loader/mod.rs | Extends loader config to include (placeholder) SMMU base field. |
| openhcl/bootloader_fdt_parser/src/lib.rs | Updates parsed platform config to use gic_msi. |
| Guide/src/reference/emulated/pcie/overview.md | Documents aarch64 MSI routing via ITS vs v2m and the new CLI flag. |
| Guide/src/reference/devices/firmware/linux_direct.md | Updates docs to mention ITS/IORT in ACPI mode for PCIe routing. |
| Cargo.toml | Adds new workspace crate smmu. |
Comments suppressed due to low confidence (4)
vmm_core/src/acpi_builder.rs:1
- The IORT RC mapping logic uses a global `rc_mapping_count` and defaults an unmapped RC to `its_group_offset` even when there is no ITS. If `has_smmu == true` and `has_its == false` (and not every RC is covered by an SMMU), RCs without an SMMU will incorrectly map to offset `IORT_NODE_OFFSET` (which will be the first SMMU node), effectively claiming they are behind the wrong SMMU. Fix by computing the mapping count and target per root complex: emit an RC ID mapping only if that RC has an SMMU offset, or if an ITS is actually present; otherwise set that RC node's `mapping_count` to 0 and append no `IortIdMapping` entry.
vmm_core/src/acpi_builder.rs:1
- The IORT RC mapping logic uses a global `rc_mapping_count` and defaults an unmapped RC to `its_group_offset` even when there is no ITS. If `has_smmu == true` and `has_its == false` (and not every RC is covered by an SMMU), RCs without an SMMU will incorrectly map to offset `IORT_NODE_OFFSET` (which will be the first SMMU node), effectively claiming they are behind the wrong SMMU. Fix by computing the mapping count and target per root complex: emit an RC ID mapping only if that RC has an SMMU offset, or if an ITS is actually present; otherwise set that RC node's `mapping_count` to 0 and append no `IortIdMapping` entry.
vmm_core/src/acpi_builder.rs:1
- The test suite exercises IORT generation with ITS and with SMMU+ITS, but doesn't cover the important configuration where `has_smmu == true` and `has_its == false` (including the case where only a subset of RCs are covered by SMMUs). Adding tests for "SMMU without ITS" and "partial RC coverage" would catch incorrect RC mapping counts/targets (and would have exposed the current incorrect `unwrap_or(its_group_offset)` fallback when no ITS exists).
vm/devices/pci/pci_core/src/capabilities/msix.rs:217
- With the new optional `devid` plumbing intended for ITS routing, this MSI-X delivery path always signals with `devid=None`, which prevents identifying the correct PCI function for multi-function devices (where ITS device ID must include the function number). If multi-function endpoints are in scope for ITS mode, consider extending the MSI-X interrupt target state to carry the function's BDF (or RID) and signaling with `signal_msi_with_rid(...)` (or passing `Some(bdf)` down to the ITS wrapper) so the composed ITS device ID is accurate.
fn deliver(&self) {
    let mut state = self.0.lock();
    if state.enabled {
        state.target.signal_msi(state.address, state.data);
    } else {
        state.pending = true;
    }
}
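The per-root-complex rule suggested in the first suppressed comment can be sketched as below. This is a minimal illustration, not the code in `acpi_builder.rs`: the `RcMappingTarget` enum and both function names are hypothetical, and the offsets are assumed to be plain `u32` byte offsets into the IORT.

```rust
/// Where a root complex's single IORT ID mapping should point.
/// Hypothetical type for illustration; not taken from acpi_builder.rs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RcMappingTarget {
    /// The RC sits behind an SMMU: map to that SMMU node's offset.
    Smmu(u32),
    /// No SMMU for this RC, but an ITS exists: map to the ITS group node.
    ItsGroup(u32),
    /// Neither is available: emit no ID mapping at all.
    Unmapped,
}

/// Decide the mapping target for one root complex.
/// `smmu_offset` is `Some` only if this RC is covered by an SMMU;
/// `its_group_offset` is `Some` only if an ITS is actually present.
fn rc_mapping_target(smmu_offset: Option<u32>, its_group_offset: Option<u32>) -> RcMappingTarget {
    match (smmu_offset, its_group_offset) {
        (Some(off), _) => RcMappingTarget::Smmu(off),
        (None, Some(off)) => RcMappingTarget::ItsGroup(off),
        (None, None) => RcMappingTarget::Unmapped,
    }
}

/// The per-RC mapping count that goes into the RC node header.
fn rc_mapping_count(target: RcMappingTarget) -> u32 {
    match target {
        RcMappingTarget::Unmapped => 0,
        _ => 1,
    }
}
```

Making the ITS offset an `Option` removes the opportunity for an `unwrap_or(its_group_offset)` style fallback to point at an unrelated node when no ITS exists.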
// through its SMMU instance.
node = node.add_u32_array(p_iommu_map, &[0, *phandle, 0, 0x10000])?;
fn compute_start_level(tg0: Tg0, t0sz: u8) -> Option<(u8, u8)> {
    let va_bits = 64u8.checked_sub(t0sz)?;
    let bits_per_level = tg0.bits_per_level()?;
    let page_shift = tg0.page_shift()?;

    // Number of address bits resolved by the page table walk (excluding page
    // offset). For 4K/9 bits per level: va_bits - 12 bits are resolved by
    // the walk.
    let resolve_bits = va_bits.checked_sub(page_shift)?;

    // Number of full levels needed = ceil(resolve_bits / bits_per_level).
    // Start level = 4 - num_levels (levels are numbered 0..3).
    let num_levels = resolve_bits.div_ceil(bits_per_level);
    if num_levels > 4 {
        return None;
    }
    let start_level = 4 - num_levels;

    Some((start_level, va_bits))
}
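One edge case in the function above: when `va_bits` equals the page shift, `resolve_bits` is 0, `num_levels` is 0, and the returned start level is 4, which is outside the 0..=3 range. A minimal sketch of a guarded variant follows. The `Tg0` enum here is a hypothetical stand-in for the real type (only its two helper methods are modeled), and `compute_start_level_checked` is an illustrative name, not the PR's.

```rust
/// Hypothetical stand-in for the translation granule type.
#[derive(Debug, Clone, Copy)]
enum Tg0 {
    K4,
    K16,
    K64,
}

impl Tg0 {
    /// log2 of the page size for this granule.
    fn page_shift(self) -> Option<u8> {
        Some(match self {
            Tg0::K4 => 12,
            Tg0::K16 => 14,
            Tg0::K64 => 16,
        })
    }

    /// Each table holds 8-byte descriptors, so it indexes page_shift - 3 bits.
    fn bits_per_level(self) -> Option<u8> {
        self.page_shift().map(|s| s - 3)
    }
}

/// Like the function above, but rejects configurations that would need
/// zero levels, so the start level is always in 0..=3.
fn compute_start_level_checked(tg0: Tg0, t0sz: u8) -> Option<(u8, u8)> {
    let va_bits = 64u8.checked_sub(t0sz)?;
    let bits_per_level = tg0.bits_per_level()?;
    let page_shift = tg0.page_shift()?;

    let resolve_bits = va_bits.checked_sub(page_shift)?;
    if resolve_bits == 0 {
        // Nothing for the walk to resolve: treat as an invalid T0SZ.
        return None;
    }

    let num_levels = resolve_bits.div_ceil(bits_per_level);
    if num_levels > 4 {
        return None;
    }
    Some((4 - num_levels, va_bits))
}
```

For example, a 4KB granule with `t0sz = 16` gives 48 VA bits and start level 0, while `t0sz = 52` leaves no bits to resolve and is rejected instead of yielding level 4.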
if state.pending {
    state.target.signal_msi(0, address, data);
    state.target.signal_msi(address, data);
    state.pending = false;
}
This PR modifies files containing `unsafe` Rust code. For more on why we check whole files, instead of just diffs, check out the Rustonomicon.
The previous MSI architecture required each ITS wrapper to carry its own AssignedBusRange and perform BDF resolution internally, and the MsiConnection had to be constructed with an IrqFd upfront. This made it impossible to wire MSI for PCIe switch downstream ports (they did not have access to the right bus range or signal target at construction time), breaking hotplug on switches and creating a tangle of push-based state synchronization (set_rid, write_cfg, sync_msi_rid) in the port layer.

This change restructures the MSI model around two principles:

1. Lazy BDF resolution: MsiConnection::new(bus_range, devfn) takes a bus range at construction. When a device signals an MSI with devid = None, the MsiTarget resolves the BDF from the bus range's current secondary bus and the configured devfn. This means the guest can reprogram bus numbers and MSI delivery automatically picks up the new values; no push-based synchronization is needed.
2. Late-bind connect: Both SignalMsi and IrqFd are connected after construction via connect() and connect_irqfd(), not passed at creation time. This separates device resolution (which needs the target) from interrupt wiring (which needs platform knowledge like whether ITS is active), and allows the same pattern for all device types.

The ITS wrappers (ItsSignalMsi, ItsIrqFd) are simplified to pure segment prependers: they just compose (segment << 16) | bdf from the already-resolved BDF. They no longer carry bus ranges or perform range validation.

For the switch, GenericPcieSwitchDefinition now takes an MsiTarget instead of an MsiConnection. The switch uses MsiTarget::with_bus_range to re-derive targets using the upstream port's bus range, then with_devfn for each downstream port. This means switch downstream ports get properly wired MSI targets that share the parent connection's SignalMsi and IrqFd, which fixes hotplug on switches.
The resolve_and_add_pci_device helper is simplified to take &MsiTarget directly, with callers owning the MsiConnection and handling connect calls themselves.
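The lazy-resolution and segment-prepending ideas described above can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `BusRange`, `Target`, `resolve_bdf`, and `its_device_id` are hypothetical names, the BDF is assumed to be a 16-bit `(bus << 8) | devfn` value, and the segment is assumed to fit in 16 bits.

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a shared bus range that the guest can reprogram.
struct BusRange {
    secondary: AtomicU8,
}

impl BusRange {
    fn new() -> Arc<Self> {
        Arc::new(Self { secondary: AtomicU8::new(0) })
    }

    /// Called when the guest writes a new secondary bus number.
    fn set_secondary(&self, bus: u8) {
        self.secondary.store(bus, Ordering::Relaxed);
    }

    fn secondary(&self) -> u8 {
        self.secondary.load(Ordering::Relaxed)
    }
}

/// Hypothetical MSI target: holds the shared bus range plus a fixed devfn.
struct Target {
    bus_range: Arc<BusRange>,
    devfn: u8,
}

impl Target {
    /// Use the caller's devid if given; otherwise read the *current*
    /// secondary bus at signal time and combine it with devfn.
    fn resolve_bdf(&self, devid: Option<u16>) -> u16 {
        devid.unwrap_or_else(|| {
            (u16::from(self.bus_range.secondary()) << 8) | u16::from(self.devfn)
        })
    }
}

/// ITS wrapper step: prepend the PCI segment to an already-resolved BDF.
fn its_device_id(segment: u16, bdf: u16) -> u32 {
    (u32::from(segment) << 16) | u32::from(bdf)
}
```

Because the bus number is read at signal time, reprogramming the bus range from 3 to 5 changes the resolved BDF of device 1, function 0 (`devfn = 0x08`) from `0x0308` to `0x0508` without any explicit notification to the target.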
Pull request overview
Copilot reviewed 50 out of 51 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (5)
vm/devices/pci/pci_core/src/msi.rs:1
- `devid = None` is always resolved into a concrete BDF and forwarded as `Some(resolved)`. This breaks callers/wrappers that rely on `None` meaning "identity not yet available" (e.g. your updated `pcie::its::ItsSignalMsi` explicitly drops MSIs when `devid` is `None`). With the current code, devices will signal with bus=0 before secondary bus assignment, potentially producing incorrect device IDs and hard-to-debug interrupt routing. Consider preserving `None` semantics by (a) passing through `None` when the default bus-range is unassigned (e.g. secondary bus == 0), or (b) forwarding `devid` unchanged and moving 'default BDF' resolution to callers that truly want it.
vm/devices/iommu/smmu/src/translate.rs:1
- `compute_start_level` can return `start_level = 4` when `resolve_bits == 0` (e.g., VA bits equal to granule page shift). `walk_s1` then uses `level = start_level` and computes `shift = page_shift + (3 - level) * bits_per_level`, which underflows for level=4 and yields a huge shift amount. This is a concrete correctness bug and can lead to incorrect indexing or panics. Fix by rejecting configurations where `resolve_bits == 0` (or any invalid `t0sz` range for the selected granule) so `start_level` is always in 0..=3, and/or by using checked arithmetic in the shift computation and returning an `F_TRANSLATION` fault when parameters are invalid.
// Copyright (c) Microsoft Corporation.
vmm_core/src/device_builder.rs:1
- `build_vpci_device` now creates an `MsiConnection` locally and passes only `msi_conn.target()` into device resolution, but the `MsiConnection` is dropped at the end of the function and there is no subsequent `connect(...)`/`connect_irqfd(...)`. That means the vPCI device's MSI target will remain disconnected and cannot be wired up later. To fix, either (1) accept an `msi_target: &MsiTarget` (similar to `build_pcie_device`) from the caller that owns/keeps the `MsiConnection`, or (2) restore returning the `MsiConnection` so the caller can connect it after building.
vmm_tests/vmm_tests/tests/tests/multiarch/pcie.rs:1
- This writes directly to the first `nvme*` block device detected. On some guests that can be the boot/root disk or otherwise mounted, making the test destructive/flaky (and potentially corrupting the VM state before shutdown). Recommend selecting the specific NVMe behind the SMMU root complex via a stable path (e.g., `/dev/disk/by-path` for the segment/port), and additionally filtering out any device that backs `/` (e.g., compare against `findmnt -n -o SOURCE /` / `lsblk -no NAME,MOUNTPOINT`).
vmm_core/src/acpi_builder.rs:1
- The updated comment for `id_count` no longer matches the earlier note in this code path that referenced the IORT spec's 'minus 1' behavior. If `IortIdMapping::new` expects the IORT-defined encoding (commonly "number of IDs minus 1"), then `0xFFFF` represents 0x10000 IDs, but the current comment reads like it represents exactly 0xFFFF IDs. Please align the comment with the actual IORT field semantics (and/or rename the constructor parameter) to avoid future off-by-one regressions.
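The off-by-one concern in the last comment can be made concrete with a pair of helpers. This sketch assumes the field uses the "number of IDs minus 1" encoding that the comment describes; `encode_id_count` and `decode_id_count` are hypothetical names and are not part of `acpi_spec` or `acpi_builder.rs`.

```rust
/// Encode a count of IDs into the "number of IDs minus 1" field form.
/// Returns `None` for an empty range, which this encoding cannot express.
fn encode_id_count(num_ids: u32) -> Option<u32> {
    num_ids.checked_sub(1)
}

/// Decode the field back into an actual count. Widened to u64 so that a
/// field value of u32::MAX does not overflow.
fn decode_id_count(field: u32) -> u64 {
    u64::from(field) + 1
}
```

Under that assumption, a mapping covering every 16-bit requester ID (0x10000 IDs) is stored as `0xFFFF`. By contrast, the `0x10000` in the device tree `iommu-map` entry shown earlier is a plain length, so the two encodings differ by one for the same range.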
/// Returns the output address for a 16KB granule.
pub fn output_address_16k(&self, level: u8) -> u64 {
    let raw = self.addr_bits() << 12;
    match level {
        // L1 block: 32MB (bits [47:25]), but 16K L1 blocks are unusual
        1 => {
            if self.is_block() {
                raw & !((1u64 << 25) - 1)
            } else {
                raw
            }
        }
        // L2 block: 32MB (bits [47:25])
        2 => {
            if self.is_block() {
                raw & !((1u64 << 25) - 1)
            } else {
                raw
            }
        }
        3 => raw, // page address, 16KB aligned
        _ => raw,
    }
}
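The block sizes in the comments above can be cross-checked with the general formula used in the walker, `shift = page_shift + (3 - level) * bits_per_level`. The helpers below are a hypothetical sketch (`block_shift` and `align_down` are not names from the PR) and only model the arithmetic, not which levels may legally hold block descriptors.

```rust
/// log2 of the region covered by one descriptor at `level`, for a granule
/// with the given page shift (12, 14, or 16). Returns `None` for any other
/// granule or for a level outside 0..=3.
fn block_shift(page_shift: u32, level: u32) -> Option<u32> {
    if !matches!(page_shift, 12 | 14 | 16) || level > 3 {
        return None;
    }
    // 8-byte descriptors: each table indexes page_shift - 3 address bits.
    let bits_per_level = page_shift - 3;
    Some(page_shift + (3 - level) * bits_per_level)
}

/// Clear the low `shift` bits of `addr`. `shift` must be below 64.
fn align_down(addr: u64, shift: u32) -> u64 {
    addr & !((1u64 << shift) - 1)
}
```

For the 16KB granule this gives a shift of 25 at level 2 (32MB, matching the comment) and 36 at level 1, whereas the level-1 arm above masks with `1u64 << 25`. That difference may be worth checking against the architecture specification.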
if let Some(shared) = smmu_shared {
    let inner_msi =
        base_signal_msi.unwrap_or_else(|| partition.as_signal_msi(Vtl::Vtl0).unwrap());
    let (translating_gm, smmu_msi) =
        shared.create_device_context(bus_range.clone(), 0, gm, inner_msi);
    let irqfd =
        base_irqfd.map(|fd| shared.create_irqfd(0, fd) as Arc<dyn vmcore::irqfd::IrqFd>);
No description provided.