Skip to content
Open
Show file tree
Hide file tree
Changes from 27 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 0 additions & 2 deletions Cargo.lock
Original file line number Diff line number Diff line change
Expand Up @@ -5427,7 +5427,6 @@ dependencies = [
"input_core",
"inspect",
"inspect_proto",
"memory_range",
"mesh",
"mesh_process",
"mesh_rpc",
Expand Down Expand Up @@ -6019,7 +6018,6 @@ dependencies = [
"kmsg",
"libtest-mimic",
"linkme",
"memory_range",
"mesh",
"mesh_process",
"mesh_worker",
Expand Down
1 change: 1 addition & 0 deletions Guide/src/SUMMARY.md
Original file line number Diff line number Diff line change
Expand Up @@ -127,6 +127,7 @@
- [Consomme](./reference/backends/consomme.md)
- [Architecture](./reference/architecture.md)
- [OpenVMM Architecture](./reference/architecture/openvmm.md)
- [Memory Layout](./reference/architecture/openvmm/memory-layout.md)
- [mesh](./reference/architecture/openvmm/mesh.md)
- [Using mesh](./reference/architecture/openvmm/mesh/usage.md)
- [How mesh works](./reference/architecture/openvmm/mesh/internals.md)
Expand Down
278 changes: 278 additions & 0 deletions Guide/src/reference/architecture/openvmm/memory-layout.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,278 @@
# Memory Layout

OpenVMM has to decide where every byte of guest physical address space goes:
RAM, MMIO windows for emulated and PCIe devices, paravisor private memory, and
architectural ranges like the LAPIC or GIC. This page describes how those
decisions are made.

```admonish warning title="Compatibility surface"
Guest physical addresses are part of the VM's compatibility contract. Guests
remember device and RAM locations across hibernation, and saved VM state
references them. Changing request order, placement class, or alignment can
move guest addresses and break resume on existing VMs.

Treat layout policy changes like VM ABI changes: a new default may be fine
for new VMs, but existing persisted configuration must continue to resolve
to the same guest physical addresses.
```

## Two pieces

Layout resolution is split into two pieces that you should think about
separately:

1. A **pure address-space allocator** in `vm_topology::layout`. It knows
nothing about chipsets, firmware, VTLs, PCI, or the host. Callers describe
what they need in terms of ranges, sizes, alignments, and a placement
class, and the allocator returns deterministic guest physical addresses.
2. A **worker resolver** in `openvmm_core::worker::memory_layout`. This is
where OpenVMM's policy lives: which platform ranges are pinned, what
alignments NUMA nodes get, how PCIe ECAM is sized, and so on. The resolver
describes the VM to the allocator, runs it, and builds the resulting
[`MemoryLayout`](https://openvmm.dev/rustdoc/linux/vm_topology/memory/struct.MemoryLayout.html)
that the rest of the VM worker uses to look up RAM, MMIO, PCI ECAM, and
PCI MMIO ranges.

Keeping the allocator policy-free means its behavior can be exhaustively
tested in isolation, and the worker can be reasoned about as a list of
requests that fully describes the VM.

## The allocator

[`LayoutBuilder`](https://openvmm.dev/rustdoc/linux/vm_topology/layout/struct.LayoutBuilder.html)
accepts four kinds of input:

| Input | Purpose |
|---|---|
| `reserve(tag, range)` | Block allocation at this address but do not include it in the layout top. |
| `fixed(tag, range)` | A range whose address is already decided. Blocks allocation and counts as part of the layout. |
| `ram(tag, target, size, alignment)` | Ordinary guest RAM. The only request type that may be split across multiple extents. |
| `request(tag, target, size, alignment, placement)` | A single contiguous range, placed dynamically. The `placement` chooses one of three phases below. |

`reserve` and `fixed` differ only in how they affect the **layout top** —
the address one past the highest guest-visible byte. `fixed` ranges raise
it; `reserve` ranges do not. This matters because the layout top determines
where post-MMIO requests (such as paravisor private memory) start: a
reserved hole high up in the address space should not push them even
higher.

When `allocate()` runs, it processes requests in a fixed phase order. Each
phase pulls from whatever address space the earlier phases left free:

1. **Reserved ranges** are removed from the free space.
2. **Fixed ranges** are removed from the free space.
3. **`Placement::Mmio32`** requests are packed *top down* below 4 GiB, so
RAM can start at GPA 0 and grow upward through the lowest free space.
4. **RAM** requests are placed *bottom up*, in caller order, splitting
around any holes left by the earlier phases. The first request starts at
GPA 0; each subsequent request starts at or above the highest address
used by previous RAM requests, so later requests never backfill
fragments earlier ones skipped. RAM is the only splittable kind.
5. **`Placement::Mmio64`** requests are packed *bottom up* starting at the
end of RAM. This makes the layout top a function of requested topology
rather than a precomputed high MMIO bucket size.
6. **`Placement::PostMmio`** requests are placed *after* everything else
(excluding reserved ranges from the "everything else"). They are for
ranges that should not affect the guest-visible top of memory.

Within `Mmio32` and `Mmio64`, requests are sorted by alignment (largest
first), then size (largest first), then caller order. This keeps large,
strictly-aligned device windows from being fragmented by small devices.
RAM and `PostMmio` use caller order verbatim: RAM order is the NUMA vnode
assignment, and `PostMmio` carries policy that should not be reordered by
alignment.

```admonish note
The allocator does not take host physical-address width as an input. The
layout is computed as a pure function of VM configuration; the worker
checks the resulting layout top against host capabilities after the fact.
This keeps guest physical addresses from shifting when the same VM moves
to a host with a different physical-address width.
```

## Worker policy

The worker resolver in
[`openvmm_core::worker::memory_layout`](https://github.com/microsoft/openvmm/blob/main/openvmm/openvmm_core/src/worker/memory_layout.rs)
issues requests in this order:

1. **Architectural reserved zone.** A `reserve` request for the
per-architecture range containing LAPIC, IOAPIC, GIC, PL011, battery,
TPM, and similar fixed-address platform devices.

| Architecture | Range |
|---|---|
| x86_64 | `0xFE00_0000..0x1_0000_0000` |
| aarch64 | `0xEF00_0000..0x1_0000_0000` |

2. **Chipset low MMIO** (`Mmio32`) — the VMOD/PCI0 `_CRS` low range for
VMBus relay devices and PIIX4 PCI BARs. 2 MB alignment.
3. **Chipset high MMIO** (`Mmio64`) — the corresponding high range. 2 MB
alignment.
4. **PCIe root complex ranges**, one per root complex:
- **ECAM** (`Mmio32`). The size is derived from the bus window as
`(end_bus - start_bus + 1) * 1 MB` (32 devices × 8 functions ×
4 KiB per config space).
- **Low MMIO** (`Mmio32`), 2 MB aligned. A caller can pin this to a
fixed range instead of supplying a size, for assigned-device, IOMMU,
and physical-topology passthrough.
- **High MMIO** (`Mmio64`), 1 GB aligned. A caller can pin this to a
fixed range as well. Per-BAR alignment would guarantee the entire
window is usable for one large BAR, but burns address space on
hosts with tight physical-address widths.
5. **Virtio-mmio slots** (`Mmio32`) — one contiguous region sized
`slot_count * 4 KiB`, when any slots are configured.
6. **RAM**, in vnode order. The first request becomes vnode 0, the second
vnode 1, and so on. Each vnode starts at or above the highest address
used by prior vnodes; vnode N+1 never backfills a fragment that vnode
N skipped. This keeps vnode ordering equal to address ordering and
turns vnode layout into a clean compatibility surface — adding a new
fixed or reserved range below RAM end can only shift the first vnode
whose own span actually covers it. Alignment depends on request size:

| RAM request size | Alignment |
|---|---|
| < 1 GB | 2 MB |
| ≥ 1 GB | 1 GB |

Sub-GB nodes use 2 MB so small NUMA nodes do not waste a full GB of
address space.
7. **VTL2 chipset MMIO** (`PostMmio`) — VTL2's own VMBus / chipset MMIO
region, when VTL2 is configured. Placed after VTL0 so enabling VTL2
does not move any VTL0 address.
8. **VTL2 private memory** (`PostMmio`) — when the IGVM file requests
layout-mode VTL2 memory, the worker takes only its size and alignment
from the IGVM relocation header. The IGVM file's relocation min/max
bounds are not fed in as constraints here; they are validated later by
the IGVM loader against the selected base. Treating them as constraints
here would over-constrain layout and could put holes in VTL0 just to
accommodate an IGVM file we will reject anyway.

After `allocate()` succeeds, the worker collects the resolved ranges into
the `MemoryLayout`'s MMIO, PCI ECAM, and PCI MMIO gap vectors, then checks
`MemoryLayout::end_of_layout()` against the host's physical-address width.
Comment on lines +152 to +153

## RAM splitting

RAM is the only splittable request, and the splitter has one rule worth
calling out: the alignment passed in is also the **split granularity**.
When a single free range cannot hold the entire request, the part placed
in that range is rounded down to the request alignment before continuing.

The practical effect is that 1 GB-aligned RAM stays in 1 GB-aligned
chunks. A small fixed hole just above the 1 GB boundary will not cause a
"nearly 1 GB" RAM extent to be placed in the interrupted range; instead,
RAM resumes at the next 1 GB boundary.

## Examples

These examples use compact synthetic configurations. Each one is covered
by tests in `vm_topology::layout` or `openvmm_core::worker::memory_layout`.

### A fixed MMIO range splits RAM

4 GB of RAM with a 1 GB fixed MMIO range from 1 GB to 2 GB:

| Input | Range |
|---|---|
| RAM request | 4 GB |
| Fixed MMIO | `0x4000_0000..0x8000_0000` |

| Output | Range |
|---|---|
| RAM | `0x0000_0000..0x4000_0000` |
| MMIO | `0x4000_0000..0x8000_0000` |
| RAM | `0x8000_0000..0x1_4000_0000` |

Total RAM is still 4 GB — the fixed range is occupied address space, not
RAM.

### GB-aligned RAM stays GB-aligned

2 GB of RAM with a tiny fixed hole just above the 1 GB boundary should
not produce a sub-GB RAM fragment:

| Input | Range |
|---|---|
| RAM request | 2 GB, 1 GB alignment |
| Fixed MMIO | `0x4010_0000..0x4020_0000` |

| Output | Range |
|---|---|
| RAM | `0x0000_0000..0x4000_0000` |
| Fixed MMIO | `0x4010_0000..0x4020_0000` |
| RAM | `0x8000_0000..0xC000_0000` |

The splitter places one full 1 GB chunk, refuses to use the interrupted
sub-GB fragment, and resumes at the next 1 GB boundary.

### Small NUMA nodes use 2 MB alignment

Two 512 MB NUMA nodes:

| Input | Size |
|---|---|
| vnode 0 RAM | 512 MB |
| vnode 1 RAM | 512 MB |

| Output | Range |
|---|---|
| vnode 0 RAM | `0x0000_0000..0x2000_0000` |
| vnode 1 RAM | `0x2000_0000..0x4000_0000` |

With 1 GB alignment each node would burn a full GB of address space.
Request order is the vnode assignment, so swapping the requests swaps the
NUMA layout.

### VTL2 does not move VTL0

Starting from 2 GB of VTL0 RAM and a fixed 1 GB MMIO hole:

| VTL0 output | Range |
|---|---|
| RAM | `0x0000_0000..0x4000_0000` |
| MMIO | `0x4000_0000..0x8000_0000` |
| RAM | `0x8000_0000..0xC000_0000` |

Adding a 2 MB VTL2 private-memory request leaves the VTL0 layout
identical and places VTL2 after the VTL0-visible top:

| Private output | Range |
|---|---|
| VTL2 | `0xC000_0000..0xC020_0000` |

`MemoryLayout::end_of_layout()` reports the VTL0-visible top.
`MemoryLayout::vtl2_range()` reports the VTL2 range separately.

### Reserved holes do not raise the layout top

A reserved range blocks allocation but is not a guest-visible resource,
so it does not push later post-MMIO ranges higher:

| Input | Range |
|---|---|
| RAM request | 2 GB |
| Reserved hole | `0xFD_0000_0000..0xFD_4000_0000` |
| Post-MMIO request | 1 MB |

| Output | Range |
|---|---|
| RAM | `0x0000_0000..0x8000_0000` |
| Post-MMIO | `0x8000_0000..0x8010_0000` |

Trailing reserved ranges are omitted from the returned allocation list,
but a reserved range that sits between real allocations is reported so
callers can see the full occupied map.

## When to update this page

Update this page when any of these change:

- the allocator's phase order or any phase's placement direction
- the semantics of `reserve`, `fixed`, `ram`, or `request`
- the architectural reserved zones or their per-architecture addresses
- the worker's RAM alignment policy
- PCIe ECAM sizing or per-BAR alignment policy
- VTL2 chipset MMIO or VTL2 private-memory placement
- the host physical-address validation step
- `MemoryLayout::end_of_layout()` or `MemoryLayout::vtl2_range()` semantics
1 change: 1 addition & 0 deletions openhcl/underhill_core/src/worker.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2384,6 +2384,7 @@ async fn new_underhill_vm(
mut chipset_devices,
pci_chipset_devices,
capabilities,
..
} = chipset
.build()
.context("failed to build chipset configuration")?;
Expand Down
Loading
Loading