diff --git a/CHANGELOG.md b/CHANGELOG.md
index 1be8c9d0..9a6e1798 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,10 +6,36 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [0.2.0-alpha] — Unreleased
 
-First slice of the v0.2 "Option C" beachhead: **Point-in-Time Recovery (PITR)**
-and **Change Data Capture (CDC)**, built additively on top of the existing
-per-shard WAL v3 + dual-root manifest. No changes to the KV hot path, MVCC,
-page format, or transaction layer.
+The v0.2 enterprise beachhead. Built additively on per-shard WAL v3 + the
+dual-root manifest; no changes to the KV hot path, MVCC, page format, or
+transaction layer.
+
+**Headline capabilities landed in alpha:**
+
+- **Point-in-Time Recovery (PITR)** — `--recovery-target-lsn` /
+  `--recovery-target-time` restore to any LSN or wall-clock boundary
+  inside the WAL retention window.
+- **Change Data Capture (CDC)** — `CDC.READ` polling command with
+  Debezium-compatible JSON envelopes, resumable cursors, segment-rotation
+  safety.
+- **Hash-field TTL** — full Valkey 9.0 / 9.1 surface (`HEXPIRE` /
+  `HPEXPIRE` / `HEXPIREAT` / `HPEXPIREAT` / `HEXPIRETIME` / `HPEXPIRETIME`
+  / `HTTL` / `HPTTL` / `HPERSIST` / `HGETDEL` / `HGETEX`) with O(1) HGET +
+  HLEN fast path. Three-way benchmark vs Redis 8.0.2 / Valkey 9.1.0
+  ships in `docs/perf/2026-05-27-hash-ttl-3way-bench.md`.
+- **Tier 2 Lane A** — `SWAPDB`, `MOVE`, `COPY ... DB n`,
+  `CLUSTER REPLICAS` / `SLAVES`, `CLUSTER COUNT-FAILURE-REPORTS`. All
+  WAL-durable with cross-shard atomic semantics.
+- **Storage format v1 commitment** — RDB v2 + WAL v3 + multi-part AOF
+  manifest grouped under a single `--storage-format v1` umbrella with
+  ≥18-month LTS forward-read guarantees.
+- **Embedded sharded server** — `server::embedded::run_embedded(config,
+  cancel)` exposes the full sharded handler (with `TXN.*`) to in-process
+  embedders.
+
+**What is not yet in alpha:** PITR live-snapshot LSN wiring (P3c),
+`CDC.SUBSCRIBE` push channel (C3b), and the multi-shard master PSYNC
+deferred from v0.1.10. Tracked in `.planning/rfcs/v02-enterprise-architecture.md`.
 
 ### Docs — Hash-field TTL three-way benchmark suite (PR #127)
 
@@ -427,6 +453,96 @@ See `docs/guides/cdc.md` for consumer integration.
   and the benchmark gates (PITR restart ±10%, CDC ≥100K events/s/shard,
   write p99 ±5%).
 
+## [0.1.12] — 2026-05-12
+
+Performance & memory observability release. 50 commits since v0.1.11, no
+public API breaks, no on-disk format change. Validated on OrbStack `moon-dev`
+(2026-05-12) and locally green for both `runtime-monoio` and
+`runtime-tokio,jemalloc`.
+
+### Performance — DashTable hot-path (Phase 189, PERF-07 + PERF-09)
+
+- **Pre-sized DashTable.** `DashTable::with_capacity()` plus the new
+  `--initial-keyspace-hint <N>` flag size the segment array up front so
+  steady-state operation hits zero `split_segment` calls. Pre-size
+  invariant test confirms zero splits at 1 M keys. The 27 % CPU spent
+  in `split_segment` during SET p=16 (PERF-07) is fully eliminated.
+- **`Database::set` rewrite.** New `DashTable::insert_or_update` /
+  `Segment::insert_or_update_at` single-probe helpers replace the
+  previous `find + remove + insert` triple-probe pattern.
+- **`Segment::find` fallback elimination + force-inlined SIMD.** The
+  cold "key spilled to non-home group" fallback path is removed once
+  `has_non_home_keys` is invariant-tracked on insert (the
+  `insert_or_update_at` change above already maintains the flag);
+  the SIMD probe helpers are `#[inline(always)]`. PERF-09
+  attributed 12.65 % of `Segment::find` self-time to the fallback;
+  remaining cost is the irreducible per-hit `memcmp` confirm
+  (threshold amended to <3 %). 1 M-key correctness gate validates
+  zero false positives/negatives.
+
+### Performance — Memory observability (Phase 190)
+
+- **`moon_memory_bytes{kind=…}` Prometheus gauge.** Seven subsystem
+  labels — `dashtable`, `hnsw`, `csr`, `wal`, `sealed_replication_backlog`,
+  `allocator_overhead`, and the rolled-up `total`. Updated every
+  scrape via a single hook so the sum reconciles to `RSS` within the
+  CI tolerance window.
+- **`MEMORY DOCTOR` full schema.** Multi-line RESP response covering
+  every subsystem, the rolled-up total, and a derived `allocator_overhead`
+  pseudo-kind (RSS − Σ subsystems). Adds operator triage signal beyond
+  the legacy single-line summary.
+- **`resident_bytes()` trait** implemented across `Database`,
+  `DashTable`, `VectorStore` (HNSW + IVF), `GraphStore` (CSR + SlotMap),
+  `WalWriter`, `ReplicationBacklog` (sealed-segment side), and
+  `AllocatorOverhead`. Zero-allocation, on-demand poll.
+- **Memory steady-state CI job.** `scripts/bench-memory-steady-state.sh`
+  + baseline fixture; gate widened to `±10 %` on RSS / Σ ratio after a
+  Linux-CI tolerance pass.
+
+### Changed — Allocator UX (Phase 191)
+
+- **jemalloc `narenas:8` cap** with `--memory-arenas-cap <N>` CLI
+  override. Caps the per-CPU arena explosion that inflates VSZ on
+  high-core hosts; mostly a cosmetic fix on Linux containers but
+  produces a meaningfully tighter `top`/`ps` reading for operators.
+- **Tri-state allocator selection.** New `mimalloc-alt` cargo
+  feature alongside the existing `jemalloc` / `mimalloc` (fallback)
+  paths; mutually exclusive at compile time. A/B benchmark script
+  `scripts/bench-allocator-ab.sh` ships with the release.
+- **`docs/OPERATOR-GUIDE.md` — Memory Accounting section.** Documents
+  the VSZ-vs-RSS distinction, MEMORY DOCTOR field-by-field, and the
+  `--memory-arenas-cap` / `mimalloc-alt` tuning knobs.
+
+### Added — Dispatch Observability (Phase 177)
+
+- **`moon_dispatch_path_total{path=...}` Prometheus counter**: four-way classification of every command by shard-routing decision — `local_inline` (SIMD fast path), `local` (standard local branch), `cross_read_fast` (RwLock shared-read bypass of SPSC), `cross_spsc` (deferred cross-shard write via `PipelineBatchSlotted`). Ratio `cross_spsc / Σ` is the ground-truth signal for dispatch-layer optimization work. Zero-allocation hot-path overhead (`&'static str` labels, `#[inline]` with early-return on `!METRICS_INITIALIZED`). Verified on macOS + Linux: counter sums close exactly to driven traffic, no overcount.
+
+### Changed
+
+- **`text-index` is now a default feature.** BM25 full-text search (`FT.SEARCH` BM25 mode), `FT.AGGREGATE`, and three-way RRF hybrid fusion are included in all standard builds. No longer requires `--features text-index`. To exclude it (e.g. minimal embedded builds): `--no-default-features --features runtime-monoio,jemalloc,graph`.
+
+### Added — SDK Validation
+
+- **Python SDK `sdk/python/examples/validate.py`**: End-to-end live validator for all SDK sub-clients: ping, strings, counter, hash, list, set, zset, vector index lifecycle, graph engine, session search, semantic cache, text search (BM25 + aggregate + hybrid), and server info. Result against Moon with `text-index`: **114 PASS / 0 FAIL / 0 SKIP**. Gracefully skips text sections when server built without `text-index`.
+- **Rust SDK `sdk/rust/examples/validate.rs`**: Re-validated against `text-index` build — **85 PASS / 0 FAIL**.
+
+### Fixed — Python SDK
+
+- **`moondb.graph._parse_neighbors`**: server returns alternating `[edge_map, node_map, ...]` as flat key-value arrays (`b'id'`, `int`, `b'src'`, `int`, `b'dst'`, `int`, …). Previous parser expected positional `[node_id, label, props]` — caused `int() on b'id'` crash. Now correctly identifies node entries by `labels` key and parses them from the flat kv format.
+
+### Fixed — CI Hygiene
+
+- **`tests/pipeline_auto_index.rs`**: tighten outer cfg from `runtime-tokio` to `all(runtime-tokio, text-index)` so the file compiles to zero tests when text-index is disabled. Previously the file compiled but the FT.SEARCH text fast path was `#[cfg]`-ed out, causing `@name:corpus` queries to fall through to the KNN-only parser and panic with "invalid KNN query syntax".
+- **4 FT unwraps**: add inline `#[allow(clippy::unwrap_used)]` with invariant justifications in `vector_search/ft_text_search.rs` (3 sites inside `apply_post_processing` where `do_summarize` / `do_highlight` implies the Option is Some) and `handler_monoio/ft.rs:165` (`is_text` was derived from `query_bytes.as_ref().map_or(false, _)`). Restores the audit-unwrap baseline to 0.
+
+### Compatibility
+
+- **Wire protocol**: unchanged. Drop-in replacement for v0.1.11.
+- **Persistence on-disk format**: unchanged.
+- **Default feature set**: `text-index` is now on by default. Minimal
+  embedded builds need an explicit `--no-default-features --features
+  runtime-monoio,jemalloc,graph`.
+
 ## [0.1.11] — 2026-04-27
 
 Hot-path perf release — eliminates two atomic-CAS hot paths in the write
@@ -537,29 +653,47 @@ All 2450 lib tests passing locally (`cargo test --no-default-features
 - **Persistence on-disk format**: unchanged.
 - **Replication wire format**: unchanged.
 
-## [Unreleased]
-
-### Added — Dispatch Observability (Phase 177)
-
-- **`moon_dispatch_path_total{path=...}` Prometheus counter**: four-way classification of every command by shard-routing decision — `local_inline` (SIMD fast path), `local` (standard local branch), `cross_read_fast` (RwLock shared-read bypass of SPSC), `cross_spsc` (deferred cross-shard write via `PipelineBatchSlotted`). Ratio `cross_spsc / Σ` is the ground-truth signal for dispatch-layer optimization work. Zero-allocation hot-path overhead (`&'static str` labels, `#[inline]` with early-return on `!METRICS_INITIALIZED`). Verified on macOS + Linux: counter sums close exactly to driven traffic, no overcount.
-
-### Fixed — CI Hygiene
-
-- **`tests/pipeline_auto_index.rs`**: tighten outer cfg from `runtime-tokio` to `all(runtime-tokio, text-index)` so the file compiles to zero tests when text-index is disabled. Previously the file compiled but the FT.SEARCH text fast path was `#[cfg]`-ed out, causing `@name:corpus` queries to fall through to the KNN-only parser and panic with "invalid KNN query syntax".
-- **4 FT unwraps**: add inline `#[allow(clippy::unwrap_used)]` with invariant justifications in `vector_search/ft_text_search.rs` (3 sites inside `apply_post_processing` where `do_summarize` / `do_highlight` implies the Option is Some) and `handler_monoio/ft.rs:165` (`is_text` was derived from `query_bytes.as_ref().map_or(false, _)`). Restores the audit-unwrap baseline to 0.
-
-### Changed
-
-- **`text-index` is now a default feature.** BM25 full-text search (`FT.SEARCH` BM25 mode), `FT.AGGREGATE`, and three-way RRF hybrid fusion are included in all standard builds. No longer requires `--features text-index`. To exclude it (e.g. minimal embedded builds): `--no-default-features --features runtime-monoio,jemalloc,graph`.
-
-### Added — SDK Validation
-
-- **Python SDK `sdk/python/examples/validate.py`**: End-to-end live validator for all SDK sub-clients: ping, strings, counter, hash, list, set, zset, vector index lifecycle, graph engine, session search, semantic cache, text search (BM25 + aggregate + hybrid), and server info. Result against Moon with `text-index`: **114 PASS / 0 FAIL / 0 SKIP**. Gracefully skips text sections when server built without `text-index`.
-- **Rust SDK `sdk/rust/examples/validate.rs`**: Re-validated against `text-index` build — **85 PASS / 0 FAIL**.
-
-### Fixed — Python SDK
-
-- **`moondb.graph._parse_neighbors`**: Server returns alternating `[edge_map, node_map, ...]` as flat key-value arrays (`b'id'`, `int`, `b'src'`, `int`, `b'dst'`, `int`, …). Previous parser expected positional `[node_id, label, props]` — caused `int() on b'id'` crash. Now correctly identifies node entries by `labels` key and parses them from the flat kv format.
+## [0.1.10] — 2026-04-23
+
+Stable replication marker. **Single-shard PSYNC2 wired end-to-end and
+production-ready** for `--shards 1` master with any `--shards N` replica
+topology. Multi-shard master PSYNC is scheduled for v0.2 (see
+`.planning/rfcs/multi-shard-replication-design.md`).
+
+- **Replication** (`081c43b`): single-shard master PSYNC2 end-to-end wired,
+  REPLCONF validated, `master_link_status` reports the actual handshake
+  state instead of the legacy `up` stub.
+- **Performance**: batch-level eviction gate; `try_handle_*` paths
+  `#[inline]`-ed; DashTable carries through the v0.1.10 pre-size
+  groundwork (capacity hint + headroom).
+- **Docs**: BENCHMARK.md §2.7 updated with the 2026-04-22 GCloud
+  re-measurement; v0.1.x replication scope documented under
+  `docs/guides/clustering.mdx#replication`.
+
+## [0.1.9] — 2026-04-19
+
+**Lunaris Retriever Gap Closure.** Every v0.1.8 client-side fallback in
+the Lunaris SDK is now closed so `HybridRRFRetriever` (dense path),
+`GraphFirstRetriever`, and `PathReasoningRetriever` run Moon-native.
+
+- **Phase 167 CYP-01/02**: Cypher `CREATE` / `MERGE` writes participate
+  in `CrossStoreTxn` via `record_graph()`; `TXN.ABORT` rolls them back.
+- **Phase 168 CYP-03/06**: `coalesce()` built-in + single-hop edge-var
+  binding in variable-length `EXPAND`.
+- **Phase 169 CYP-04/05**: `shortestPath()` parser + Dijkstra executor
+  bridge with path-variable binding.
+- **Phase 170 HYB-01/02/04**: `FT.SEARCH HYBRID` dense stream honours
+  `as_of_lsn`.
+- **Phase 171 SCAT-01/02/03**: `ShardMessage::VectorSearch` +
+  `FtHybridPayload` carry `as_of_lsn` for multi-shard `AS_OF` correctness.
+- **Phase 172 PIPE-01/02/03**: pipeline-aware HSET auto-indexing
+  regression guard (3-test suite).
+
+Audit status: **PASSED_WITH_DOCUMENTED_DEFERRALS**. 15 / 20 requirements
+fully satisfied; HYB-03 BM25 MVCC deferred and closed in v0.1.10
+follow-up (G-1); Phase 173 hygiene HYG-02 handler split RFC'd.
+
+Stats: 6 phases shipped, 17 plans, 27 files changed, +2924 / −376 LOC.
 
 ## [0.1.8] — 2026-04-18
 
@@ -837,7 +971,14 @@ All 2450 lib tests passing locally (`cargo test --no-default-features
 - **Security hardening:** `deny.toml` (cargo-deny), `SECURITY.md`, `docs/THREAT-MODEL.md`, `docs/security/lua-sandbox.md`, TLS cipher suite freeze
 - **Release engineering:** `docs/versioning.md`, 6 operator runbooks, CHANGELOG CI gate, user docs (getting-started, configuration, monitoring), release pipeline SHA256 checksums + SBOM + cosign
 
-## [Earlier Unreleased] - Dispatch Hot-Path Recovery (2026-04-08)
+## [0.1.3] — 2026-04-10
+
+Production-readiness foundation: dispatch hot-path recovery, vector-search
+4× QPS + correctness fixes, and the tiered disk-offload landing with 100 %
+crash recovery across 7 persistence configurations. Bundles three work
+streams originally tracked as separate Unreleased blocks (Apr 7–8).
+
+### Dispatch Hot-Path Recovery (2026-04-08)
 
 **Pipelined SET +37%, pipelined GET +68% at p=16 after PR #43 regression recovery.**
 
@@ -934,9 +1075,7 @@ captured as todo in `.planning/todos/pending/`.
 
 ---
 
-## [Unreleased] - Vector Search 4x QPS + Correctness
-
-### Vector Search Performance & Correctness (2026-04-07)
+### Vector Search 4× QPS + Correctness (2026-04-07)
 
 **4x search QPS, 4.1x lower latency, 2.56x faster than Qdrant on real MiniLM data.**
 
@@ -991,9 +1130,9 @@ embeddings (clustered) achieve 0.92-0.97 recall with the same code.
 
 ---
 
-## [Earlier Unreleased] - Disk Offload & x86_64 Performance
+### Disk Offload & x86_64 Performance (2026-04-06)
 
-Tiered storage, crash recovery, and 2x Redis on x86_64 (Intel Xeon, io_uring).
+Tiered storage, crash recovery, and 2× Redis on x86_64 (Intel Xeon, io_uring).
 
 ### Added
 
diff --git a/README.md b/README.md
index 5fcfbbe7..350c8861 100644
--- a/README.md
+++ b/README.md
@@ -7,11 +7,12 @@
 </p>
 
 <p align="center">
-  <a href="https://github.com/pilotspace/moon/releases/tag/v0.1.10"><img src="https://img.shields.io/badge/version-v0.1.10-blue" alt="Version"></a>
+  <a href="https://github.com/pilotspace/moon/releases/tag/v0.1.12"><img src="https://img.shields.io/badge/version-v0.1.12-blue" alt="Version"></a>
   <a href="https://crates.io/crates/moondb"><img src="https://img.shields.io/crates/v/moondb?label=moondb" alt="Rust SDK"></a>
   <a href="https://pypi.org/project/moondb/"><img src="https://img.shields.io/pypi/v/moondb?label=moondb" alt="Python SDK"></a>
   <a href="LICENSE"><img src="https://img.shields.io/badge/license-Apache--2.0-blue" alt="License"></a>
-  <img src="https://img.shields.io/badge/status-experimental-orange" alt="Status">
+  <img src="https://img.shields.io/badge/single--node-production--grade-success" alt="Status">
+  <img src="https://img.shields.io/badge/cluster-v0.2%20alpha-yellow" alt="Cluster status">
   <img src="https://img.shields.io/badge/rust-edition%202024-orange" alt="Rust">
   <img src="https://img.shields.io/badge/redis--compatible-RESP2%2FRESP3-red" alt="Protocol">
 </p>
@@ -20,17 +21,38 @@
   <a href="#quick-start">Quick start</a> &bull;
   <a href="#why-moon">Why Moon</a> &bull;
   <a href="#benchmarks">Benchmarks</a> &bull;
+  <a href="#moon-vs-redis-vs-valkey">Moon vs Redis vs Valkey</a> &bull;
+  <a href="#production-readiness">Production readiness</a> &bull;
   <a href="docs/index.mdx">Docs</a> &bull;
   <a href="CHANGELOG.md">Changelog</a>
 </p>
 
 ---
 
-> **⚠ Experimental.** Moon is under active development and **not** recommended for production. Storage formats, APIs, and config flags may change between releases. Please [open an issue](https://github.com/pilotspace/moon/issues) if something breaks.
+> **Production-grade architecture, pre-1.0 maturity.** Single-node Moon
+> (`--shards N` master, `--shards 1` for replication-eligible workloads)
+> is recommended for production caching, vector / graph / feature-store
+> workloads, and Redis-compatible OLTP. Multi-node clustering and
+> multi-shard master PSYNC are **alpha in v0.2** — see
+> [Production readiness](#production-readiness) for the honest matrix of
+> what is and isn't yet GA. Wire protocol and on-disk format are LTS as of
+> v0.2 (`docs/STORAGE-FORMAT-V1.md`); CLI flags may still evolve until
+> v1.0. [Open an issue](https://github.com/pilotspace/moon/issues) if
+> something breaks.
 
 ---
 
-Moon speaks the Redis wire protocol (RESP2/RESP3) and implements 230+ commands. It runs on **Linux** (io_uring via monoio) and **macOS** (kqueue via monoio) with a thread-per-core, shared-nothing architecture, per-shard WAL, tiered disk offload, an in-process vector search engine with BM25 full-text search, a property graph engine with Cypher subset, cross-store ACID transactions, workspace partitioning, durable message queues, bi-temporal MVCC, and an embedded web console. Any Redis client connects out of the box.
+Moon is a clean-room Rust rewrite of a Redis-compatible in-memory data
+store with first-class AI primitives. It speaks the Redis wire protocol
+(RESP2/RESP3) and implements 230+ commands — every standard Redis data
+type plus native `FT.*` vector + BM25 search, `GRAPH.*` Cypher, `TXN.*`
+cross-store ACID, workspaces, durable message queues, bi-temporal MVCC,
+and an embedded web console. Primary target is **Linux** with io_uring
+(`monoio`); a `tokio` runtime is available for portability. **macOS** is a
+supported development platform (kqueue via monoio); production
+deployments should target Linux (see
+[`docs/PRODUCTION-CONTRACT.md`](docs/PRODUCTION-CONTRACT.md) Tier 1/2).
+Any Redis client connects out of the box.
 
 ## Why Moon
 
@@ -61,18 +83,30 @@ Moon speaks the Redis wire protocol (RESP2/RESP3) and implements 230+ commands.
 
 ## Benchmarks
 
-Measured vs Redis 8.6.1, co-located client and server, pipeline depth tuned per row. Full methodology and reproduction steps in [BENCHMARK.md](BENCHMARK.md) and [docs/benchmarks.mdx](docs/benchmarks.mdx).
+Measured vs Redis 8.6.1 (peak throughput) and Redis 8.0.2 + Valkey 9.1.0
+(hash-TTL surface, the only workload where all three were head-to-head
+benchmarked). Co-located client and server, pipeline depth tuned per
+row. Full methodology and reproduction steps in
+[BENCHMARK.md](BENCHMARK.md) and [docs/benchmarks.mdx](docs/benchmarks.mdx).
+Valkey peak-throughput columns are intentionally blank — a head-to-head
+peak-RPS bench on identical hardware has not yet been run; the
+[Moon vs Redis vs Valkey](#moon-vs-redis-vs-valkey) section quotes
+Valkey's vendor-published 2.1M RPS (9 I/O threads, p=10) for context.
 
 ### Peak throughput (GCloud c3-standard-8, x86_64, monoio io_uring)
 
-| Workload                         |   Moon | Redis |  Ratio |
-|----------------------------------|-------:|------:|:------:|
-| Peak GET (c=50, p=64)            | 5.11M  | 2.98M | **1.72×** |
-| Peak SET (c=50, p=64)            | 3.50M  | 1.82M | **1.92×** |
-| GET, production defaults (AOF+disk-offload) | 4.76M | 2.46M | **1.93×** |
-| GET, max durability (fsync always)| 4.85M  | 2.45M | **1.98×** |
-| Memory, values ≥ 1 KB            | —      | —     | **27–35% less** |
-| Crash recovery (SIGKILL, 5K keys)| 100%   | 100%  | parity |
+| Workload                                       |   Moon | Redis 8.6.1 | Valkey 9.1.0 |
+|------------------------------------------------|-------:|------------:|-------------:|
+| Peak GET (c=50, p=64)                          | 5.11M  | 2.98M (1.72×) | not yet benched |
+| Peak SET (c=50, p=64)                          | 3.50M  | 1.82M (1.92×) | not yet benched |
+| GET, production defaults (AOF + disk-offload)  | 4.76M  | 2.46M (1.93×) | not yet benched |
+| GET, max durability (`fsync=always`)           | 4.85M  | 2.45M (1.98×) | not yet benched |
+| Memory, values ≥ 1 KB                          | —      | **27–35 % less** | not yet benched* |
+| Crash recovery (SIGKILL, 5K keys)              | 100 %  | 100 % (parity)| 100 % (parity, vendor-claimed) |
+
+*Valkey 9.1 raised the embstr threshold to 128 B; below ~64 B Valkey
+9.1 may be tighter than Moon. A head-to-head re-bench across the value
+size curve is on the v0.2 roadmap.
 
 ### ARM64 (GCloud t2a-standard-8, Neoverse-N1)
 
@@ -98,9 +132,13 @@ Measured vs Redis 8.6.1, co-located client and server, pipeline depth tuned per
 | Native API QPS     | **19×**  | N/A     |
 | Bulk insert        | **23×**  | 1×      |
 
-### Hash-field TTL — Valkey 9.0/9.1 parity (OrbStack moon-dev, n=200K c=50, median of 3)
+### Hash-field TTL — Valkey 9.0/9.1 parity (v0.2.0-alpha preview)
 
-Three-way comparison on the per-field TTL surface added in v0.2.0. Full methodology + 26 scenarios in [docs/perf/2026-05-27-hash-ttl-3way-bench.md](docs/perf/2026-05-27-hash-ttl-3way-bench.md); reproducible via [scripts/bench-hash-ttl-3way.sh](scripts/bench-hash-ttl-3way.sh).
+> **Status:** ships in **v0.2.0-alpha** (currently on `main`, no tagged
+> release yet). Not present in the latest tagged build `v0.1.12`. Build
+> from `main` to reproduce these numbers.
+
+OrbStack moon-dev, n=200K c=50, median of 3. Three-way comparison on the per-field TTL surface added in v0.2.0. Full methodology + 26 scenarios in [docs/perf/2026-05-27-hash-ttl-3way-bench.md](docs/perf/2026-05-27-hash-ttl-3way-bench.md); reproducible via [scripts/bench-hash-ttl-3way.sh](scripts/bench-hash-ttl-3way.sh).
 
 | Command          | Pipeline | Moon  | Redis 8.0.2 | Valkey 9.1.0 |
 |------------------|----------|------:|------------:|-------------:|
@@ -111,7 +149,64 @@ Three-way comparison on the per-field TTL surface added in v0.2.0. Full methodol
 | `HGETEX EX`      | p=1      | 250K  | N/A         | 251K         |
 | `HGETEX` no-mode | p=1      | 250K  | N/A         | 253K         |
 
-Plain Hash HGET p=16 ties Redis (1.01×) and Valkey (1.00×). HEXPIRE-family Moon vs Valkey: 0.90–0.99× across the surface (Valkey leads HEXPIRE p=16 by 7%, HTTL p=16 by 10%; HGETEX hits 0.99× parity). Redis 8.x has no HEXPIRE-family — Moon is the only Redis-compatible alternative aside from Valkey. The internal `HashWithTtl` HGET / HLEN paths use a cached `min_expiry_ms` for an O(1) fast path that brings them to 1.03× of plain `Hash` (was 80× slower pre-fix; see PR #126).
+Plain Hash HGET p=16 ties Redis (1.01×) and Valkey (1.00×). HEXPIRE-family Moon vs Valkey: 0.90–0.99× across the surface (Valkey leads HEXPIRE p=16 by 7 %, HTTL p=16 by 10 %; HGETEX hits 0.99× parity). Redis 8.x has no HEXPIRE-family — Moon is the only Redis-compatible alternative aside from Valkey. The internal `HashWithTtl` HGET / HLEN paths use a cached `min_expiry_ms` for an O(1) fast path that brings them to 1.03× of plain `Hash` (was 80× slower pre-fix; see PR #126).
+
+## Moon vs Redis vs Valkey
+
+Three Redis-protocol-compatible servers, three different bets. Moon
+competes on a **vertical moat** — thread-per-core architecture and an
+AI-native in-core surface. Valkey competes on a **horizontal moat** —
+Linux Foundation governance, every major cloud, drop-in compatibility.
+Redis OSS continues as the upstream reference but ships under SSPL since
+March 2024. The deep architectural review is in
+[`docs/comparison-valkey.md`](docs/comparison-valkey.md) (~22 KB,
+traced to source).
+
+| Dimension                            | **Moon v0.1.12 / 0.2.0-alpha** | **Valkey 9.1.0**            | **Redis 8.6.1 (OSS)**     |
+|--------------------------------------|--------------------------------|-----------------------------|----------------------------|
+| Language / license                   | Rust 2024 / Apache-2.0         | C99 / BSD-3-Clause (LF TSC) | C99 / **SSPL** since 2024 |
+| Threading model                      | Thread-per-core, shared-nothing | Single main thread + ≤9 I/O threads | Single-threaded core (+ I/O threads) |
+| I/O driver (Linux)                   | io_uring (`monoio`)            | epoll only                  | epoll only                |
+| Snapshot                             | **Forkless** (segment-level COW) | `fork()` + COW              | `fork()` + COW            |
+| AOF / WAL                            | Per-shard WAL v3 + multi-part AOF | Single global AOF         | Single global AOF         |
+| Tiered NVMe disk offload             | **Yes** (under `maxmemory`)    | No (OSS)                    | No (OSS — Redis Flash is Enterprise) |
+| Vector search                        | **In-core** HNSW + TurboQuant 1-8 bit | `valkey-search` module, FP32 only | RediSearch module |
+| Full-text BM25                       | **In-core**                    | `valkey-search` module      | RediSearch module         |
+| Property graph (Cypher)              | **In-core** 14 `GRAPH.*` cmds  | None                        | None (FalkorDB separate)  |
+| Cross-store ACID                     | `TXN.BEGIN/COMMIT/ABORT`       | None                        | None                      |
+| Hash-field TTL (`HEXPIRE`-family)    | **Yes** (Valkey-parity, v0.2-alpha) | **Yes** (9.0+)         | No                        |
+| PITR + CDC                           | `--recovery-target-lsn` + `CDC.READ` (v0.2-alpha) | None | None |
+| Embedded web console                 | **Yes** (7-view React, in-binary) | Valkey Admin GUI 1.0 (separate) | Redis Insight (separate) |
+| Managed cloud offerings              | None (yet)                     | AWS, GCP, Oracle, Aiven, … | Redis Cloud (vendor)      |
+| Multi-node cluster, soak-tested      | **v0.2 alpha** (single-node GA today) | **Production**         | **Production**            |
+| Atomic slot migration                | Planned (v0.2)                 | Yes (9.0)                   | No                        |
+| Peak single-server throughput        | **5.11M GET/s** (c3-8 x86_64)  | 2.1M RPS (9 I/O threads, p=10, vendor) | 2.98M GET/s (c3-8, same harness as Moon) |
+
+### When to choose Moon
+
+- Single-node Redis-compatible workloads where peak throughput,
+  memory efficiency at ≥1 KB values, or **forkless snapshots** matter.
+- AI-native applications: vector search, GraphRAG, semantic cache,
+  hybrid BM25 + dense + sparse retrieval — all in one binary, no module
+  loader, with cross-store ACID across KV / vector / graph.
+- Workloads that benefit from **tiered NVMe offload** under `maxmemory`
+  instead of LRU-eviction-then-rebuild.
+
+### When to choose Valkey
+
+- Multi-node clusters with proven 1000+ node operational mileage.
+- Managed-cloud-only deployments (every major cloud offers Valkey).
+- Strict drop-in compatibility with the Redis 7.2 module ecosystem
+  (`valkey-json`, `valkey-bloom`, `valkey-search`, `valkey-ldap`).
+- Risk-averse environments where Linux Foundation governance is a
+  procurement requirement.
+
+### When to stay on Redis OSS
+
+- Existing investments in RediSearch / RedisJSON / RedisBloom under
+  RLEC, or pre-SSPL-tolerance OSS Redis.
+- Specific Redis Enterprise features (CRDT active-active, Redis Flash)
+  that Moon and Valkey OSS do not match.
 
 ## Quick start
 
@@ -148,9 +243,9 @@ OK
 "world"
 127.0.0.1:6379> HSET user:1 name Alice age 30
 (integer) 2
-127.0.0.1:6379> HEXPIRE user:1 3600 FIELDS 1 age   # Valkey 9.0 per-field TTL
+127.0.0.1:6379> HEXPIRE user:1 3600 FIELDS 1 age   # Valkey 9.0 per-field TTL (v0.2.0-alpha; build from `main`)
 1) (integer) 1
-127.0.0.1:6379> HTTL user:1 FIELDS 1 age
+127.0.0.1:6379> HTTL user:1 FIELDS 1 age           # v0.2.0-alpha
 1) (integer) 3600
 127.0.0.1:6379> FT.CREATE idx ON HASH PREFIX 1 doc: SCHEMA emb VECTOR HNSW 6 DIM 384 TYPE FLOAT32 DISTANCE_METRIC COSINE
 OK
@@ -326,34 +421,123 @@ cargo flamegraph --bin moon -- --port 6399 --shards 1
 
 Contribution guide and coding rules (unsafe policy, hot-path allocation rules, lock discipline) are in [CLAUDE.md](CLAUDE.md) and [UNSAFE_POLICY.md](UNSAFE_POLICY.md).
 
-## Roadmap
-
-Moon is pre-1.0 and **experimental**. Current focus:
-
-- Correctness parity with Redis 8.x across the full command surface
-- AI-native primitives: session dedup, hybrid vector+sparse search, agentic caching
-- Multi-node clustering with gossip, slot migration, and PSYNC2 replication
-- GPU-accelerated vector search (CUDA, feature-gated)
-- Production hardening and SLO validation (see [docs/PRODUCTION-CONTRACT.md](docs/PRODUCTION-CONTRACT.md))
-
-Completed in v0.1.0–v0.1.8:
-- Tiered disk offload (RAM → NVMe) with 100% crash recovery
-- In-process vector search (HNSW + TurboQuant 4/8-bit) with `FT.*` API
-- BM25 full-text search with three-way hybrid fusion (BM25 + dense + sparse)
-- Property graph engine with Cypher subset (14 `GRAPH.*` commands)
-- Cross-store ACID transactions (`TXN.BEGIN`/`COMMIT`/`ABORT`) across KV, vector, and graph
-- Workspace partitioning for multi-tenant namespace isolation
-- Durable message queues with dead-letter and debounced triggers
-- Bi-temporal MVCC for point-in-time KV and graph queries
-- Web console (7 views, embedded in binary)
-- macOS support (aarch64 + x86_64, both runtimes)
-- Thread-per-core dispatch optimization (5.11M GET/s on x86_64)
-
-Production readiness is **not** a v0.1 goal. Storage formats, APIs, and config flags may change between releases.
-
-## Production Readiness
-
-Moon's v1.0 promises — SLOs, durability modes, supported platforms, and a machine-checkable GA exit-criteria checklist — live in **[docs/PRODUCTION-CONTRACT.md](docs/PRODUCTION-CONTRACT.md)**. The contract is the single source of truth every v0.1.3 hardening phase tests against.
+## Production readiness
+
+Honest matrix of where Moon is today (v0.1.12 GA + v0.2.0-alpha
+unreleased). Read alongside
+[`docs/PRODUCTION-CONTRACT.md`](docs/PRODUCTION-CONTRACT.md) (the
+machine-checkable GA exit criteria) and
+[`docs/OPERATOR-GUIDE.md`](docs/OPERATOR-GUIDE.md) (memory accounting,
+sizing, runbooks).
+
+### Recommended for production today
+
+- **Single-node deployments** — Linux aarch64 (Tier 1) or Linux x86_64
+  (Tier 2). `--shards N` master, one process, one node.
+- **Read replication** — `--shards 1` master with any `--shards N`
+  replica topology. Single-shard PSYNC2 is wired end-to-end since
+  v0.1.10.
+- **AI workloads** — vector search (HNSW + TurboQuant), BM25 full-text,
+  GraphRAG, semantic caching, hybrid retrieval. All in-core, all
+  RDB / WAL durable, all crash-recovery validated.
+- **Cache + feature store** — durability modes are honest
+  (`always` / `everysec` / `no` with documented recovery bounds), forkless
+  snapshots remove the Redis fork-COW RSS spike, tiered NVMe offload
+  under `maxmemory` keeps working sets larger than RAM.
+- **Crash recovery** — 100 % survived across 7 persistence
+  configurations and 5 K-key SIGKILL workloads. RDB v2 + WAL v3 +
+  multi-part AOF + tiered cold tier all participate.
+
+### Not yet GA — avoid for production
+
+- **Multi-node clustering** (16 K-slot gossip, MOVED/ASK, failover) —
+  protocol-compatible code exists but **PSYNC2 atomic slot migration
+  is not soak-tested**. Valkey 9.0 shipped this; Moon has not.
+  Scheduled for v0.2.
+- **Multi-shard master PSYNC** — single-shard only today; multi-shard
+  master replication is RFC'd in
+  [`.planning/rfcs/multi-shard-replication-design.md`](.planning/rfcs/multi-shard-replication-design.md).
+- **PITR live-snapshot LSN wiring** (P3c) and **`CDC.SUBSCRIBE` push
+  channel** (C3b) — `CDC.READ` polling is alpha-ready; push and
+  zero-snapshot PITR are deferred to v0.2 follow-ups.
+- **GPU vector acceleration** (`gpu-cuda` feature) — kernel scaffold
+  exists; production kernels not yet shipping.
+- **macOS native** — first-class development platform with full feature
+  set minus io_uring, but production deployments should target Linux
+  per
+  [`docs/PRODUCTION-CONTRACT.md`](docs/PRODUCTION-CONTRACT.md) tiers.
+- **Performance SLO numbers in
+  [`docs/PRODUCTION-CONTRACT.md`](docs/PRODUCTION-CONTRACT.md)** — marked
+  `[provisional]` until the Phase 97 24-h HDR-histogram rig validates
+  them on reference hardware. Use the benchmarks above as point-in-time
+  measurements, not committed SLOs.
+
+### Operator gotchas worth knowing before you deploy
+
+- **Multi-shard scaling needs load.** Aim for `clients ≥ 25 × shards`
+  on pipeline ≥ 16 workloads; below that, multiple shards under-subscribe
+  the dispatch loop and a single shard wins. Random keyspaces with
+  `p=1` benefit less than `{tag}`-co-located keys (see
+  [`CLAUDE.md`](CLAUDE.md) "Gotchas" + the v0.1.12 multi-shard memo).
+- **Fairness flags for benchmarking against Redis / Valkey** —
+  `--disk-offload disable --appendonly no` removes Moon's durability
+  overhead (~26 % on SET p=64) so comparisons are apples-to-apples.
+- **Memory accounting** — bill on RSS, not VSZ. `MEMORY DOCTOR`,
+  `moon_memory_bytes{kind=…}` Prometheus gauge, and
+  [`docs/OPERATOR-GUIDE.md#memory-accounting`](docs/OPERATOR-GUIDE.md#memory-accounting)
+  cover the full VSZ-vs-RSS guide and tuning knobs
+  (`--memory-arenas-cap`, `mimalloc-alt`).
+- **RDB hash-field TTL trailer** is RDB v2 only — older v1 readers stop
+  after the hash body and silently drop per-field TTLs. Pin storage
+  format to `v1` (the umbrella covering v2/v3 sub-formats) per
+  [`docs/STORAGE-FORMAT-V1.md`](docs/STORAGE-FORMAT-V1.md).
+
+### Roadmap
+
+| Milestone        | Focus                                                                                       | Status      |
+|------------------|---------------------------------------------------------------------------------------------|-------------|
+| **v0.2.0** (next) | Multi-node clustering soak (PSYNC2 + atomic slot migration); PITR P3c; CDC push (`SUBSCRIBE`) | alpha       |
+| **v0.2.x**       | GPU vector acceleration (`gpu-cuda`); operator runbooks; full SLO lock-in (`PERF-01..05`)    | planned     |
+| **v1.0**         | Every [`PRODUCTION-CONTRACT.md`](docs/PRODUCTION-CONTRACT.md) GA exit-criteria box ticked    | gate        |
+
+What's already in `main` (v0.1.0 → v0.2.0-alpha, 14 months of work):
+
+> Hash-field TTL and PITR + CDC ship in **v0.2.0** — currently alpha on
+> `main`, no tagged release yet. **v0.1.12** is the latest tag and does
+> NOT include them. Everything else below is in `v0.1.12` GA.
+
+**Shipped in v0.1.12 (latest tag, single-node production-grade):**
+
+- Forkless persistence (RDB v2 + per-shard WAL v3 + multi-part AOF).
+- Tiered disk offload (RAM → NVMe) with 100 % crash recovery.
+- In-process vector search (HNSW + TurboQuant 1–8-bit) — `FT.*` surface.
+- BM25 full-text search + three-way RRF hybrid fusion (BM25 + dense + sparse).
+- Property graph engine with Cypher subset (14 `GRAPH.*` commands).
+- Cross-store ACID (`TXN.BEGIN` / `COMMIT` / `ABORT`) across KV, vector,
+  graph.
+- Workspaces, durable message queues, bi-temporal MVCC.
+- Web console (7-view React app, embedded in binary).
+- Thread-per-core dispatch optimization (5.11M GET/s on x86_64).
+
+**Added in v0.2.0-alpha (on `main`, untagged):**
+
+- Hash-field TTL — Valkey 9.0 / 9.1 parity, O(1) HGET / HLEN fast path
+  (`HEXPIRE`, `HEXPIREAT`, `HPEXPIRE`, `HPEXPIREAT`, `HEXPIRETIME`,
+  `HPEXPIRETIME`, `HTTL`, `HPTTL`, `HPERSIST`, `HGETEX`, `HGETDEL`).
+- PITR — `--recovery-target-lsn` deterministic WAL replay-to-LSN.
+- CDC — `CDC.READ` pull-mode change stream (push-mode planned for v0.2.0
+  GA).
+- Multi-node clustering soak (PSYNC2 + atomic slot migration).
+
+## Production Readiness Contract
+
+Moon's v1.0 promises — per-command-class SLOs, durability modes, supported
+platforms, security guarantees, and a machine-checkable GA exit-criteria
+checklist — live in
+**[`docs/PRODUCTION-CONTRACT.md`](docs/PRODUCTION-CONTRACT.md)**. Every
+v0.1.3+ hardening phase ticks off items on that checklist; nothing
+promotes to `v1.0-rc1` until every box is green. The contract is the
+single source of truth for what Moon owes you in production.
 
 ## Credits
 
diff --git a/docs/runbooks/multi-shard-aof-rewrite.md b/docs/runbooks/multi-shard-aof-rewrite.md
new file mode 100644
index 00000000..9c33b609
--- /dev/null
+++ b/docs/runbooks/multi-shard-aof-rewrite.md
@@ -0,0 +1,97 @@
+# Runbook — multi-shard + AOF refused at startup
+
+**Status:** active (Moon ≥ v0.1.13). Lifted in v2.0 once multi-shard
+AOF replay walks every shard's segment manifest on recovery.
+
+## What you saw
+
+### At startup
+
+```
+REFUSING TO START: --shards 2 + --appendonly yes has a known data-loss
+bug on SIGKILL (~50 % loss verified 2026-05-26). Fix: use --shards 1,
+or pass --appendonly no for cache-only deployments, or pass
+--unsafe-multishard-aof to acknowledge the risk and start anyway. See
+docs/runbooks/multi-shard-aof-rewrite.md.
+```
+
+### Or at command time (defence-in-depth, fires under any escape-hatch)
+
+```
+> BGREWRITEAOF
+(error) ERR BGREWRITEAOF is unsafe with --shards >= 2 + --disk-offload enable
+        + --appendonly yes (known data-loss bug; see
+        docs/runbooks/multi-shard-aof-rewrite.md). Use --shards 1, set
+        --disk-offload disable, or wait for v2.0 multi-part AOF replay.
+```
+
+## Why this gate exists
+
+Verified on `main` at commit `6e49050` (2026-05-26), reproducers in
+[`tmp/p0-no-rewrite.sh`](../../tmp/p0-no-rewrite.sh),
+[`tmp/p0-always.sh`](../../tmp/p0-always.sh),
+[`tmp/p0-multishard-no-offload.sh`](../../tmp/p0-multishard-no-offload.sh):
+
+| Configuration                                                              | Result                       |
+|----------------------------------------------------------------------------|------------------------------|
+| `--shards 1 --appendonly yes --appendfsync always` (control)               | ✅ Recovers 5000 / 5000       |
+| `--shards 1 --disk-offload enable --appendonly yes` (control)              | ✅ Recovers 12 714 / 12 714   |
+| `--shards 2 --disk-offload enable --appendonly yes --appendfsync everysec` | ❌ Loses 38 % (12 662 → 7 892) |
+| `--shards 2 --disk-offload enable --appendonly yes --appendfsync always`   | ❌ Loses 50 % (5000 → 2474)   |
+| `--shards 2 --disk-offload disable --appendonly yes --appendfsync always`  | ❌ Loses 50 % (5000 → 2453)   |
+
+**The bug is in the multi-shard AOF durability path itself**, not the
+rewrite path and not the disk-offload tier. `--appendfsync always` and
+`--disk-offload disable` do not save you — only `--shards 1` does.
+
+The rewrite-specific gate (the `BGREWRITEAOF` error above) is still
+shipped as defence-in-depth for anyone who passes
+`--unsafe-multishard-aof`.
+
+## How to recover from a triggered loss (if you hit this on < v0.1.13)
+
+1. If a recent RDB snapshot exists in `--dir`, stop the server, move
+   `appendonlydir/` aside, and let recovery rebuild from RDB +
+   surviving per-shard WAL only. RPO equals the time since the RDB
+   snapshot.
+2. If replication was running, promote a non-affected replica
+   (`REPLICAOF NO ONE`) and re-sync the affected node.
+3. If neither: data is lost. File a `P0` with the AOF manifest
+   contents (`appendonlydir/moon.aof.manifest`) and per-shard WAL
+   sizes.
+
+## How to avoid the gate
+
+Pick whichever matches your deployment:
+
+| Option                                             | Trade-off                                                                                |
+|----------------------------------------------------|------------------------------------------------------------------------------------------|
+| `--shards 1`                                       | **Recommended.** Best throughput on non-pipelined workloads; gives up multi-shard fan-out |
+| `--appendonly no`                                  | Cache-only deployments; durability falls back to `save` rules + RDB recovery             |
+| `--unsafe-multishard-aof`                          | **Discouraged.** Acknowledges ~50 % loss on crash; suitable only for ephemeral caches    |
+
+The first option also clears the `BGREWRITEAOF` defence-in-depth gate.
+
+## When will this be removed?
+
+v2.0 ships multi-shard AOF replay that walks every shard's segment
+manifest on recovery. Both gates (startup refusal + `BGREWRITEAOF`
+error) are removed at the same time. Track progress at
+[`tmp/SHIP-PLAN-v1.0-rc1-single-node.md`](../../tmp/SHIP-PLAN-v1.0-rc1-single-node.md)
+§ Track B.
+
+## Telemetry
+
+When `--unsafe-multishard-aof` is passed AND the suspect config is set,
+the BGREWRITEAOF-specific gate also logs at startup at `WARN`:
+
+```
+BGREWRITEAOF gated for this config (known data-loss path; see
+docs/runbooks/multi-shard-aof-rewrite.md). Use --shards 1 or
+--disk-offload disable to re-enable rewrite.
+```
+
+Each gated `BGREWRITEAOF` invocation also returns the documented `ERR`
+line at the wire, so any operator dashboard tailing `slowlog` or
+client-side error counters will surface the refusal immediately. A
+dedicated Prometheus gauge for both gates is on the v1.0-rc1 backlog.
diff --git a/src/command/persistence.rs b/src/command/persistence.rs
index 8ca5b48e..10980fe3 100644
--- a/src/command/persistence.rs
+++ b/src/command/persistence.rs
@@ -7,9 +7,7 @@ use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
 use bytes::Bytes;
 use tracing::{error, info};
 
-use crate::runtime::channel;
-
-use crate::persistence::aof::AofMessage;
+use crate::persistence::aof::{AofMessage, AofPoolSendError, AofWriterPool};
 use crate::persistence::rdb;
 use crate::protocol::Frame;
 use crate::storage::Database;
@@ -33,6 +31,21 @@ pub static BGSAVE_SHARDS_REMAINING: AtomicU64 = AtomicU64::new(0);
 /// Whether the last BGSAVE completed successfully.
 pub static BGSAVE_LAST_STATUS: AtomicBool = AtomicBool::new(true);
 
+/// Process-wide gate set at startup when the configuration combination
+/// `shards >= 2 + --disk-offload enable + --appendonly yes` is selected.
+///
+/// `BGREWRITEAOF` under this combination silently truncates the WAL of every
+/// shard except the rewriter's own shard while the consolidated multi-part AOF
+/// base RDB written by the rewrite is **not** consumed on restart (verified
+/// 2026-05-26 against HEAD `6e49050`: 38 % data loss reproducible). Until the
+/// v2.0 multi-part AOF replay walks every shard's segment manifest, the only
+/// safe behavior is to refuse the command in this config and point operators
+/// at the runbook.
+///
+/// Set once in `main.rs` after CLI parsing; never cleared. Checked by
+/// `bgrewriteaof_start_sharded` before dispatching the rewrite message.
+pub static MULTI_SHARD_AOF_REWRITE_UNSAFE: AtomicBool = AtomicBool::new(false);
+
 /// Global flag indicating whether a BGREWRITEAOF rewrite is currently in progress.
 ///
 /// Set to `true` when `bgrewriteaof_start` or `bgrewriteaof_start_sharded` dispatches
@@ -208,14 +221,31 @@ pub fn bgsave_shard_done(success: bool) {
     }
 }
 
+/// Translate an `AofWriterPool` send failure into a user-facing RESP error.
+/// Under PerShard layout, `pool.try_send_rewrite` returns
+/// `RewriteUnsupportedInPerShard` — the per-shard rewrite path lands in
+/// step 6 of the per-shard AOF RFC. Until then BGREWRITEAOF refuses with
+/// a stable error rather than silently no-op'ing.
+fn rewrite_pool_error_frame(err: AofPoolSendError) -> Frame {
+    match err {
+        AofPoolSendError::RewriteUnsupportedInPerShard => Frame::Error(Bytes::from_static(
+            b"ERR BGREWRITEAOF is not yet supported under per-shard AOF layout; per-shard rewrite ships in step 6 of the per-shard AOF migration",
+        )),
+        AofPoolSendError::SendFailed => Frame::Error(Bytes::from_static(
+            b"ERR Background AOF rewrite failed to start",
+        )),
+    }
+}
+
 /// Start a background AOF rewrite (BGREWRITEAOF command).
 ///
-/// Sends a Rewrite message to the AOF writer task, which will generate
-/// synthetic commands from current database state and replace the AOF file.
+/// Submits a Rewrite message through the writer pool, which generates
+/// synthetic commands from current database state and replaces the AOF
+/// file.
 ///
 /// Uses CAS to set `AOF_REWRITE_IN_PROGRESS`: if a rewrite is already running,
 /// returns an error immediately without corrupting the in-flight rewrite state.
-pub fn bgrewriteaof_start(aof_tx: &channel::MpscSender<AofMessage>, db: SharedDatabases) -> Frame {
+pub fn bgrewriteaof_start(pool: &AofWriterPool, db: SharedDatabases) -> Frame {
     // CAS: only proceed if currently false; prevents a second caller from
     // clearing the flag while the first rewrite is still in progress.
     if AOF_REWRITE_IN_PROGRESS
@@ -226,16 +256,15 @@ pub fn bgrewriteaof_start(aof_tx: &channel::MpscSender<AofMessage>, db: SharedDa
             b"ERR Background AOF rewrite already in progress",
         ));
     }
-    match aof_tx.try_send(AofMessage::Rewrite(db)) {
+    match pool.try_send_rewrite(AofMessage::Rewrite(db)) {
         Ok(()) => Frame::SimpleString(Bytes::from_static(
             b"Background append only file rewriting started",
         )),
-        Err(_) => {
-            // Channel send failed — rewrite never started; we set the flag so we clear it.
+        Err(e) => {
+            // Send failed (channel full) or PerShard rejection — rewrite never
+            // started, so clear the in-progress flag we just set.
             AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst);
-            Frame::Error(Bytes::from_static(
-                b"ERR Background AOF rewrite failed to start",
-            ))
+            rewrite_pool_error_frame(e)
         }
     }
 }
@@ -245,9 +274,18 @@ pub fn bgrewriteaof_start(aof_tx: &channel::MpscSender<AofMessage>, db: SharedDa
 /// Uses CAS to set `AOF_REWRITE_IN_PROGRESS`: if a rewrite is already running,
 /// returns an error immediately without corrupting the in-flight rewrite state.
 pub fn bgrewriteaof_start_sharded(
-    aof_tx: &channel::MpscSender<AofMessage>,
+    pool: &AofWriterPool,
     shard_databases: std::sync::Arc<crate::shard::shared_databases::ShardDatabases>,
 ) -> Frame {
+    // Refuse the rewrite under the known-unsafe config combo (see the
+    // MULTI_SHARD_AOF_REWRITE_UNSAFE doc comment).  This is the
+    // single-node v1.0-rc1 gate; the v2.0 multi-part AOF replay fix lifts
+    // it.
+    if MULTI_SHARD_AOF_REWRITE_UNSAFE.load(Ordering::Relaxed) {
+        return Frame::Error(Bytes::from_static(
+            b"ERR BGREWRITEAOF is unsafe with --shards >= 2 + --disk-offload enable + --appendonly yes (known data-loss bug; see docs/runbooks/multi-shard-aof-rewrite.md). Use --shards 1, set --disk-offload disable, or wait for v2.0 multi-part AOF replay.",
+        ));
+    }
     // CAS: only proceed if currently false; prevents a second caller from
     // clearing the flag while the first rewrite is still in progress.
     if AOF_REWRITE_IN_PROGRESS
@@ -258,16 +296,15 @@ pub fn bgrewriteaof_start_sharded(
             b"ERR Background AOF rewrite already in progress",
         ));
     }
-    match aof_tx.try_send(AofMessage::RewriteSharded(shard_databases)) {
+    match pool.try_send_rewrite(AofMessage::RewriteSharded(shard_databases)) {
         Ok(()) => Frame::SimpleString(Bytes::from_static(
             b"Background append only file rewriting started",
         )),
-        Err(_) => {
-            // Channel send failed — rewrite never started; we set the flag so we clear it.
+        Err(e) => {
+            // Send failed (channel full) or PerShard rejection — rewrite never
+            // started, so clear the in-progress flag we just set.
             AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst);
-            Frame::Error(Bytes::from_static(
-                b"ERR Background AOF rewrite failed to start",
-            ))
+            rewrite_pool_error_frame(e)
         }
     }
 }
@@ -358,4 +395,66 @@ mod tests {
         // Note: verify the static exists and is accessible
         let _ = BGSAVE_LAST_STATUS.load(Ordering::Relaxed);
     }
+
+    // Serialize the multi-shard gate test against any other test mutating
+    // the gate or AOF_REWRITE_IN_PROGRESS (parallel test runner otherwise
+    // races on the process-wide AtomicBools).
+    static GATE_TEST_LOCK: parking_lot::Mutex<()> = parking_lot::Mutex::new(());
+
+    #[test]
+    fn test_bgrewriteaof_sharded_refuses_under_unsafe_config() {
+        let _guard = GATE_TEST_LOCK.lock();
+        // Use a small bounded channel so the test does not need an AOF
+        // writer task; the gate must fire BEFORE try_send is reached.
+        // Wrap as a TopLevel pool to match the post-2e-β helper signature.
+        let (tx, _rx) = crate::runtime::channel::mpsc_bounded::<AofMessage>(1);
+        let pool = AofWriterPool::top_level(tx);
+        let shard_dbs = crate::shard::shared_databases::ShardDatabases::new(
+            vec![vec![crate::storage::Database::new()]],
+        );
+
+        // Snapshot prior state so the test is order-independent.
+        let prior = MULTI_SHARD_AOF_REWRITE_UNSAFE.load(Ordering::Relaxed);
+        let prior_in_progress = AOF_REWRITE_IN_PROGRESS.load(Ordering::SeqCst);
+        AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst);
+
+        // Gate ON → must refuse with the documented ERR (and must NOT flip
+        // AOF_REWRITE_IN_PROGRESS, otherwise a normal rewrite gets blocked).
+        MULTI_SHARD_AOF_REWRITE_UNSAFE.store(true, Ordering::Relaxed);
+        let frame = bgrewriteaof_start_sharded(&pool, shard_dbs.clone());
+        match frame {
+            Frame::Error(msg) => {
+                let s = std::str::from_utf8(&msg).unwrap();
+                assert!(
+                    s.contains("BGREWRITEAOF is unsafe")
+                        && s.contains("multi-shard-aof-rewrite.md"),
+                    "unexpected error: {s}"
+                );
+            }
+            other => panic!("expected Frame::Error, got {other:?}"),
+        }
+        assert!(
+            !AOF_REWRITE_IN_PROGRESS.load(Ordering::SeqCst),
+            "gate must not set AOF_REWRITE_IN_PROGRESS"
+        );
+
+        // Gate OFF → the gate error must NOT fire. (Without an AOF writer
+        // task draining the channel, the second call may succeed or return
+        // "failed to start" depending on buffer state; the contract under
+        // test here is only that the gate error is gone.)
+        MULTI_SHARD_AOF_REWRITE_UNSAFE.store(false, Ordering::Relaxed);
+        AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst);
+        let frame2 = bgrewriteaof_start_sharded(&pool, shard_dbs);
+        if let Frame::Error(msg) = &frame2 {
+            let s = std::str::from_utf8(msg).unwrap();
+            assert!(
+                !s.contains("BGREWRITEAOF is unsafe"),
+                "gate error fired with gate off: {s}"
+            );
+        }
+
+        // Restore prior state.
+        AOF_REWRITE_IN_PROGRESS.store(prior_in_progress, Ordering::SeqCst);
+        MULTI_SHARD_AOF_REWRITE_UNSAFE.store(prior, Ordering::Relaxed);
+    }
 }
diff --git a/src/config.rs b/src/config.rs
index b384a818..8ec55ad4 100644
--- a/src/config.rs
+++ b/src/config.rs
@@ -102,6 +102,17 @@ pub struct ServerConfig {
     #[arg(long, default_value = "yes")]
     pub appendonly: String,
 
+    /// Acknowledge the known multi-shard AOF durability bug and start
+    /// anyway.  Verified 2026-05-26 on HEAD `6e49050`:
+    /// `--shards >= 2 + --appendonly yes` loses ~50 % of writes on SIGKILL
+    /// regardless of `--appendfsync` or `--disk-offload` settings.  Until
+    /// the v2.0 multi-shard AOF replay lands, Moon refuses this config at
+    /// startup; pass this flag to override (e.g. cache-only deployments
+    /// where the loss is acceptable).  See
+    /// `docs/runbooks/multi-shard-aof-rewrite.md`.
+    #[arg(long, default_value_t = false)]
+    pub unsafe_multishard_aof: bool,
+
     /// AOF fsync policy (always/everysec/no)
     #[arg(long, default_value = "everysec")]
     pub appendfsync: String,
diff --git a/src/main.rs b/src/main.rs
index 8bff34ac..0e71efc5 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -44,7 +44,7 @@ use std::path::PathBuf;
 
 use clap::Parser;
 use moon::config::ServerConfig;
-use moon::persistence::aof::{self, AofMessage, FsyncPolicy};
+use moon::persistence::aof::{self, AofMessage, AofWriterPool, FsyncPolicy};
 use moon::runtime::cancel::CancellationToken;
 use moon::runtime::channel;
 use moon::runtime::{RuntimeFactoryImpl, traits::RuntimeFactory};
@@ -270,6 +270,24 @@ fn main() -> anyhow::Result<()> {
 
     info!("Starting with {} shards", num_shards);
 
+    // P0-FIX-01b: refuse to start under the known durability bug
+    // (`shards >= 2 + appendonly yes` loses ~50 % of writes on SIGKILL,
+    //  verified 2026-05-26 on HEAD `6e49050`; reproducer in
+    //  `tmp/p0-no-rewrite.sh` and `tmp/p0-always.sh`).  The bug is
+    // independent of `--appendfsync` and `--disk-offload` settings.  An
+    // operator can override via `--unsafe-multishard-aof` if the
+    // deployment is cache-only and the loss window is acceptable.
+    if num_shards >= 2 && config.appendonly == "yes" && !config.unsafe_multishard_aof {
+        eprintln!(
+            "REFUSING TO START: --shards {num_shards} + --appendonly yes has a known data-loss \
+             bug on SIGKILL (~50 % loss verified 2026-05-26). Fix: use --shards 1, or pass \
+             --appendonly no for cache-only deployments, or pass --unsafe-multishard-aof to \
+             acknowledge the risk and start anyway. See \
+             docs/runbooks/multi-shard-aof-rewrite.md."
+        );
+        std::process::exit(2);
+    }
+
     // T1.1: warn when maxclients < 25 × shards (undersubscription footgun).
     // Suppressed by MOON_NO_UNDERSUBSCRIPTION_WARN=1.
     if let Some(msg) = should_warn_undersubscription(config.maxclients, num_shards)
@@ -287,30 +305,115 @@ fn main() -> anyhow::Result<()> {
     // Collect connection senders for the listener before spawning shard threads
     let conn_txs: Vec<_> = (0..num_shards).map(|i| mesh.conn_tx(i)).collect();
 
-    // Set up AOF channel: single writer, all shards send to it via mpsc::Sender clones.
-    // The AOF writer task will be spawned on the listener runtime.
-    let aof_tx: Option<channel::MpscSender<AofMessage>> = if config.appendonly == "yes" {
-        let (tx, rx) = channel::mpsc_bounded::<AofMessage>(10_000);
-        let aof_token = cancel_token.child_token();
-        let fsync = FsyncPolicy::from_str(&config.appendfsync);
-        let aof_file_path = PathBuf::from(&config.dir).join(&config.appendfilename);
-        // AOF writer task will be spawned on the listener runtime (see below)
-        // We store rx to spawn later since listener_rt hasn't been created yet.
-        // Instead, spawn on a dedicated thread so it's available before listener starts.
-        std::thread::Builder::new()
-            .name("aof-writer".to_string())
-            .spawn(move || {
-                RuntimeFactoryImpl::block_on_local(
-                    "aof-writer".to_string(),
-                    aof::aof_writer_task(rx, aof_file_path, fsync, aof_token),
+    // Set up AOF writer channel(s) + `AofWriterPool` (step 2f-β: layout-aware).
+    //
+    // Logic:
+    //   1. If `appendonly == "yes"` and an existing on-disk manifest is found,
+    //      verify its shard count matches `--shards` (RFC § 3 refusal — a
+    //      mismatch silently maps shards to the wrong AOF files and is fatal).
+    //   2. If the manifest's layout is `PerShard` AND `num_shards >= 2`,
+    //      spawn one writer per shard (`aof-writer-{N}` threads) and emit a
+    //      `AofWriterPool::per_shard(senders)`.
+    //   3. Otherwise spawn the single legacy writer and emit
+    //      `AofWriterPool::top_level(tx)`. This includes:
+    //        - no manifest yet (fresh install — `initialize()` only writes
+    //          TopLevel today; fresh-install PerShard creation lands later
+    //          in the RFC sequence)
+    //        - existing TopLevel manifest (legacy v1 or single-shard v2)
+    //        - `num_shards == 1` (always TopLevel; per-shard fan-out has no
+    //          meaning when there is one shard)
+    //
+    // A *corrupt* manifest is fatal — `AofManifest::load` returning `Err(_)`
+    // must NOT silently fall back to TopLevel, because the next write would
+    // create a fresh manifest overwriting the reference to the real base RDB
+    // and lose data. This mirrors the replay block at L514–526.
+    //
+    // Note: nothing today constructs a `layout == PerShard` manifest on disk
+    // (initialize() hardcodes TopLevel, migrate_top_level_to_per_shard is not
+    // yet wired into boot). The PerShard branch is reachable only by a
+    // hand-crafted manifest until step 9 lifts the multi-shard gate. Runtime
+    // behavior under default configurations stays byte-identical to step 2f-α.
+    use moon::persistence::aof_manifest::{AofLayout, AofManifest};
+    let existing_manifest: Option<AofManifest> = if config.appendonly == "yes" {
+        let base_dir = PathBuf::from(&config.dir);
+        match AofManifest::load(&base_dir) {
+            Ok(opt) => opt,
+            Err(e) => {
+                eprintln!(
+                    "REFUSING TO START: AOF manifest at {}/appendonlydir/ is corrupt: {}. \
+                     Inspect manually before deleting; overwriting silently loses data.",
+                    base_dir.display(),
+                    e
                 );
-            })
-            .expect("failed to spawn AOF writer thread");
-        info!(
-            "AOF enabled with fsync policy: {:?}",
-            FsyncPolicy::from_str(&config.appendfsync)
-        );
-        Some(tx)
+                std::process::exit(2);
+            }
+        }
+    } else {
+        None
+    };
+    if let Some(ref m) = existing_manifest
+        && let Err(e) = m.verify_shard_count(num_shards as u16)
+    {
+        eprintln!("REFUSING TO START: {e}");
+        std::process::exit(2);
+    }
+
+    let aof_pool: Option<std::sync::Arc<AofWriterPool>> = if config.appendonly == "yes" {
+        let fsync = FsyncPolicy::from_str(&config.appendfsync);
+        let use_per_shard = matches!(
+            existing_manifest.as_ref().map(|m| m.layout),
+            Some(AofLayout::PerShard)
+        ) && num_shards >= 2;
+
+        if use_per_shard {
+            let base_dir = PathBuf::from(&config.dir);
+            let mut senders = Vec::with_capacity(num_shards);
+            for sid in 0..num_shards {
+                let (tx, rx) = channel::mpsc_bounded::<AofMessage>(10_000);
+                let aof_token = cancel_token.child_token();
+                let base_dir = base_dir.clone();
+                let thread_name = format!("aof-writer-{sid}");
+                let thread_name_inner = thread_name.clone();
+                std::thread::Builder::new()
+                    .name(thread_name)
+                    .spawn(move || {
+                        RuntimeFactoryImpl::block_on_local(
+                            thread_name_inner,
+                            aof::per_shard_aof_writer_task(
+                                rx,
+                                base_dir,
+                                sid as u16,
+                                fsync,
+                                aof_token,
+                            ),
+                        );
+                    })
+                    .expect("failed to spawn per-shard AOF writer thread");
+                senders.push(tx);
+            }
+            info!(
+                "AOF enabled (PerShard, {} writers, fsync: {:?})",
+                num_shards, fsync
+            );
+            Some(AofWriterPool::per_shard(senders))
+        } else {
+            let (tx, rx) = channel::mpsc_bounded::<AofMessage>(10_000);
+            let aof_token = cancel_token.child_token();
+            let aof_file_path = PathBuf::from(&config.dir).join(&config.appendfilename);
+            // Legacy single-writer thread. Each shard clones the outer
+            // `aof_pool` Arc; sender lifetime is governed by the pool's Drop.
+            std::thread::Builder::new()
+                .name("aof-writer".to_string())
+                .spawn(move || {
+                    RuntimeFactoryImpl::block_on_local(
+                        "aof-writer".to_string(),
+                        aof::aof_writer_task(rx, aof_file_path, fsync, aof_token),
+                    );
+                })
+                .expect("failed to spawn AOF writer thread");
+            info!("AOF enabled (TopLevel, fsync: {:?})", fsync);
+            Some(AofWriterPool::top_level(tx))
+        }
     } else {
         None
     };
@@ -318,6 +421,23 @@ fn main() -> anyhow::Result<()> {
     // Compute bind address for SO_REUSEPORT per-shard listeners (Linux io_uring path).
     let bind_addr = format!("{}:{}", config.bind, config.port);
 
+    // P0-FIX-01: gate BGREWRITEAOF under the known data-loss config combo
+    // (multi-shard + disk-offload enabled + appendonly).  Verified 2026-05-26:
+    // the rewrite truncates non-rewriter shards' WALs and the consolidated
+    // multi-part AOF base RDB is not consumed on restart, losing ~38 % of
+    // keys.  v2.0 multi-part AOF replay lifts this; until then we refuse the
+    // command at dispatch time.  See docs/runbooks/multi-shard-aof-rewrite.md.
+    if num_shards >= 2 && config.disk_offload_enabled() && config.appendonly == "yes" {
+        moon::command::persistence::MULTI_SHARD_AOF_REWRITE_UNSAFE
+            .store(true, std::sync::atomic::Ordering::Relaxed);
+        tracing::warn!(
+            shards = num_shards,
+            disk_offload = %config.disk_offload,
+            appendonly = %config.appendonly,
+            "BGREWRITEAOF gated for this config (known data-loss path; see docs/runbooks/multi-shard-aof-rewrite.md). Use --shards 1 or --disk-offload disable to re-enable rewrite."
+        );
+    }
+
     // Create watch channel for snapshot triggers (auto-save and BGSAVE)
     let (snapshot_trigger_tx, snapshot_trigger_rx) = moon::runtime::channel::watch(0u64);
 
@@ -644,7 +764,7 @@ fn main() -> anyhow::Result<()> {
         }
         let conn_rx = mesh.take_conn_rx(id);
         let shard_cancel = cancel_token.clone();
-        let shard_aof_tx = aof_tx.clone();
+        let shard_aof_pool = aof_pool.clone();
         let shard_bind_addr = bind_addr.clone();
         let shard_persistence_dir = persistence_dir.clone();
         let shard_snap_rx = snapshot_trigger_rx.clone();
@@ -676,7 +796,7 @@ fn main() -> anyhow::Result<()> {
                             consumers,
                             producers,
                             shard_cancel,
-                            shard_aof_tx,
+                            shard_aof_pool,
                             // Only pass bind_addr for per-shard SO_REUSEPORT when tokio
                             // with io_uring is active. monoio uses central listener MPSC.
                             #[cfg(feature = "runtime-tokio")]
@@ -862,9 +982,11 @@ fn main() -> anyhow::Result<()> {
         });
     }
 
-    // After listener exits, send AOF shutdown and cancel all shards
-    if let Some(ref tx) = aof_tx {
-        let _ = tx.send(AofMessage::Shutdown);
+    // After listener exits, send AOF shutdown to every writer and cancel all shards.
+    // Under TopLevel this is one send; under PerShard (step 2f-β) this fans out to
+    // every per-shard writer thread via `broadcast_shutdown`.
+    if let Some(ref pool) = aof_pool {
+        pool.broadcast_shutdown();
     }
     cancel_token.cancel();
     for handle in shard_handles {
diff --git a/src/persistence/aof.rs b/src/persistence/aof.rs
index a11bb7c5..e97c2910 100644
--- a/src/persistence/aof.rs
+++ b/src/persistence/aof.rs
@@ -57,8 +57,23 @@ impl FsyncPolicy {
 
 /// Messages sent to the AOF writer task via mpsc channel.
 pub enum AofMessage {
-    /// Append serialized RESP command bytes to the AOF file.
-    Append(Bytes),
+    /// Append serialized RESP command bytes to the AOF file, tagged with the
+    /// LSN that was issued for this write (`ReplicationState::issue_lsn`).
+    ///
+    /// `lsn` semantics by writer task:
+    /// - **TopLevel** (`aof_writer_task`): `lsn` is **ignored**; the legacy
+    ///   v1 disk format is plain RESP bytes with no per-entry framing.
+    /// - **PerShard** (`per_shard_aof_writer_task`): `lsn` is **written** as
+    ///   a u64 header per RFC § 2 Rule 1. Disk format per entry:
+    ///   `[u64 lsn LE][u32 len LE][RESP bytes of length len]`.
+    ///   Recovery reads `(lsn, cmd)` pairs and merges cross-shard
+    ///   `OrderedAcrossShards` writes by LSN (RFC § 2 Rule 2).
+    ///
+    /// Construction sites that issue a real LSN call
+    /// `ReplicationState::issue_lsn(shard_id, bytes.len() as u64)` and pass
+    /// the returned value. Sites with no replication state available pass 0
+    /// (TopLevel ignores it; PerShard treats 0 as "no ordering hint").
+    Append { lsn: u64, bytes: Bytes },
     /// Trigger a full AOF rewrite (compaction) using current database state.
     Rewrite(SharedDatabases),
     /// Trigger AOF rewrite in sharded mode (all shards' databases).
@@ -67,6 +82,298 @@ pub enum AofMessage {
     Shutdown,
 }
 
+/// Reasons a pool send may be refused without queueing.
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub enum AofPoolSendError {
+    /// `Rewrite`/`RewriteSharded` sent to a `PerShard` pool. BGREWRITEAOF must
+    /// be issued per shard in the per-shard layout; the legacy single-writer
+    /// rewrite path is not applicable.
+    RewriteUnsupportedInPerShard,
+    /// Underlying channel send failed (writer task dead or channel full).
+    SendFailed,
+}
+
+/// Bundle of per-shard AOF writer senders.
+///
+/// The pool keeps the call-site API uniform regardless of layout:
+/// - **TopLevel** (legacy v1, single-shard, also used for `--shards 1` v2):
+///   exactly one writer thread; every `sender(shard_id)` returns the same
+///   sender so all shards multiplex onto one file.
+/// - **PerShard** (v2 multi-shard): one writer per shard; `sender(shard_id)`
+///   returns the writer that owns `appendonlydir/shard-{shard_id}/`.
+///
+/// Step 2a is additive — this type is defined here but no call site is wired
+/// to it yet. Step 2c performs the type plumbing in `conn_state` and
+/// `conn/core`; steps 2d/2e/2f update the call sites and spawn paths.
+#[derive(Clone)]
+pub struct AofWriterPool {
+    senders: Vec<channel::MpscSender<AofMessage>>,
+    layout: crate::persistence::aof_manifest::AofLayout,
+}
+
+impl AofWriterPool {
+    /// Build a TopLevel pool from a single existing writer sender. Used for
+    /// legacy v1 deployments and `--shards 1` v2 deployments where one writer
+    /// thread services every shard.
+    pub fn top_level(sender: channel::MpscSender<AofMessage>) -> Arc<Self> {
+        Arc::new(Self {
+            senders: vec![sender],
+            layout: crate::persistence::aof_manifest::AofLayout::TopLevel,
+        })
+    }
+
+    /// Build a PerShard pool from N senders. `senders[i]` MUST be the writer
+    /// task that owns `appendonlydir/shard-{i}/`. The vector's length is the
+    /// shard count; passing a length-1 vector here is a bug — use
+    /// [`AofWriterPool::top_level`] instead.
+    pub fn per_shard(senders: Vec<channel::MpscSender<AofMessage>>) -> Arc<Self> {
+        debug_assert!(
+            senders.len() >= 2,
+            "per_shard pool needs >=2 writers; use top_level for single-writer"
+        );
+        Arc::new(Self {
+            senders,
+            layout: crate::persistence::aof_manifest::AofLayout::PerShard,
+        })
+    }
+
+    /// Return the writer sender that owns the given shard's AOF file.
+    ///
+    /// For TopLevel pools, `shard_id` is ignored — all shards multiplex onto
+    /// the single sender. For PerShard pools, `shard_id` MUST be in range
+    /// `[0, num_writers())`; an out-of-range id is a programmer error and
+    /// panics in debug builds.
+    #[inline]
+    pub fn sender(&self, shard_id: usize) -> &channel::MpscSender<AofMessage> {
+        use crate::persistence::aof_manifest::AofLayout;
+        match self.layout {
+            AofLayout::TopLevel => &self.senders[0],
+            AofLayout::PerShard => {
+                debug_assert!(
+                    shard_id < self.senders.len(),
+                    "shard_id {} out of range for per-shard pool of size {}",
+                    shard_id,
+                    self.senders.len()
+                );
+                &self.senders[shard_id]
+            }
+        }
+    }
+
+    /// Fire-and-forget append for the given shard, tagged with the LSN that
+    /// was issued for this write (see [`AofMessage::Append`] docs for LSN
+    /// semantics per layout). Call sites must source `lsn` from
+    /// `ReplicationState::issue_lsn(shard_id, bytes.len() as u64)` for writes
+    /// that participate in replication ordering; sites without a
+    /// replication-state handle pass 0.
+    #[inline]
+    pub fn try_send_append(&self, shard_id: usize, lsn: u64, bytes: Bytes) {
+        let _ = self
+            .sender(shard_id)
+            .try_send(AofMessage::Append { lsn, bytes });
+    }
+
+    /// Issue an LSN for an AOF append at every call site that has the
+    /// `Option<Arc<RwLock<ReplicationState>>>` shape. Wraps
+    /// `ReplicationState::issue_lsn` so handler call sites collapse to a
+    /// single line.
+    ///
+    /// Returns 0 when:
+    /// - `repl_state` is None (test fixtures or shutdown paths)
+    /// - the `RwLock` is poisoned (shouldn't happen in production —
+    ///   ReplicationState is only `write()`-locked under known-safe paths)
+    ///
+    /// 0 is a sentinel meaning "no replication ordering for this write".
+    /// TopLevel writers ignore the LSN entirely so 0 is harmless there;
+    /// PerShard writers treat 0 the same as any other LSN (per-shard order
+    /// is preserved by write order, not by LSN value). The LSN only matters
+    /// for the cross-shard `OrderedAcrossShards` merge in RFC step 5.
+    #[inline]
+    pub fn issue_append_lsn(
+        repl_state: &Option<Arc<std::sync::RwLock<crate::replication::state::ReplicationState>>>,
+        shard_id: usize,
+        delta: usize,
+    ) -> u64 {
+        repl_state
+            .as_ref()
+            .and_then(|rs| {
+                rs.read()
+                    .ok()
+                    .map(|g| g.issue_lsn(shard_id, delta as u64))
+            })
+            .unwrap_or(0)
+    }
+
+    /// Submit a Rewrite/RewriteSharded message. Only legal for TopLevel pools;
+    /// PerShard rewrites are per-shard operations and must be initiated by
+    /// the BGREWRITEAOF code path in step 6, not via this enum variant.
+    pub fn try_send_rewrite(&self, msg: AofMessage) -> Result<(), AofPoolSendError> {
+        use crate::persistence::aof_manifest::AofLayout;
+        debug_assert!(
+            matches!(msg, AofMessage::Rewrite(_) | AofMessage::RewriteSharded(_)),
+            "try_send_rewrite called with a non-Rewrite variant",
+        );
+        if self.layout == AofLayout::PerShard {
+            return Err(AofPoolSendError::RewriteUnsupportedInPerShard);
+        }
+        self.senders[0]
+            .try_send(msg)
+            .map_err(|_| AofPoolSendError::SendFailed)
+    }
+
+    /// Broadcast `Shutdown` to every writer. Used by orchestrated shutdown
+    /// paths in `main.rs`/`embedded.rs`. Each writer drains its channel and
+    /// fsyncs before exiting.
+    pub fn broadcast_shutdown(&self) {
+        for s in &self.senders {
+            let _ = s.try_send(AofMessage::Shutdown);
+        }
+    }
+
+    /// Number of underlying writer senders. 1 for TopLevel, num_shards for
+    /// PerShard.
+    #[inline]
+    pub fn num_writers(&self) -> usize {
+        self.senders.len()
+    }
+
+    /// Reports the pool's layout. Useful for places that need to refuse
+    /// PerShard-incompatible legacy code paths with a clear error.
+    #[inline]
+    pub fn layout(&self) -> crate::persistence::aof_manifest::AofLayout {
+        self.layout
+    }
+}
+
+#[cfg(test)]
+mod pool_tests {
+    use super::*;
+    use crate::persistence::aof_manifest::AofLayout;
+    use crate::runtime::channel;
+
+    #[test]
+    fn top_level_pool_routes_all_shards_to_writer_zero() {
+        let (tx, rx) = channel::mpsc_bounded::<AofMessage>(8);
+        let pool = AofWriterPool::top_level(tx);
+        assert_eq!(pool.num_writers(), 1);
+        assert_eq!(pool.layout(), AofLayout::TopLevel);
+
+        pool.try_send_append(0, 0, Bytes::from_static(b"a"));
+        pool.try_send_append(7, 0, Bytes::from_static(b"b"));
+        pool.try_send_append(42, 0, Bytes::from_static(b"c"));
+
+        let mut seen = 0;
+        while rx.try_recv().is_ok() {
+            seen += 1;
+        }
+        assert_eq!(seen, 3, "all 3 appends should land on writer 0");
+    }
+
+    #[test]
+    fn per_shard_pool_routes_each_shard_to_its_own_writer() {
+        let (tx0, rx0) = channel::mpsc_bounded::<AofMessage>(8);
+        let (tx1, rx1) = channel::mpsc_bounded::<AofMessage>(8);
+        let (tx2, rx2) = channel::mpsc_bounded::<AofMessage>(8);
+        let pool = AofWriterPool::per_shard(vec![tx0, tx1, tx2]);
+        assert_eq!(pool.num_writers(), 3);
+        assert_eq!(pool.layout(), AofLayout::PerShard);
+
+        pool.try_send_append(0, 100, Bytes::from_static(b"shard0"));
+        pool.try_send_append(1, 200, Bytes::from_static(b"shard1a"));
+        pool.try_send_append(1, 300, Bytes::from_static(b"shard1b"));
+        pool.try_send_append(2, 400, Bytes::from_static(b"shard2"));
+
+        let count = |rx: &channel::MpscReceiver<AofMessage>| -> usize {
+            let mut n = 0;
+            while rx.try_recv().is_ok() {
+                n += 1;
+            }
+            n
+        };
+        assert_eq!(count(&rx0), 1, "shard 0 writer should receive exactly 1");
+        assert_eq!(count(&rx1), 2, "shard 1 writer should receive exactly 2");
+        assert_eq!(count(&rx2), 1, "shard 2 writer should receive exactly 1");
+    }
+
+    #[test]
+    fn per_shard_pool_rejects_rewrite_with_explicit_error() {
+        let (tx0, _rx0) = channel::mpsc_bounded::<AofMessage>(8);
+        let (tx1, _rx1) = channel::mpsc_bounded::<AofMessage>(8);
+        let pool = AofWriterPool::per_shard(vec![tx0, tx1]);
+
+        let dummies: SharedDatabases = Arc::new(vec![]);
+        let err = pool.try_send_rewrite(AofMessage::Rewrite(dummies)).unwrap_err();
+        assert_eq!(err, AofPoolSendError::RewriteUnsupportedInPerShard);
+    }
+
+    #[test]
+    fn top_level_pool_accepts_rewrite() {
+        let (tx, rx) = channel::mpsc_bounded::<AofMessage>(8);
+        let pool = AofWriterPool::top_level(tx);
+
+        let dummies: SharedDatabases = Arc::new(vec![]);
+        pool.try_send_rewrite(AofMessage::Rewrite(dummies)).unwrap();
+        assert!(matches!(rx.try_recv(), Ok(AofMessage::Rewrite(_))));
+    }
+
+    #[test]
+    fn per_shard_pool_threads_lsn_field_to_each_writer() {
+        // Step 3 wire-format contract: try_send_append carries the issued LSN
+        // through to the writer task, which writes it as the per-entry header
+        // under PerShard layout. This unit test pins the channel-side contract
+        // (the disk-side framing is covered by writer-task integration).
+        let (tx0, rx0) = channel::mpsc_bounded::<AofMessage>(4);
+        let (tx1, rx1) = channel::mpsc_bounded::<AofMessage>(4);
+        let pool = AofWriterPool::per_shard(vec![tx0, tx1]);
+
+        pool.try_send_append(0, 42, Bytes::from_static(b"set foo 1"));
+        pool.try_send_append(1, 43, Bytes::from_static(b"set bar 2"));
+        pool.try_send_append(0, 44, Bytes::from_static(b"del foo"));
+
+        // Shard 0 should see (42, "set foo 1") then (44, "del foo").
+        match rx0.try_recv() {
+            Ok(AofMessage::Append { lsn, bytes }) => {
+                assert_eq!(lsn, 42, "shard 0 first entry lsn");
+                assert_eq!(bytes.as_ref(), b"set foo 1");
+            }
+            other => panic!("shard 0 first recv expected Append, got {:?}", other.is_ok()),
+        }
+        match rx0.try_recv() {
+            Ok(AofMessage::Append { lsn, bytes }) => {
+                assert_eq!(lsn, 44, "shard 0 second entry lsn");
+                assert_eq!(bytes.as_ref(), b"del foo");
+            }
+            other => panic!("shard 0 second recv expected Append, got {:?}", other.is_ok()),
+        }
+        // Shard 1 should see (43, "set bar 2") only.
+        match rx1.try_recv() {
+            Ok(AofMessage::Append { lsn, bytes }) => {
+                assert_eq!(lsn, 43, "shard 1 entry lsn");
+                assert_eq!(bytes.as_ref(), b"set bar 2");
+            }
+            other => panic!("shard 1 recv expected Append, got {:?}", other.is_ok()),
+        }
+    }
+
+    #[test]
+    fn broadcast_shutdown_reaches_every_writer() {
+        let (tx0, rx0) = channel::mpsc_bounded::<AofMessage>(2);
+        let (tx1, rx1) = channel::mpsc_bounded::<AofMessage>(2);
+        let (tx2, rx2) = channel::mpsc_bounded::<AofMessage>(2);
+        let pool = AofWriterPool::per_shard(vec![tx0, tx1, tx2]);
+
+        pool.broadcast_shutdown();
+
+        for (i, rx) in [&rx0, &rx1, &rx2].iter().enumerate() {
+            assert!(
+                matches!(rx.try_recv(), Ok(AofMessage::Shutdown)),
+                "writer {} did not receive Shutdown",
+                i
+            );
+        }
+    }
+}
+
 /// Serialize a Frame into RESP wire format bytes.
 pub fn serialize_command(frame: &Frame) -> Bytes {
     let mut buf = BytesMut::with_capacity(64);
@@ -194,7 +501,10 @@ pub async fn aof_writer_task(
 
         loop {
             match rx.recv() {
-                Ok(AofMessage::Append(data)) => {
+                // TopLevel writer: legacy v1 disk format is plain RESP. The
+                // LSN is ignored — TopLevel is single-shard so per-shard merge
+                // by LSN is moot.
+                Ok(AofMessage::Append { bytes: data, lsn: _ }) => {
                     if write_error {
                         continue; // Drop appends after persistent I/O failure
                     }
@@ -287,7 +597,8 @@ pub async fn aof_writer_task(
         tokio::select! {
             msg = rx.recv_async() => {
                 match msg {
-                    Ok(AofMessage::Append(data)) => {
+                    // TopLevel writer (tokio): legacy v1 plain RESP, lsn ignored.
+                    Ok(AofMessage::Append { bytes: data, lsn: _ }) => {
                         if let Err(e) = writer.write_all(&data).await {
                             error!("AOF write error: {}", e);
                             continue;
@@ -369,6 +680,376 @@ pub async fn aof_writer_task(
     }
 }
 
+/// Background per-shard AOF writer task (Option B step 2b).
+///
+/// One instance is spawned per shard in `PerShard` layout. Each instance owns
+/// `appendonlydir/shard-{shard_id}/moon.aof.{seq}.incr.aof` exclusively — no
+/// other writer touches that file, so there is no per-file locking.
+///
+/// Differences from [`aof_writer_task`] (TopLevel):
+/// - Opens `manifest.shard_incr_path(shard_id)` instead of `manifest.incr_path()`.
+/// - `Rewrite`/`RewriteSharded` variants are rejected (logged + dropped).
+///   The legacy single-writer rewrite enum has no meaning when each shard
+///   owns its own files; per-shard BGREWRITEAOF lands in RFC step 6.
+/// - Refuses to start if the loaded manifest's layout is `TopLevel` — the
+///   spawn site (step 2f) must only invoke this task body for `PerShard`
+///   layouts. Mismatch is a programmer error.
+///
+/// Wait/timeout/corruption semantics for manifest loading match the existing
+/// `aof_writer_task` (60s bounded wait, hard fail on corrupt manifest).
+pub async fn per_shard_aof_writer_task(
+    rx: channel::MpscReceiver<AofMessage>,
+    base_dir: PathBuf,
+    shard_id: u16,
+    fsync: FsyncPolicy,
+    cancel: CancellationToken,
+) {
+    #[cfg(feature = "runtime-tokio")]
+    {
+        use crate::persistence::aof_manifest::{AofLayout, AofManifest};
+        use tokio::io::AsyncWriteExt;
+
+        // Wait for main.rs recovery to create/load the manifest.
+        let manifest_wait_start = Instant::now();
+        const MANIFEST_TIMEOUT: std::time::Duration = std::time::Duration::from_secs(60);
+        let manifest = loop {
+            if cancel.is_cancelled() {
+                info!(
+                    "AOF writer shard {}: cancelled while waiting for manifest",
+                    shard_id
+                );
+                return;
+            }
+            if manifest_wait_start.elapsed() > MANIFEST_TIMEOUT {
+                error!(
+                    "AOF writer shard {}: manifest not found at {} after {:?}. Writer exiting.",
+                    shard_id,
+                    base_dir.display(),
+                    MANIFEST_TIMEOUT,
+                );
+                return;
+            }
+            match AofManifest::load(&base_dir) {
+                Ok(Some(m)) => break m,
+                Ok(None) => {
+                    tokio::time::sleep(std::time::Duration::from_millis(50)).await;
+                }
+                Err(e) => {
+                    error!(
+                        "AOF writer shard {}: manifest corrupt at {}: {}. Persistence disabled.",
+                        shard_id,
+                        base_dir.display(),
+                        e
+                    );
+                    return;
+                }
+            }
+        };
+
+        if manifest.layout != AofLayout::PerShard {
+            error!(
+                "AOF writer shard {}: layout is {:?}, expected PerShard. \
+                 per_shard_aof_writer_task should only be spawned for PerShard layouts. \
+                 Writer exiting.",
+                shard_id, manifest.layout
+            );
+            return;
+        }
+        if (shard_id as usize) >= manifest.shards.len() {
+            error!(
+                "AOF writer shard {}: out of range for manifest with {} shards. Writer exiting.",
+                shard_id,
+                manifest.shards.len()
+            );
+            return;
+        }
+
+        let incr_path = manifest.shard_incr_path(shard_id);
+        // Ensure shard-{N}/ exists. The manifest constructor for PerShard
+        // already creates these, but be defensive — a manual deletion or
+        // a manifest written by an older binary could leave them missing.
+        if let Some(parent) = incr_path.parent() {
+            if let Err(e) = tokio::fs::create_dir_all(parent).await {
+                error!(
+                    "AOF writer shard {}: failed to create dir {}: {}",
+                    shard_id,
+                    parent.display(),
+                    e
+                );
+                return;
+            }
+        }
+        let file: tokio::fs::File = match tokio::fs::OpenOptions::new()
+            .create(true)
+            .append(true)
+            .open(&incr_path)
+            .await
+        {
+            Ok(f) => f,
+            Err(e) => {
+                error!(
+                    "AOF writer shard {}: failed to open incr {}: {}",
+                    shard_id,
+                    incr_path.display(),
+                    e
+                );
+                return;
+            }
+        };
+        info!(
+            "AOF writer shard {}: seq {}, incr={}",
+            shard_id,
+            manifest.seq,
+            incr_path.display()
+        );
+
+        let mut writer = tokio::io::BufWriter::new(file);
+        let mut last_fsync = Instant::now();
+        let mut interval = tokio::time::interval(std::time::Duration::from_secs(1));
+        interval.tick().await;
+
+        loop {
+            tokio::select! {
+                msg = rx.recv_async() => {
+                    match msg {
+                        // PerShard writer (tokio): per RFC § 2 Rule 1 the on-disk
+                        // format is `[u64 lsn LE][u32 len LE][RESP bytes]`. Header
+                        // is written sequentially with the body — both calls land
+                        // in the same BufWriter so this is one syscall under load.
+                        Ok(AofMessage::Append { lsn, bytes: data }) => {
+                            let mut header = [0u8; 12];
+                            header[..8].copy_from_slice(&lsn.to_le_bytes());
+                            header[8..].copy_from_slice(&(data.len() as u32).to_le_bytes());
+                            if let Err(e) = writer.write_all(&header).await {
+                                error!("AOF header write error shard {}: {}", shard_id, e);
+                                continue;
+                            }
+                            if let Err(e) = writer.write_all(&data).await {
+                                error!("AOF write error shard {}: {}", shard_id, e);
+                                continue;
+                            }
+                            if matches!(fsync, FsyncPolicy::Always) {
+                                let _ = writer.flush().await;
+                                let _ = writer.get_ref().sync_data().await;
+                            }
+                        }
+                        Ok(AofMessage::Rewrite(_)) | Ok(AofMessage::RewriteSharded(_)) => {
+                            warn!(
+                                "AOF writer shard {}: received Rewrite/RewriteSharded — \
+                                 not supported in PerShard layout, dropped. \
+                                 Per-shard BGREWRITEAOF lands in RFC step 6.",
+                                shard_id
+                            );
+                        }
+                        Ok(AofMessage::Shutdown) | Err(_) => {
+                            let _ = writer.flush().await;
+                            let _ = writer.get_ref().sync_data().await;
+                            info!("AOF writer shard {} shutting down", shard_id);
+                            break;
+                        }
+                    }
+                }
+                _ = interval.tick(), if fsync == FsyncPolicy::EverySec => {
+                    if last_fsync.elapsed() >= std::time::Duration::from_secs(1) {
+                        let _ = writer.flush().await;
+                        let _ = writer.get_ref().sync_data().await;
+                        last_fsync = Instant::now();
+                    }
+                }
+                _ = cancel.cancelled() => {
+                    let _ = writer.flush().await;
+                    let _ = writer.get_ref().sync_data().await;
+                    info!("AOF writer shard {} cancelled", shard_id);
+                    break;
+                }
+            }
+        }
+    }
+
+    #[cfg(feature = "runtime-monoio")]
+    {
+        use crate::persistence::aof_manifest::{AofLayout, AofManifest};
+        use std::io::Write;
+
+        let manifest_wait_start = Instant::now();
+        const MANIFEST_TIMEOUT: std::time::Duration = std::time::Duration::from_secs(60);
+        let manifest = loop {
+            if cancel.is_cancelled() {
+                info!(
+                    "AOF writer shard {}: cancelled while waiting for manifest",
+                    shard_id
+                );
+                return;
+            }
+            if manifest_wait_start.elapsed() > MANIFEST_TIMEOUT {
+                error!(
+                    "AOF writer shard {}: manifest not found at {} after {:?}. Writer exiting.",
+                    shard_id,
+                    base_dir.display(),
+                    MANIFEST_TIMEOUT,
+                );
+                return;
+            }
+            match AofManifest::load(&base_dir) {
+                Ok(Some(m)) => break m,
+                Ok(None) => {
+                    std::thread::sleep(std::time::Duration::from_millis(50));
+                }
+                Err(e) => {
+                    error!(
+                        "AOF writer shard {}: manifest corrupt at {}: {}. Persistence disabled.",
+                        shard_id,
+                        base_dir.display(),
+                        e
+                    );
+                    return;
+                }
+            }
+        };
+
+        if manifest.layout != AofLayout::PerShard {
+            error!(
+                "AOF writer shard {}: layout is {:?}, expected PerShard. Writer exiting.",
+                shard_id, manifest.layout
+            );
+            return;
+        }
+        if (shard_id as usize) >= manifest.shards.len() {
+            error!(
+                "AOF writer shard {}: out of range for manifest with {} shards. Writer exiting.",
+                shard_id,
+                manifest.shards.len()
+            );
+            return;
+        }
+
+        let incr_path = manifest.shard_incr_path(shard_id);
+        if let Some(parent) = incr_path.parent() {
+            if let Err(e) = std::fs::create_dir_all(parent) {
+                error!(
+                    "AOF writer shard {}: failed to create dir {}: {}",
+                    shard_id,
+                    parent.display(),
+                    e
+                );
+                return;
+            }
+        }
+        let mut file = match std::fs::OpenOptions::new()
+            .create(true)
+            .append(true)
+            .open(&incr_path)
+        {
+            Ok(f) => f,
+            Err(e) => {
+                error!(
+                    "AOF writer shard {}: failed to open incr {}: {}",
+                    shard_id,
+                    incr_path.display(),
+                    e
+                );
+                return;
+            }
+        };
+        info!(
+            "AOF writer shard {}: seq {}, incr={}",
+            shard_id,
+            manifest.seq,
+            incr_path.display()
+        );
+
+        let mut last_fsync = Instant::now();
+        let mut write_error = false;
+
+        loop {
+            match rx.recv() {
+                // PerShard writer (monoio): framed `[u64 lsn LE][u32 len LE][RESP]`.
+                // See the tokio twin above for format rationale.
+                Ok(AofMessage::Append { lsn, bytes: data }) => {
+                    if write_error {
+                        continue;
+                    }
+                    let mut header = [0u8; 12];
+                    header[..8].copy_from_slice(&lsn.to_le_bytes());
+                    header[8..].copy_from_slice(&(data.len() as u32).to_le_bytes());
+                    if let Err(e) = file.write_all(&header) {
+                        error!(
+                            "AOF header write failed shard {} (seq {}): {}. Persistence degraded.",
+                            shard_id, manifest.seq, e
+                        );
+                        write_error = true;
+                        continue;
+                    }
+                    if let Err(e) = file.write_all(&data) {
+                        error!(
+                            "AOF write failed shard {} (seq {}): {}. Persistence degraded.",
+                            shard_id, manifest.seq, e
+                        );
+                        write_error = true;
+                        continue;
+                    }
+                    match fsync {
+                        FsyncPolicy::Always => {
+                            let t = Instant::now();
+                            if let Err(e) = file.flush().and_then(|_| file.sync_data()) {
+                                error!(
+                                    "AOF sync failed shard {} (seq {}, always): {}",
+                                    shard_id, manifest.seq, e
+                                );
+                                write_error = true;
+                            } else {
+                                crate::admin::metrics_setup::record_aof_fsync(
+                                    t.elapsed().as_micros() as u64,
+                                );
+                            }
+                        }
+                        FsyncPolicy::EverySec => {
+                            if last_fsync.elapsed() >= std::time::Duration::from_secs(1) {
+                                let t = Instant::now();
+                                if let Err(e) = file.flush().and_then(|_| file.sync_data()) {
+                                    error!(
+                                        "AOF sync failed shard {} (seq {}, everysec): {}",
+                                        shard_id, manifest.seq, e
+                                    );
+                                } else {
+                                    crate::admin::metrics_setup::record_aof_fsync(
+                                        t.elapsed().as_micros() as u64,
+                                    );
+                                    last_fsync = Instant::now();
+                                }
+                            }
+                        }
+                        FsyncPolicy::No => {}
+                    }
+                }
+                Ok(AofMessage::Rewrite(_)) | Ok(AofMessage::RewriteSharded(_)) => {
+                    warn!(
+                        "AOF writer shard {}: received Rewrite/RewriteSharded — \
+                         not supported in PerShard layout, dropped. \
+                         Per-shard BGREWRITEAOF lands in RFC step 6.",
+                        shard_id
+                    );
+                }
+                Ok(AofMessage::Shutdown) | Err(_) => {
+                    if !write_error {
+                        if let Err(e) = file.flush().and_then(|_| file.sync_data()) {
+                            error!(
+                                "AOF final sync failed shard {} (seq {}): {}",
+                                shard_id, manifest.seq, e
+                            );
+                        }
+                    }
+                    info!(
+                        "AOF writer shard {} shutting down (monoio, seq {})",
+                        shard_id, manifest.seq
+                    );
+                    break;
+                }
+            }
+        }
+    }
+}
+
 /// Replay an AOF file by parsing RESP commands and dispatching them.
 ///
 /// Returns the number of commands successfully replayed.
@@ -789,7 +1470,9 @@ fn drain_pending_appends(
     let mut outcome = DrainOutcome::default();
     while let Ok(msg) = rx.try_recv() {
         match msg {
-            AofMessage::Append(data) => {
+            // BGREWRITEAOF drain runs on the TopLevel writer (monoio) only;
+            // PerShard rewrite is RFC step 6. Legacy v1 disk format → ignore lsn.
+            AofMessage::Append { bytes: data, lsn: _ } => {
                 file.write_all(&data).map_err(|e| AofError::Io {
                     path: PathBuf::from("<aof incr drain>"),
                     source: e,
diff --git a/src/persistence/aof_manifest.rs b/src/persistence/aof_manifest.rs
index ce1efb3b..c6109d87 100644
--- a/src/persistence/aof_manifest.rs
+++ b/src/persistence/aof_manifest.rs
@@ -5,17 +5,34 @@
 //! manifest framing is the canonical on-disk marker; the human-readable
 //! "v1" umbrella also covers WAL v3 and RDB v2 sub-formats.
 //!
-//! Implements the same directory-based AOF format as Redis 7+:
+//! Two on-disk layouts coexist (selected at manifest creation time, never mixed
+//! within one directory):
+//!
+//! **TopLevel (manifest v1, single-shard / legacy):**
 //! ```text
 //! appendonlydir/
 //!   moon.aof.1.base.rdb     # RDB snapshot base
 //!   moon.aof.1.incr.aof     # Incremental RESP since base
-//!   moon.aof.manifest        # This file
+//!   moon.aof.manifest       # v1 text format
+//! ```
+//!
+//! **PerShard (manifest v2, multi-shard durability):**
+//! ```text
+//! appendonlydir/
+//!   moon.aof.manifest       # v2 text format (carries shard count + max_lsn)
+//!   shard-0/
+//!     moon.aof.1.base.rdb
+//!     moon.aof.1.incr.aof
+//!   shard-1/
+//!     moon.aof.1.base.rdb
+//!     moon.aof.1.incr.aof
+//!   …
 //! ```
 //!
-//! The manifest is a simple text file listing the active base and incremental
-//! files with their sequence numbers. On BGREWRITEAOF, the sequence increments,
-//! a new base + incr pair is created, and old files are deleted.
+//! The manifest text format is line-prefix based. v1 manifests have no
+//! `version` line; v2 manifests begin with `version 2`. On BGREWRITEAOF the
+//! sequence increments, a new base + incr pair is created per shard (PerShard)
+//! or at top level (TopLevel), and old files are deleted.
 
 use std::io::Write;
 use std::path::{Path, PathBuf};
@@ -25,13 +42,44 @@ use tracing::{error, info, warn};
 const MANIFEST_NAME: &str = "moon.aof.manifest";
 const AOF_DIR_NAME: &str = "appendonlydir";
 
+/// On-disk layout discriminator.
+///
+/// `TopLevel` is the legacy single-shard layout from manifest v1. `PerShard`
+/// is the multi-shard layout introduced with manifest v2 — used whenever
+/// `num_shards >= 2`. A `--shards 1` deployment with an existing v1 manifest
+/// stays TopLevel until explicitly migrated.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub enum AofLayout {
+    /// Legacy single-shard layout: `appendonlydir/moon.aof.{seq}.{base|incr}.*`.
+    TopLevel,
+    /// Per-shard layout: `appendonlydir/shard-{N}/moon.aof.{seq}.{base|incr}.*`.
+    PerShard,
+}
+
+/// Per-shard manifest entry. One per shard in `PerShard` layout.
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct ShardManifest {
+    /// Shard ID (0..num_shards).
+    pub shard_id: u16,
+    /// Max LSN persisted to this shard's incr file so far. Semantics defined
+    /// by step 3 (LSN tagging) of the per-shard AOF RFC — until then this is
+    /// 0 and recovery does not use it. Once step 3 ships, recovery seeds
+    /// `master_repl_offset = max(shards[*].max_lsn)` before accepting writes.
+    pub max_lsn: u64,
+}
+
 /// Active AOF file set tracked by the manifest.
 #[derive(Debug, Clone)]
 pub struct AofManifest {
     /// Base directory (parent of `appendonlydir/`)
     pub dir: PathBuf,
-    /// Current sequence number (incremented on each rewrite)
+    /// Current sequence number (incremented on each rewrite).
     pub seq: u64,
+    /// On-disk layout. Determines path computation for base/incr files.
+    pub layout: AofLayout,
+    /// Per-shard metadata. Length is 1 for `TopLevel`, `num_shards` for
+    /// `PerShard`. Indexed by `shard_id`.
+    pub shards: Vec<ShardManifest>,
 }
 
 impl AofManifest {
@@ -46,25 +94,76 @@ impl AofManifest {
     }
 
     /// Path to the base RDB file for the current sequence.
+    ///
+    /// Layout-aware: TopLevel returns `appendonlydir/moon.aof.{seq}.base.rdb`;
+    /// PerShard routes to `appendonlydir/shard-0/moon.aof.{seq}.base.rdb`.
+    /// This single-file helper is meaningful only when there is one shard
+    /// (post-migration `--shards 1`); a multi-shard PerShard manifest has N
+    /// base files and the caller must use [`Self::shard_base_path`] instead.
+    /// In debug builds, calling this on a multi-shard PerShard manifest
+    /// asserts; in release it returns the shard-0 path so production stays
+    /// recoverable rather than panicking on a stale call site.
     pub fn base_path(&self) -> PathBuf {
-        self.aof_dir()
-            .join(format!("moon.aof.{}.base.rdb", self.seq))
+        match self.layout {
+            AofLayout::TopLevel => self
+                .aof_dir()
+                .join(format!("moon.aof.{}.base.rdb", self.seq)),
+            AofLayout::PerShard => {
+                debug_assert!(
+                    self.shards.len() == 1,
+                    "base_path() called on multi-shard PerShard manifest; use shard_base_path(shard_id)",
+                );
+                self.shard_base_path_seq(0, self.seq)
+            }
+        }
     }
 
     /// Path to the incremental RESP file for the current sequence.
+    ///
+    /// Layout-aware — see [`Self::base_path`] for the same routing rules.
     pub fn incr_path(&self) -> PathBuf {
-        self.aof_dir()
-            .join(format!("moon.aof.{}.incr.aof", self.seq))
+        match self.layout {
+            AofLayout::TopLevel => self
+                .aof_dir()
+                .join(format!("moon.aof.{}.incr.aof", self.seq)),
+            AofLayout::PerShard => {
+                debug_assert!(
+                    self.shards.len() == 1,
+                    "incr_path() called on multi-shard PerShard manifest; use shard_incr_path(shard_id)",
+                );
+                self.shard_incr_path_seq(0, self.seq)
+            }
+        }
     }
 
-    /// Path to the base RDB file for a given sequence.
+    /// Path to the base RDB file for a given sequence. Layout-aware — see
+    /// [`Self::base_path`].
     pub fn base_path_seq(&self, seq: u64) -> PathBuf {
-        self.aof_dir().join(format!("moon.aof.{}.base.rdb", seq))
+        match self.layout {
+            AofLayout::TopLevel => self.aof_dir().join(format!("moon.aof.{}.base.rdb", seq)),
+            AofLayout::PerShard => {
+                debug_assert!(
+                    self.shards.len() == 1,
+                    "base_path_seq() called on multi-shard PerShard manifest; use shard_base_path_seq(shard_id, seq)",
+                );
+                self.shard_base_path_seq(0, seq)
+            }
+        }
     }
 
-    /// Path to the incremental RESP file for a given sequence.
+    /// Path to the incremental RESP file for a given sequence. Layout-aware —
+    /// see [`Self::base_path`].
     pub fn incr_path_seq(&self, seq: u64) -> PathBuf {
-        self.aof_dir().join(format!("moon.aof.{}.incr.aof", seq))
+        match self.layout {
+            AofLayout::TopLevel => self.aof_dir().join(format!("moon.aof.{}.incr.aof", seq)),
+            AofLayout::PerShard => {
+                debug_assert!(
+                    self.shards.len() == 1,
+                    "incr_path_seq() called on multi-shard PerShard manifest; use shard_incr_path_seq(shard_id, seq)",
+                );
+                self.shard_incr_path_seq(0, seq)
+            }
+        }
     }
 
     /// Create the `appendonlydir/` and write the initial manifest.
@@ -82,6 +181,11 @@ impl AofManifest {
         let manifest = Self {
             dir: dir.to_path_buf(),
             seq: 1,
+            layout: AofLayout::TopLevel,
+            shards: vec![ShardManifest {
+                shard_id: 0,
+                max_lsn: 0,
+            }],
         };
         std::fs::create_dir_all(manifest.aof_dir())?;
 
@@ -119,6 +223,11 @@ impl AofManifest {
         let manifest = Self {
             dir: dir.to_path_buf(),
             seq: 1,
+            layout: AofLayout::TopLevel,
+            shards: vec![ShardManifest {
+                shard_id: 0,
+                max_lsn: 0,
+            }],
         };
         std::fs::create_dir_all(manifest.aof_dir())?;
 
@@ -157,6 +266,53 @@ impl AofManifest {
 
         let content = std::fs::read_to_string(&manifest_path)?;
 
+        // Detect format version. v1 manifests have no `version` line and use
+        // line prefixes `seq`/`base`/`incr`. v2 manifests start with `version 2`
+        // and carry per-shard records.
+        let mut format_version: u8 = 1;
+        for line in content.lines() {
+            let line = line.trim();
+            if let Some(val) = line.strip_prefix("version ") {
+                if let Ok(v) = val.parse::<u8>() {
+                    format_version = v;
+                }
+                break;
+            }
+            if !line.is_empty() {
+                // First non-blank line is not a version header → v1.
+                break;
+            }
+        }
+
+        let manifest = match format_version {
+            1 => Self::parse_v1(&content, dir, &manifest_path)?,
+            2 => Self::parse_v2(&content, dir, &manifest_path)?,
+            other => {
+                return Err(std::io::Error::new(
+                    std::io::ErrorKind::InvalidData,
+                    format!(
+                        "AOF manifest at {} has unsupported format version {} (max supported: 2)",
+                        manifest_path.display(),
+                        other,
+                    ),
+                ));
+            }
+        };
+
+        // Best-effort orphan cleanup: delete stray base/incr files from aborted
+        // rewrites. A crash between advance() steps 1-3 leaves a new base RDB on
+        // disk that the active manifest never references. Without this sweep,
+        // repeated crashes during rewrite can fill the disk with zombie files.
+        //
+        // Safe to call here: parse_* verified the manifest has all required
+        // records, so cleanup_orphans won't delete the active files.
+        manifest.cleanup_orphans();
+
+        Ok(Some(manifest))
+    }
+
+    /// Parse a v1 (TopLevel, single-shard) manifest.
+    fn parse_v1(content: &str, dir: &Path, manifest_path: &Path) -> std::io::Result<Self> {
         let mut seq = 0u64;
         let mut has_base_record = false;
         let mut has_incr_record = false;
@@ -183,10 +339,6 @@ impl AofManifest {
             ));
         }
 
-        // A valid manifest must have all three records (seq, base, incr).
-        // A truncated manifest with only "seq N" but no base/incr lines could
-        // trigger orphan cleanup that deletes the real base RDB referenced by
-        // the previous valid manifest. Require all records before proceeding.
         if !has_base_record || !has_incr_record {
             return Err(std::io::Error::new(
                 std::io::ErrorKind::InvalidData,
@@ -200,21 +352,182 @@ impl AofManifest {
             ));
         }
 
-        let manifest = Self {
+        Ok(Self {
             dir: dir.to_path_buf(),
             seq,
-        };
+            layout: AofLayout::TopLevel,
+            shards: vec![ShardManifest {
+                shard_id: 0,
+                max_lsn: 0,
+            }],
+        })
+    }
 
-        // Best-effort orphan cleanup: delete stray base/incr files from aborted
-        // rewrites. A crash between advance() steps 1-3 leaves a new base RDB on
-        // disk that the active manifest never references. Without this sweep,
-        // repeated crashes during rewrite can fill the disk with zombie files.
-        //
-        // Safe to call here: we verified the manifest has all three records
-        // (seq, base, incr), so cleanup_orphans won't delete the active files.
-        manifest.cleanup_orphans();
+    /// Parse a v2 (PerShard, multi-shard) manifest.
+    ///
+    /// Expected line format:
+    /// ```text
+    /// version 2
+    /// seq N
+    /// shards K
+    /// shard 0 max_lsn LSN0
+    /// shard 1 max_lsn LSN1
+    /// ...
+    /// ```
+    ///
+    /// Per-shard `base`/`incr` paths are derived from `shard-{N}/moon.aof.{seq}.*`
+    /// rather than stored explicitly — the layout is canonical, so storing
+    /// paths invites drift between the stored value and the computed one.
+    fn parse_v2(content: &str, dir: &Path, manifest_path: &Path) -> std::io::Result<Self> {
+        let mut seq = 0u64;
+        let mut num_shards: Option<u16> = None;
+        let mut shards: Vec<ShardManifest> = Vec::new();
 
-        Ok(Some(manifest))
+        for line in content.lines() {
+            let line = line.trim();
+            if line.is_empty() || line.starts_with('#') {
+                continue;
+            }
+            if line == "version 2" {
+                continue;
+            } else if let Some(val) = line.strip_prefix("seq ") {
+                seq = val.parse::<u64>().map_err(|e| {
+                    std::io::Error::new(
+                        std::io::ErrorKind::InvalidData,
+                        format!(
+                            "AOF manifest at {} has invalid seq line `{}`: {}",
+                            manifest_path.display(),
+                            line,
+                            e,
+                        ),
+                    )
+                })?;
+            } else if let Some(val) = line.strip_prefix("shards ") {
+                num_shards = Some(val.parse::<u16>().map_err(|e| {
+                    std::io::Error::new(
+                        std::io::ErrorKind::InvalidData,
+                        format!(
+                            "AOF manifest at {} has invalid shards line `{}`: {}",
+                            manifest_path.display(),
+                            line,
+                            e,
+                        ),
+                    )
+                })?);
+            } else if let Some(rest) = line.strip_prefix("shard ") {
+                // Format: `shard <id> max_lsn <lsn>`
+                let mut it = rest.split_whitespace();
+                let id_str = it.next().ok_or_else(|| {
+                    std::io::Error::new(
+                        std::io::ErrorKind::InvalidData,
+                        format!(
+                            "AOF manifest at {} has shard line missing id: `{}`",
+                            manifest_path.display(),
+                            line,
+                        ),
+                    )
+                })?;
+                let id: u16 = id_str.parse().map_err(|e| {
+                    std::io::Error::new(
+                        std::io::ErrorKind::InvalidData,
+                        format!(
+                            "AOF manifest at {} has shard line invalid id `{}`: {}",
+                            manifest_path.display(),
+                            id_str,
+                            e,
+                        ),
+                    )
+                })?;
+                // Expect `max_lsn <lsn>`.
+                let label = it.next().unwrap_or("");
+                let val_str = it.next().unwrap_or("0");
+                if label != "max_lsn" {
+                    return Err(std::io::Error::new(
+                        std::io::ErrorKind::InvalidData,
+                        format!(
+                            "AOF manifest at {} shard {} expected `max_lsn`, got `{}`",
+                            manifest_path.display(),
+                            id,
+                            label,
+                        ),
+                    ));
+                }
+                let max_lsn: u64 = val_str.parse().map_err(|e| {
+                    std::io::Error::new(
+                        std::io::ErrorKind::InvalidData,
+                        format!(
+                            "AOF manifest at {} shard {} invalid max_lsn `{}`: {}",
+                            manifest_path.display(),
+                            id,
+                            val_str,
+                            e,
+                        ),
+                    )
+                })?;
+                shards.push(ShardManifest {
+                    shard_id: id,
+                    max_lsn,
+                });
+            }
+            // Unknown lines are tolerated (forward-compat). Strict parsers can
+            // be added at v3 if needed.
+        }
+
+        if seq == 0 {
+            return Err(std::io::Error::new(
+                std::io::ErrorKind::InvalidData,
+                format!(
+                    "AOF manifest at {} has no valid sequence number",
+                    manifest_path.display()
+                ),
+            ));
+        }
+
+        let expected = num_shards.ok_or_else(|| {
+            std::io::Error::new(
+                std::io::ErrorKind::InvalidData,
+                format!(
+                    "AOF manifest at {} is missing required `shards N` line",
+                    manifest_path.display()
+                ),
+            )
+        })?;
+
+        if shards.len() != expected as usize {
+            return Err(std::io::Error::new(
+                std::io::ErrorKind::InvalidData,
+                format!(
+                    "AOF manifest at {} declares shards={} but has {} shard records",
+                    manifest_path.display(),
+                    expected,
+                    shards.len(),
+                ),
+            ));
+        }
+
+        // Sort by shard_id and verify contiguous range [0, expected).
+        shards.sort_by_key(|s| s.shard_id);
+        for (i, s) in shards.iter().enumerate() {
+            if s.shard_id as usize != i {
+                return Err(std::io::Error::new(
+                    std::io::ErrorKind::InvalidData,
+                    format!(
+                        "AOF manifest at {} has non-contiguous shard ids (expected {} at position {}, got {})",
+                        manifest_path.display(),
+                        i,
+                        i,
+                        s.shard_id,
+                    ),
+                ));
+            }
+        }
+
+        Ok(Self {
+            dir: dir.to_path_buf(),
+            seq,
+            layout: AofLayout::PerShard,
+            shards,
+        })
     }
 
     /// Delete any base/incr files in `appendonlydir/` that do not match the
@@ -258,14 +571,33 @@ impl AofManifest {
     }
 
     /// Write the manifest file atomically (write tmp + rename).
+    ///
+    /// Emits v1 format for `TopLevel` and v2 for `PerShard`. The format is
+    /// selected by `self.layout`, never by callers — preserving the invariant
+    /// that one directory holds one layout.
     pub fn write_manifest(&self) -> std::io::Result<()> {
         let manifest_path = self.manifest_path();
         let tmp_path = manifest_path.with_extension("tmp");
 
-        let content = format!(
-            "seq {}\nbase moon.aof.{}.base.rdb\nincr moon.aof.{}.incr.aof\n",
-            self.seq, self.seq, self.seq
-        );
+        let content = match self.layout {
+            AofLayout::TopLevel => format!(
+                "seq {}\nbase moon.aof.{}.base.rdb\nincr moon.aof.{}.incr.aof\n",
+                self.seq, self.seq, self.seq
+            ),
+            AofLayout::PerShard => {
+                let mut s = String::with_capacity(64 + self.shards.len() * 40);
+                s.push_str("version 2\n");
+                s.push_str(&format!("seq {}\n", self.seq));
+                s.push_str(&format!("shards {}\n", self.shards.len()));
+                for shard in &self.shards {
+                    s.push_str(&format!(
+                        "shard {} max_lsn {}\n",
+                        shard.shard_id, shard.max_lsn
+                    ));
+                }
+                s
+            }
+        };
 
         let mut f = std::fs::File::create(&tmp_path)?;
         f.write_all(content.as_bytes())?;
@@ -274,6 +606,291 @@ impl AofManifest {
         Ok(())
     }
 
+    // ------------------------------------------------------------------
+    // Per-shard layout helpers
+    // ------------------------------------------------------------------
+
+    /// Directory holding a shard's AOF files.
+    ///
+    /// - `TopLevel`: `appendonlydir/` (the shard_id argument is asserted to be 0).
+    /// - `PerShard`: `appendonlydir/shard-{shard_id}/`.
+    pub fn shard_dir(&self, shard_id: u16) -> PathBuf {
+        match self.layout {
+            AofLayout::TopLevel => {
+                debug_assert_eq!(shard_id, 0, "TopLevel layout only has shard 0");
+                self.aof_dir()
+            }
+            AofLayout::PerShard => self.aof_dir().join(format!("shard-{}", shard_id)),
+        }
+    }
+
+    /// Path to a shard's base RDB file for the current sequence.
+    pub fn shard_base_path(&self, shard_id: u16) -> PathBuf {
+        self.shard_dir(shard_id)
+            .join(format!("moon.aof.{}.base.rdb", self.seq))
+    }
+
+    /// Path to a shard's incremental RESP file for the current sequence.
+    pub fn shard_incr_path(&self, shard_id: u16) -> PathBuf {
+        self.shard_dir(shard_id)
+            .join(format!("moon.aof.{}.incr.aof", self.seq))
+    }
+
+    /// Path to a shard's base RDB file for a given sequence.
+    pub fn shard_base_path_seq(&self, shard_id: u16, seq: u64) -> PathBuf {
+        self.shard_dir(shard_id)
+            .join(format!("moon.aof.{}.base.rdb", seq))
+    }
+
+    /// Path to a shard's incremental RESP file for a given sequence.
+    pub fn shard_incr_path_seq(&self, shard_id: u16, seq: u64) -> PathBuf {
+        self.shard_dir(shard_id)
+            .join(format!("moon.aof.{}.incr.aof", seq))
+    }
+
+    /// Maximum LSN persisted across all shards.
+    ///
+    /// Computed (not stored) so the stored value can never drift from
+    /// the per-shard records. Returns 0 if `shards` is empty (defensive;
+    /// constructors guarantee at least one shard).
+    pub fn global_max_lsn(&self) -> u64 {
+        self.shards.iter().map(|s| s.max_lsn).max().unwrap_or(0)
+    }
+
+    /// Verify that the manifest matches the runtime shard count.
+    ///
+    /// Returns the verbatim error from RFC § 3 if the shard count differs,
+    /// for operator-facing consistency. Callers (typically `main.rs` boot)
+    /// should treat this as fatal: continuing with a mismatched shard count
+    /// silently drops data from shards that no longer exist or replays a
+    /// shard's data into the wrong DashTable.
+    pub fn verify_shard_count(&self, expected: u16) -> Result<(), String> {
+        let actual = self.shards.len() as u16;
+        if actual != expected {
+            return Err(format!(
+                "ERR shard count changed (manifest={}, config={}); refusing to start to avoid data loss. See docs/runbooks/shard-count-change.md",
+                actual, expected
+            ));
+        }
+        Ok(())
+    }
+
+    /// Returns true if the on-disk layout under `appendonlydir/` matches the
+    /// legacy TopLevel format (files at top level, no `shard-N/` subdirs).
+    ///
+    /// Used by callers to detect when a v1 single-shard deployment is being
+    /// upgraded to v2 multi-shard and needs explicit migration. Does NOT
+    /// migrate — separate from `migrate_top_level_to_per_shard` so the side
+    /// effect is opt-in, not hidden in a load path.
+    pub fn is_legacy_top_level_layout(dir: &Path) -> bool {
+        let aof_dir = dir.join(AOF_DIR_NAME);
+        if !aof_dir.exists() {
+            return false;
+        }
+        let entries = match std::fs::read_dir(&aof_dir) {
+            Ok(e) => e,
+            Err(_) => return false,
+        };
+        for entry in entries.flatten() {
+            let name = entry.file_name();
+            let Some(name_str) = name.to_str() else {
+                continue;
+            };
+            if name_str.starts_with("moon.aof.")
+                && (name_str.ends_with(".base.rdb") || name_str.ends_with(".incr.aof"))
+            {
+                return true;
+            }
+        }
+        false
+    }
+
+    /// Migrate a single-shard TopLevel layout in place to a single-shard
+    /// PerShard layout.
+    ///
+    /// Moves `appendonlydir/moon.aof.{seq}.{base.rdb,incr.aof}` into
+    /// `appendonlydir/shard-0/`, then rewrites the manifest as v2 with
+    /// `shards 1`. Idempotent: a second call on an already-PerShard manifest
+    /// returns Ok with no filesystem changes.
+    ///
+    /// This is the RFC § 5 case 1 migration — zero data movement (rename only),
+    /// safe to run on first boot after upgrading from v0.1.x. Multi-shard
+    /// migrations from legacy AOF (case 2) use the `moon migrate-aof`
+    /// subcommand and are NOT handled here.
+    pub fn migrate_top_level_to_per_shard(&mut self) -> std::io::Result<()> {
+        if self.layout == AofLayout::PerShard {
+            return Ok(());
+        }
+        if self.shards.len() != 1 {
+            return Err(std::io::Error::new(
+                std::io::ErrorKind::InvalidInput,
+                format!(
+                    "migrate_top_level_to_per_shard called with {} shards; \
+                     only single-shard TopLevel can be migrated in place",
+                    self.shards.len()
+                ),
+            ));
+        }
+
+        // Compute paths up front. shard_dir/shard_*_path_seq for a single-
+        // shard target are pure path computations and do NOT depend on
+        // self.layout, so it is safe to derive them while layout is still
+        // TopLevel.
+        let old_base = self.aof_dir().join(format!("moon.aof.{}.base.rdb", self.seq));
+        let old_incr = self.aof_dir().join(format!("moon.aof.{}.incr.aof", self.seq));
+        let new_dir = self.aof_dir().join("shard-0");
+        let new_base = new_dir.join(format!("moon.aof.{}.base.rdb", self.seq));
+        let new_incr = new_dir.join(format!("moon.aof.{}.incr.aof", self.seq));
+
+        if !old_base.exists() {
+            // Pre-flight check: nothing moved yet, no rollback needed.
+            return Err(std::io::Error::new(
+                std::io::ErrorKind::NotFound,
+                format!(
+                    "TopLevel→PerShard migration: source base {} not found",
+                    old_base.display()
+                ),
+            ));
+        }
+        std::fs::create_dir_all(&new_dir)?;
+
+        // Move base. If this fails, no on-disk mutation happened yet — bail
+        // without rollback. Layout stays TopLevel until commit at the bottom.
+        std::fs::rename(&old_base, &new_base)?;
+
+        // Base is now in shard-0/. Any subsequent error must restore it.
+        let moved_incr: bool;
+        let created_incr: bool;
+        if old_incr.exists() {
+            if let Err(e) = std::fs::rename(&old_incr, &new_incr) {
+                if let Err(re) = std::fs::rename(&new_base, &old_base) {
+                    error!(
+                        "Migration rollback: failed to restore base {} → {}: {}",
+                        new_base.display(),
+                        old_base.display(),
+                        re
+                    );
+                }
+                return Err(e);
+            }
+            moved_incr = true;
+            created_incr = false;
+        } else {
+            match std::fs::File::create(&new_incr) {
+                Ok(_) => {
+                    moved_incr = false;
+                    created_incr = true;
+                }
+                Err(e) => {
+                    if let Err(re) = std::fs::rename(&new_base, &old_base) {
+                        error!(
+                            "Migration rollback: failed to restore base {} → {}: {}",
+                            new_base.display(),
+                            old_base.display(),
+                            re
+                        );
+                    }
+                    return Err(e);
+                }
+            }
+        }
+
+        // Commit: flip layout, persist as v2. If write_manifest fails, undo
+        // every filesystem mutation and restore layout so the next boot still
+        // sees a valid v1 TopLevel deployment.
+        self.layout = AofLayout::PerShard;
+        if let Err(e) = self.write_manifest() {
+            self.layout = AofLayout::TopLevel;
+            if moved_incr {
+                if let Err(re) = std::fs::rename(&new_incr, &old_incr) {
+                    error!(
+                        "Migration rollback: failed to restore incr {} → {}: {}",
+                        new_incr.display(),
+                        old_incr.display(),
+                        re
+                    );
+                }
+            } else if created_incr {
+                if let Err(re) = std::fs::remove_file(&new_incr) {
+                    warn!(
+                        "Migration rollback: failed to remove freshly created incr {}: {}",
+                        new_incr.display(),
+                        re
+                    );
+                }
+            }
+            if let Err(re) = std::fs::rename(&new_base, &old_base) {
+                error!(
+                    "Migration rollback: failed to restore base {} → {}: {}. \
+                     Manifest dir {} may be in an inconsistent state.",
+                    new_base.display(),
+                    old_base.display(),
+                    re,
+                    self.dir.display()
+                );
+            }
+            return Err(e);
+        }
+
+        info!(
+            "AOF migrated: TopLevel → PerShard (single shard) at {}",
+            self.aof_dir().display()
+        );
+        Ok(())
+    }
+
+    /// Create the `appendonlydir/` and write an initial v2 manifest for the
+    /// given shard count.
+    ///
+    /// Each shard gets its own `shard-{N}/` subdirectory with an empty base
+    /// RDB and an empty incr file. Mirrors `initialize()` semantics: the
+    /// `(base + incr)` invariant holds from the first boot, so recovery can
+    /// replay incr-only state without complaint.
+    pub fn initialize_multi(dir: &Path, num_shards: u16) -> std::io::Result<Self> {
+        if num_shards == 0 {
+            return Err(std::io::Error::new(
+                std::io::ErrorKind::InvalidInput,
+                "initialize_multi requires num_shards >= 1",
+            ));
+        }
+        let manifest = Self {
+            dir: dir.to_path_buf(),
+            seq: 1,
+            layout: AofLayout::PerShard,
+            shards: (0..num_shards)
+                .map(|id| ShardManifest {
+                    shard_id: id,
+                    max_lsn: 0,
+                })
+                .collect(),
+        };
+        std::fs::create_dir_all(manifest.aof_dir())?;
+
+        // Per-shard empty RDB. Single Database::default() inside a 1-element
+        // slice matches `initialize()`'s empty-RDB shape for each shard.
+        let empty_dbs: [crate::storage::Database; 0] = [];
+        let empty_rdb = crate::persistence::rdb::save_to_bytes(&empty_dbs)
+            .map_err(|e| std::io::Error::other(format!("empty RDB serialize: {e}")))?;
+
+        for shard_id in 0..num_shards {
+            let shard_dir = manifest.shard_dir(shard_id);
+            std::fs::create_dir_all(&shard_dir)?;
+
+            let base_path = manifest.shard_base_path(shard_id);
+            let tmp_path = base_path.with_extension("rdb.tmp");
+            {
+                let mut f = std::fs::File::create(&tmp_path)?;
+                f.write_all(&empty_rdb)?;
+                f.sync_data()?;
+            }
+            std::fs::rename(&tmp_path, &base_path)?;
+            std::fs::File::create(manifest.shard_incr_path(shard_id))?;
+        }
+
+        manifest.write_manifest()?;
+        Ok(manifest)
+    }
+
     /// Advance to the next sequence: write new base RDB, create new incr file,
     /// update manifest, delete old files.
     ///
@@ -517,3 +1134,301 @@ fn replay_incr_resp(
 
     Ok(count)
 }
+
+#[cfg(test)]
+mod tests_v2 {
+    //! Unit tests for the v2 (PerShard) manifest format.
+    //!
+    //! Covers the Step 1 deliverable of the per-shard AOF RFC:
+    //! - v1 manifests continue to load as TopLevel (single-shard, shard_id=0)
+    //! - v2 round-trip: write → load → equivalent struct shape
+    //! - shard count mismatch produces the verbatim RFC § 3 error
+    //! - migrate_top_level_to_per_shard performs in-place rename and rewrites
+    //!   the manifest as v2
+    //! - global_max_lsn computes max across shards
+    //! - is_legacy_top_level_layout detects top-level files
+
+    use super::*;
+    use std::fs;
+
+    fn temp_dir() -> PathBuf {
+        let d = std::env::temp_dir().join(format!(
+            "moon-aof-manifest-test-{}-{}",
+            std::process::id(),
+            std::time::SystemTime::now()
+                .duration_since(std::time::UNIX_EPOCH)
+                .map(|d| d.as_nanos())
+                .unwrap_or(0)
+        ));
+        fs::create_dir_all(&d).expect("temp dir create");
+        d
+    }
+
+    #[test]
+    fn v1_manifest_loads_as_top_level_single_shard() {
+        let dir = temp_dir();
+        let m = AofManifest::initialize(&dir).expect("initialize v1");
+
+        assert_eq!(m.layout, AofLayout::TopLevel);
+        assert_eq!(m.shards.len(), 1);
+        assert_eq!(m.shards[0].shard_id, 0);
+        assert_eq!(m.shards[0].max_lsn, 0);
+
+        // Reload from disk
+        let reloaded = AofManifest::load(&dir).expect("load").expect("present");
+        assert_eq!(reloaded.layout, AofLayout::TopLevel);
+        assert_eq!(reloaded.shards.len(), 1);
+        assert_eq!(reloaded.seq, m.seq);
+
+        fs::remove_dir_all(&dir).ok();
+    }
+
+    #[test]
+    fn v2_manifest_round_trips() {
+        let dir = temp_dir();
+        let m = AofManifest::initialize_multi(&dir, 4).expect("initialize_multi");
+
+        assert_eq!(m.layout, AofLayout::PerShard);
+        assert_eq!(m.shards.len(), 4);
+        for (i, s) in m.shards.iter().enumerate() {
+            assert_eq!(s.shard_id, i as u16);
+            assert_eq!(s.max_lsn, 0);
+        }
+
+        // Per-shard subdirs were created with empty base + incr.
+        for i in 0..4u16 {
+            assert!(m.shard_dir(i).exists(), "shard-{} dir exists", i);
+            assert!(m.shard_base_path(i).exists(), "shard-{} base exists", i);
+            assert!(m.shard_incr_path(i).exists(), "shard-{} incr exists", i);
+        }
+
+        let reloaded = AofManifest::load(&dir).expect("load").expect("present");
+        assert_eq!(reloaded.layout, AofLayout::PerShard);
+        assert_eq!(reloaded.shards.len(), 4);
+        assert_eq!(reloaded.seq, m.seq);
+        for (i, s) in reloaded.shards.iter().enumerate() {
+            assert_eq!(s.shard_id, i as u16);
+        }
+
+        fs::remove_dir_all(&dir).ok();
+    }
+
+    #[test]
+    fn verify_shard_count_emits_rfc_error_verbatim() {
+        let m = AofManifest {
+            dir: PathBuf::from("/tmp/nowhere"),
+            seq: 1,
+            layout: AofLayout::PerShard,
+            shards: vec![
+                ShardManifest { shard_id: 0, max_lsn: 0 },
+                ShardManifest { shard_id: 1, max_lsn: 0 },
+            ],
+        };
+        let err = m.verify_shard_count(4).expect_err("should mismatch");
+        assert_eq!(
+            err,
+            "ERR shard count changed (manifest=2, config=4); refusing to start to avoid data loss. See docs/runbooks/shard-count-change.md"
+        );
+
+        // Matching count succeeds.
+        m.verify_shard_count(2).expect("match");
+    }
+
+    #[test]
+    fn migrate_top_level_to_per_shard_moves_files_and_rewrites_manifest() {
+        let dir = temp_dir();
+        let mut m = AofManifest::initialize(&dir).expect("initialize v1");
+
+        // Write a marker into the incr file so we can prove the contents
+        // survive the rename.
+        let original_incr = m.aof_dir().join(format!("moon.aof.{}.incr.aof", m.seq));
+        fs::write(&original_incr, b"MARKER").expect("write incr marker");
+
+        m.migrate_top_level_to_per_shard().expect("migrate");
+
+        assert_eq!(m.layout, AofLayout::PerShard);
+        assert!(!original_incr.exists(), "old incr removed by rename");
+        let new_incr = m.shard_incr_path(0);
+        assert!(new_incr.exists(), "new shard-0 incr exists");
+        let contents = fs::read(&new_incr).expect("read new incr");
+        assert_eq!(contents, b"MARKER", "incr contents preserved");
+
+        // Reloaded manifest is v2.
+        let reloaded = AofManifest::load(&dir).expect("load").expect("present");
+        assert_eq!(reloaded.layout, AofLayout::PerShard);
+        assert_eq!(reloaded.shards.len(), 1);
+
+        // Idempotency: second call is a no-op.
+        m.migrate_top_level_to_per_shard().expect("idempotent");
+        assert_eq!(m.layout, AofLayout::PerShard);
+
+        fs::remove_dir_all(&dir).ok();
+    }
+
+    #[test]
+    fn global_max_lsn_returns_max_across_shards() {
+        let m = AofManifest {
+            dir: PathBuf::from("/tmp/nowhere"),
+            seq: 1,
+            layout: AofLayout::PerShard,
+            shards: vec![
+                ShardManifest { shard_id: 0, max_lsn: 100 },
+                ShardManifest { shard_id: 1, max_lsn: 500 },
+                ShardManifest { shard_id: 2, max_lsn: 250 },
+            ],
+        };
+        assert_eq!(m.global_max_lsn(), 500);
+    }
+
+    #[test]
+    fn is_legacy_top_level_layout_detects_v1_files() {
+        let dir = temp_dir();
+        // No appendonlydir yet → false.
+        assert!(!AofManifest::is_legacy_top_level_layout(&dir));
+
+        // After v1 initialize, top-level files present → true.
+        let _m = AofManifest::initialize(&dir).expect("init v1");
+        assert!(AofManifest::is_legacy_top_level_layout(&dir));
+
+        fs::remove_dir_all(&dir).ok();
+    }
+
+    #[test]
+    fn is_legacy_top_level_layout_returns_false_for_v2() {
+        let dir = temp_dir();
+        let _m = AofManifest::initialize_multi(&dir, 2).expect("init v2");
+        assert!(
+            !AofManifest::is_legacy_top_level_layout(&dir),
+            "v2 layout has no top-level moon.aof.* files"
+        );
+
+        fs::remove_dir_all(&dir).ok();
+    }
+
+    #[test]
+    fn parse_v2_rejects_shard_count_mismatch_in_file() {
+        let dir = temp_dir();
+        let aof = dir.join(AOF_DIR_NAME);
+        fs::create_dir_all(&aof).unwrap();
+        // Manifest claims shards 3 but only declares two shard records.
+        fs::write(
+            aof.join(MANIFEST_NAME),
+            "version 2\nseq 1\nshards 3\nshard 0 max_lsn 0\nshard 1 max_lsn 0\n",
+        )
+        .unwrap();
+
+        let err = AofManifest::load(&dir).expect_err("should reject");
+        let msg = err.to_string();
+        assert!(
+            msg.contains("declares shards=3 but has 2 shard records"),
+            "got: {}",
+            msg
+        );
+
+        fs::remove_dir_all(&dir).ok();
+    }
+
+    #[test]
+    fn parse_v2_rejects_non_contiguous_shard_ids() {
+        let dir = temp_dir();
+        let aof = dir.join(AOF_DIR_NAME);
+        fs::create_dir_all(&aof).unwrap();
+        // shards=2 but ids are {0, 2} not {0, 1}.
+        fs::write(
+            aof.join(MANIFEST_NAME),
+            "version 2\nseq 1\nshards 2\nshard 0 max_lsn 0\nshard 2 max_lsn 0\n",
+        )
+        .unwrap();
+
+        let err = AofManifest::load(&dir).expect_err("should reject");
+        let msg = err.to_string();
+        assert!(msg.contains("non-contiguous shard ids"), "got: {}", msg);
+
+        fs::remove_dir_all(&dir).ok();
+    }
+
+    // ------------------------------------------------------------------
+    // Reviewer-flagged fixes: layout-aware path helpers + migration
+    // rollback. See the "Verify findings against current code" review
+    // comment on aof_manifest.rs:669-775 and :688-717.
+    // ------------------------------------------------------------------
+
+    #[test]
+    fn base_incr_paths_route_to_shard_zero_after_migration() {
+        let dir = temp_dir();
+        let mut m = AofManifest::initialize(&dir).expect("init v1");
+        // Pre-migration: TopLevel paths under appendonlydir/ directly.
+        assert_eq!(m.base_path(), m.aof_dir().join("moon.aof.1.base.rdb"));
+        assert_eq!(m.incr_path(), m.aof_dir().join("moon.aof.1.incr.aof"));
+
+        m.migrate_top_level_to_per_shard().expect("migrate");
+
+        // Post-migration: single-file helpers must route to shard-0/ so
+        // replay_multi_part and advance() find the actual files. This is
+        // the bug the reviewer flagged for aof_manifest.rs:669-775.
+        let shard0 = m.aof_dir().join("shard-0");
+        assert_eq!(m.base_path(), shard0.join("moon.aof.1.base.rdb"));
+        assert_eq!(m.incr_path(), shard0.join("moon.aof.1.incr.aof"));
+        assert_eq!(m.base_path_seq(7), shard0.join("moon.aof.7.base.rdb"));
+        assert_eq!(m.incr_path_seq(7), shard0.join("moon.aof.7.incr.aof"));
+        // The path the helper returns must be where the file actually lives.
+        assert!(m.base_path().exists(), "base file at returned path");
+        assert!(m.incr_path().exists(), "incr file at returned path");
+
+        fs::remove_dir_all(&dir).ok();
+    }
+
+    #[test]
+    fn migrate_rolls_back_filesystem_when_incr_rename_fails() {
+        // Simulate the rename(old_incr → new_incr) failure path by making
+        // the destination already exist as a directory (rename onto a
+        // non-empty directory is an error on every supported OS).
+        let dir = temp_dir();
+        let mut m = AofManifest::initialize(&dir).expect("init v1");
+        let original_base = m.aof_dir().join("moon.aof.1.base.rdb");
+        let original_incr = m.aof_dir().join("moon.aof.1.incr.aof");
+        fs::write(&original_incr, b"INCR_MARKER").expect("seed incr");
+        let base_bytes_before = fs::read(&original_base).expect("read base");
+
+        // Pre-create shard-0/moon.aof.1.incr.aof as a DIRECTORY so the
+        // rename fails after the base rename has already succeeded.
+        let shard0 = m.aof_dir().join("shard-0");
+        fs::create_dir_all(shard0.join("moon.aof.1.incr.aof")).expect("seed blocker");
+
+        let err = m
+            .migrate_top_level_to_per_shard()
+            .expect_err("incr rename should fail");
+        let _ = err; // exact error kind depends on OS
+
+        // Rollback invariants:
+        //   1. Layout stays TopLevel in memory.
+        //   2. base file restored to its original TopLevel path.
+        //   3. base file contents unchanged.
+        //   4. on-disk manifest is still v1 (load returns layout TopLevel).
+        assert_eq!(m.layout, AofLayout::TopLevel, "in-memory layout reverted");
+        assert!(original_base.exists(), "base restored to top-level");
+        let base_bytes_after = fs::read(&original_base).expect("read base");
+        assert_eq!(base_bytes_after, base_bytes_before, "base contents intact");
+        let reloaded = AofManifest::load(&dir).expect("load").expect("present");
+        assert_eq!(reloaded.layout, AofLayout::TopLevel, "on-disk manifest v1");
+
+        fs::remove_dir_all(&dir).ok();
+    }
+
+    #[test]
+    fn migrate_does_not_mutate_on_missing_base() {
+        let dir = temp_dir();
+        let mut m = AofManifest::initialize(&dir).expect("init v1");
+        let base = m.aof_dir().join("moon.aof.1.base.rdb");
+        fs::remove_file(&base).expect("remove base");
+
+        let err = m
+            .migrate_top_level_to_per_shard()
+            .expect_err("missing base should fail");
+        assert_eq!(err.kind(), std::io::ErrorKind::NotFound);
+        // Layout never flipped, no rollback needed.
+        assert_eq!(m.layout, AofLayout::TopLevel);
+
+        fs::remove_dir_all(&dir).ok();
+    }
+}
diff --git a/src/replication/state.rs b/src/replication/state.rs
index e5e357c0..1a596faa 100644
--- a/src/replication/state.rs
+++ b/src/replication/state.rs
@@ -113,10 +113,34 @@ impl ReplicationState {
     /// Increment the offset for the given shard by delta bytes.
     /// Also adds delta to master_repl_offset.
     pub fn increment_shard_offset(&self, shard_id: usize, delta: u64) {
-        if shard_id < self.shard_offsets.len() {
-            self.shard_offsets[shard_id].fetch_add(delta, Ordering::Relaxed);
-            self.master_repl_offset.fetch_add(delta, Ordering::Relaxed);
+        let _ = self.issue_lsn(shard_id, delta);
+    }
+
+    /// Atomically issue an LSN for a write and advance per-shard +
+    /// master replication offsets by `delta`.
+    ///
+    /// Returns the LSN that uniquely identifies this write — equal to the
+    /// value of `master_repl_offset` BEFORE the increment, mirroring Redis's
+    /// `+ delta - delta` semantics. The same LSN MUST tag the corresponding
+    /// `AofMessage::Append` entry and the replication backlog entry for that
+    /// write so per-shard AOF replay can rebuild a globally consistent log
+    /// (per-shard AOF RFC § 2 Rule 3).
+    ///
+    /// Atomicity caveat: the per-shard offset advance and the master offset
+    /// advance are TWO separate `fetch_add`s, not one composite op. Concurrent
+    /// callers across shards observe a brief window where the master sum
+    /// disagrees with the sum of shard offsets. Acceptable today because the
+    /// only `total_offset()` consumer is INFO replication, which tolerates
+    /// transient skew. Do not promote to a hard invariant without redesign.
+    ///
+    /// Returns 0 if `shard_id` is out of range (defensive; production callers
+    /// must pass a valid id).
+    pub fn issue_lsn(&self, shard_id: usize, delta: u64) -> u64 {
+        if shard_id >= self.shard_offsets.len() {
+            return 0;
         }
+        self.shard_offsets[shard_id].fetch_add(delta, Ordering::Relaxed);
+        self.master_repl_offset.fetch_add(delta, Ordering::Relaxed)
     }
 
     /// Returns sum of all per-shard offsets.
diff --git a/src/server/conn/blocking.rs b/src/server/conn/blocking.rs
index 8846d0c8..7cf92727 100644
--- a/src/server/conn/blocking.rs
+++ b/src/server/conn/blocking.rs
@@ -1135,7 +1135,10 @@ pub(crate) fn try_inline_dispatch(
     shard_databases: &std::sync::Arc<ShardDatabases>,
     shard_id: usize,
     selected_db: usize,
-    aof_tx: &Option<channel::MpscSender<crate::persistence::aof::AofMessage>>,
+    aof_pool: &Option<std::sync::Arc<crate::persistence::aof::AofWriterPool>>,
+    repl_state: &Option<
+        std::sync::Arc<std::sync::RwLock<crate::replication::state::ReplicationState>>,
+    >,
     now_ms: u64,
     num_shards: usize,
     can_inline_writes: bool,
@@ -1346,8 +1349,20 @@ pub(crate) fn try_inline_dispatch(
     }
 
     // AOF: reuse the frozen RESP bytes directly (Arc clone, zero-copy).
-    if let Some(tx) = aof_tx {
-        let _ = tx.try_send(crate::persistence::aof::AofMessage::Append(frozen));
+    // This path is monoio inline GET/SET — the writer for the local shard
+    // (shard_id) owns the AOF record; under PerShard layout that routes
+    // to shard_id's writer. LSN must be sourced from `repl_state` so the
+    // inline path's writes share an LSN namespace with the non-inline path
+    // — otherwise per-shard replay merge in RFC step 5 would see two
+    // disjoint LSN streams per shard. Cost: one extra read-lock acquire
+    // (uncontended) + one atomic fetch_add per inline SET.
+    if let Some(pool) = aof_pool {
+        let lsn = crate::persistence::aof::AofWriterPool::issue_append_lsn(
+            repl_state,
+            shard_id,
+            frozen.len(),
+        );
+        pool.try_send_append(shard_id, lsn, frozen);
     }
 
     write_buf.extend_from_slice(b"+OK\r\n");
@@ -1363,7 +1378,10 @@ pub(crate) fn try_inline_dispatch_loop(
     shard_databases: &std::sync::Arc<ShardDatabases>,
     shard_id: usize,
     selected_db: usize,
-    aof_tx: &Option<channel::MpscSender<crate::persistence::aof::AofMessage>>,
+    aof_pool: &Option<std::sync::Arc<crate::persistence::aof::AofWriterPool>>,
+    repl_state: &Option<
+        std::sync::Arc<std::sync::RwLock<crate::replication::state::ReplicationState>>,
+    >,
     now_ms: u64,
     num_shards: usize,
     can_inline_writes: bool,
@@ -1377,7 +1395,8 @@ pub(crate) fn try_inline_dispatch_loop(
             shard_databases,
             shard_id,
             selected_db,
-            aof_tx,
+            aof_pool,
+            repl_state,
             now_ms,
             num_shards,
             can_inline_writes,
diff --git a/src/server/conn/core.rs b/src/server/conn/core.rs
index 43cef452..dc2e2365 100644
--- a/src/server/conn/core.rs
+++ b/src/server/conn/core.rs
@@ -20,7 +20,7 @@ use std::sync::Arc;
 use crate::acl::{AclLog, AclTable};
 use crate::blocking::BlockingRegistry;
 use crate::config::{RuntimeConfig, ServerConfig};
-use crate::persistence::aof::AofMessage;
+use crate::persistence::aof::AofWriterPool;
 use crate::protocol::Frame;
 use crate::pubsub::PubSubRegistry;
 use crate::runtime::channel;
@@ -47,7 +47,13 @@ pub(crate) struct ConnectionContext {
     pub pubsub_registry: Arc<parking_lot::RwLock<PubSubRegistry>>,
     pub blocking_registry: Rc<RefCell<BlockingRegistry>>,
     pub requirepass: Option<String>,
-    pub aof_tx: Option<channel::MpscSender<AofMessage>>,
+    /// AOF writer pool — the **sole AOF interface** after the 2d/2e migration
+    /// sequence. Built by spawn sites in `shard/conn_accept.rs` from the
+    /// on-disk manifest layout: TopLevel wraps a single shared writer,
+    /// PerShard owns one sender per shard. `try_send_append(shard_id, bytes)`
+    /// routes to the owning shard; `try_send_rewrite(msg)` rejects under
+    /// PerShard until per-shard rewrite ships (step 6 of the RFC).
+    pub aof_pool: Option<Arc<AofWriterPool>>,
     pub tracking_table: Rc<RefCell<TrackingTable>>,
     pub repl_state: Option<Arc<StdRwLock<crate::replication::state::ReplicationState>>>,
     /// Lock-free mirror of `repl_state.role == Replica { .. }`.
@@ -96,7 +102,7 @@ impl ConnectionContext {
         pubsub_registry: Arc<parking_lot::RwLock<PubSubRegistry>>,
         blocking_registry: Rc<RefCell<BlockingRegistry>>,
         requirepass: Option<String>,
-        aof_tx: Option<channel::MpscSender<AofMessage>>,
+        aof_pool: Option<Arc<AofWriterPool>>,
         tracking_table: Rc<RefCell<TrackingTable>>,
         repl_state: Option<Arc<StdRwLock<crate::replication::state::ReplicationState>>>,
         cluster_state: Option<Arc<StdRwLock<crate::cluster::ClusterState>>>,
@@ -136,7 +142,7 @@ impl ConnectionContext {
             pubsub_registry,
             blocking_registry,
             requirepass,
-            aof_tx,
+            aof_pool,
             tracking_table,
             repl_state,
             is_replica_mirror,
diff --git a/src/server/conn/handler_monoio/dispatch.rs b/src/server/conn/handler_monoio/dispatch.rs
index 7182cd68..b4e5485f 100644
--- a/src/server/conn/handler_monoio/dispatch.rs
+++ b/src/server/conn/handler_monoio/dispatch.rs
@@ -978,9 +978,9 @@ pub(super) fn try_handle_persistence(
         return true;
     }
     if cmd.eq_ignore_ascii_case(b"BGREWRITEAOF") {
-        if let Some(ref tx) = ctx.aof_tx {
+        if let Some(ref pool) = ctx.aof_pool {
             responses.push(crate::command::persistence::bgrewriteaof_start_sharded(
-                tx,
+                pool,
                 ctx.shard_databases.clone(),
             ));
         } else {
diff --git a/src/server/conn/handler_monoio/mod.rs b/src/server/conn/handler_monoio/mod.rs
index 72eb1413..f2d8debb 100644
--- a/src/server/conn/handler_monoio/mod.rs
+++ b/src/server/conn/handler_monoio/mod.rs
@@ -20,7 +20,7 @@ use std::rc::Rc;
 
 use crate::command::metadata;
 use crate::command::{DispatchResult, dispatch, dispatch_read, is_dispatch_read_supported};
-use crate::persistence::aof::{self, AofMessage};
+use crate::persistence::aof;
 use crate::protocol::Frame;
 use crate::shard::dispatch::key_to_shard;
 use crate::shard::mesh::ChannelMesh;
@@ -483,7 +483,8 @@ pub(crate) async fn handle_connection_sharded_monoio<
                 &ctx.shard_databases,
                 ctx.shard_id,
                 conn.selected_db,
-                &ctx.aof_tx,
+                &ctx.aof_pool,
+                &ctx.repl_state,
                 ctx.cached_clock.ms(),
                 ctx.num_shards,
                 can_inline_writes,
@@ -1066,7 +1067,7 @@ pub(crate) async fn handle_connection_sharded_monoio<
             }
 
             // Pre-classify write commands for AOF + tracking
-            let is_write = if ctx.aof_tx.is_some() || conn.tracking_state.enabled {
+            let is_write = if ctx.aof_pool.is_some() || conn.tracking_state.enabled {
                 metadata::is_write(cmd)
             } else {
                 false
@@ -1121,9 +1122,14 @@ pub(crate) async fn handle_connection_sharded_monoio<
                     };
                     // AOF only on actual success (:1). Matches handler_single.
                     if matches!(response, Frame::Integer(1)) {
-                        if let Some(ref tx) = ctx.aof_tx {
+                        if let Some(ref pool) = ctx.aof_pool {
                             let serialized = aof::serialize_command(&frame);
-                            let _ = tx.try_send(AofMessage::Append(serialized));
+                            let lsn = aof::AofWriterPool::issue_append_lsn(
+                                &ctx.repl_state,
+                                ctx.shard_id,
+                                serialized.len(),
+                            );
+                            pool.try_send_append(ctx.shard_id, lsn, serialized);
                         }
                     }
                     responses.push(response);
@@ -1186,9 +1192,14 @@ pub(crate) async fn handle_connection_sharded_monoio<
                         // AOF only on actual success (:1). Matches handler_single
                         // — `:0` (key absent / dst exists w/o REPLACE) is a no-op.
                         if matches!(response, Frame::Integer(1)) {
-                            if let Some(ref tx) = ctx.aof_tx {
+                            if let Some(ref pool) = ctx.aof_pool {
                                 let serialized = aof::serialize_command(&frame);
-                                let _ = tx.try_send(AofMessage::Append(serialized));
+                                let lsn = aof::AofWriterPool::issue_append_lsn(
+                                    &ctx.repl_state,
+                                    ctx.shard_id,
+                                    serialized.len(),
+                                );
+                                pool.try_send_append(ctx.shard_id, lsn, serialized);
                             }
                         }
                         responses.push(response);
@@ -1535,9 +1546,14 @@ pub(crate) async fn handle_connection_sharded_monoio<
 
                     // AOF logging for successful local writes
                     if !matches!(response, Frame::Error(_)) && is_write {
-                        if let Some(ref tx) = ctx.aof_tx {
+                        if let Some(ref pool) = ctx.aof_pool {
                             let serialized = aof::serialize_command(&frame);
-                            let _ = tx.try_send(AofMessage::Append(serialized));
+                            let lsn = aof::AofWriterPool::issue_append_lsn(
+                                &ctx.repl_state,
+                                ctx.shard_id,
+                                serialized.len(),
+                            );
+                            pool.try_send_append(ctx.shard_id, lsn, serialized);
                         }
                     }
 
@@ -1768,7 +1784,7 @@ pub(crate) async fn handle_connection_sharded_monoio<
                 let resp_idx = responses.len();
                 responses.push(Frame::Null); // placeholder, filled after batch dispatch
                 // Pre-compute AOF bytes before moving frame into Arc
-                let aof_bytes = if ctx.aof_tx.is_some() && metadata::is_write(cmd) {
+                let aof_bytes = if ctx.aof_pool.is_some() && metadata::is_write(cmd) {
                     Some(aof::serialize_command(&dispatch_frame))
                 } else {
                     None
@@ -1837,7 +1853,11 @@ pub(crate) async fn handle_connection_sharded_monoio<
         if !remote_groups.is_empty() {
             reply_futures.clear();
 
+            // Capture `target` per batch so the cross-shard AOF write at the bottom
+            // of the loop can route to the owning shard's pool (not ctx.shard_id —
+            // mirrors the load-bearing fix at handler_sharded/mod.rs:1651).
             let mut oneshot_futures: Vec<(
+                usize, // target shard — owner for AOF append
                 Vec<(usize, Option<Bytes>, Bytes)>,
                 channel::OneshotReceiver<Vec<Frame>>,
             )> = Vec::new();
@@ -1881,7 +1901,7 @@ pub(crate) async fn handle_connection_sharded_monoio<
                         }
                     }
                 }
-                oneshot_futures.push((meta, reply_rx));
+                oneshot_futures.push((target, meta, reply_rx));
             }
 
             // Poll all shard responses via pending_wakers relay (monoio cross-thread waker fix).
@@ -1889,7 +1909,7 @@ pub(crate) async fn handle_connection_sharded_monoio<
             // Instead, the connection task registers its waker in pending_wakers; the event
             // loop drains and wakes them after every SPSC cycle (~1ms). On wake, try_recv()
             // checks if the response arrived; if not, re-register and yield again.
-            for (meta, reply_rx) in oneshot_futures.drain(..) {
+            for (target, meta, reply_rx) in oneshot_futures.drain(..) {
                 tracing::trace!(
                     "Shard {}: awaiting cross-shard response via pending_wakers",
                     ctx.shard_id
@@ -1931,11 +1951,24 @@ pub(crate) async fn handle_connection_sharded_monoio<
                 };
                 for ((resp_idx, aof_bytes, cmd_name), resp) in meta.into_iter().zip(shard_responses)
                 {
-                    // AOF logging for successful remote writes
+                    // AOF logging for successful remote writes.
+                    // Owner shard is `target` (NOT ctx.shard_id) — under PerShard
+                    // layout the write must land in the target shard's AOF file
+                    // since that shard owns the mutated data. Mirrors the
+                    // load-bearing fix at handler_sharded/mod.rs:1651.
                     if let Some(bytes) = aof_bytes {
                         if !matches!(resp, Frame::Error(_)) {
-                            if let Some(ref tx) = ctx.aof_tx {
-                                let _ = tx.try_send(AofMessage::Append(bytes));
+                            if let Some(ref pool) = ctx.aof_pool {
+                                // Cross-shard write: LSN must be sourced
+                                // using `target`'s shard_id so the
+                                // per-shard offset increment lands on the
+                                // shard that owns the mutated data.
+                                let lsn = aof::AofWriterPool::issue_append_lsn(
+                                    &ctx.repl_state,
+                                    target,
+                                    bytes.len(),
+                                );
+                                pool.try_send_append(target, lsn, bytes);
                             }
                         }
                     }
diff --git a/src/server/conn/handler_sharded/dispatch.rs b/src/server/conn/handler_sharded/dispatch.rs
index e73b77f2..bcd4e98a 100644
--- a/src/server/conn/handler_sharded/dispatch.rs
+++ b/src/server/conn/handler_sharded/dispatch.rs
@@ -353,9 +353,9 @@ pub(super) fn try_handle_persistence(
         return true;
     }
     if cmd.eq_ignore_ascii_case(b"BGREWRITEAOF") {
-        if let Some(ref tx) = ctx.aof_tx {
+        if let Some(ref pool) = ctx.aof_pool {
             responses.push(crate::command::persistence::bgrewriteaof_start_sharded(
-                tx,
+                pool,
                 ctx.shard_databases.clone(),
             ));
         } else {
diff --git a/src/server/conn/handler_sharded/mod.rs b/src/server/conn/handler_sharded/mod.rs
index 83256ca8..7f41d655 100644
--- a/src/server/conn/handler_sharded/mod.rs
+++ b/src/server/conn/handler_sharded/mod.rs
@@ -15,7 +15,7 @@ use std::collections::HashMap;
 use crate::command::connection as conn_cmd;
 use crate::command::metadata;
 use crate::command::{DispatchResult, dispatch, dispatch_read};
-use crate::persistence::aof::{self, AofMessage};
+use crate::persistence::aof;
 use crate::protocol::Frame;
 use crate::shard::dispatch::{ShardMessage, key_to_shard};
 use crate::shard::mesh::ChannelMesh;
@@ -1119,8 +1119,8 @@ pub(crate) async fn handle_connection_sharded_inner<
                         }
                     }
 
-                    let is_write = if ctx.aof_tx.is_some() || conn.tracking_state.enabled { metadata::is_write(cmd) } else { false };
-                    let aof_bytes = if is_write && ctx.aof_tx.is_some() { Some(aof::serialize_command(&frame)) } else { None };
+                    let is_write = if ctx.aof_pool.is_some() || conn.tracking_state.enabled { metadata::is_write(cmd) } else { false };
+                    let aof_bytes = if is_write && ctx.aof_pool.is_some() { Some(aof::serialize_command(&frame)) } else { None };
 
                     if is_local {
                         // LOCAL PATH: split into read/write to avoid exclusive lock on reads.
@@ -1172,7 +1172,10 @@ pub(crate) async fn handle_connection_sharded_inner<
                             // — `:0` (key absent) is a no-op and must not log.
                             if matches!(response, Frame::Integer(1)) {
                                 if let Some(ref bytes) = aof_bytes {
-                                    if let Some(ref tx) = ctx.aof_tx { let _ = tx.try_send(AofMessage::Append(bytes.clone())); }
+                                    if let Some(ref pool) = ctx.aof_pool {
+                                        let lsn = aof::AofWriterPool::issue_append_lsn(&ctx.repl_state, ctx.shard_id, bytes.len());
+                                        pool.try_send_append(ctx.shard_id, lsn, bytes.clone());
+                                    }
                                 }
                             }
                             responses.push(response);
@@ -1216,7 +1219,10 @@ pub(crate) async fn handle_connection_sharded_inner<
                                 // — `:0` (key absent / dst exists w/o REPLACE) is a no-op.
                                 if matches!(response, Frame::Integer(1)) {
                                     if let Some(ref bytes) = aof_bytes {
-                                        if let Some(ref tx) = ctx.aof_tx { let _ = tx.try_send(AofMessage::Append(bytes.clone())); }
+                                        if let Some(ref pool) = ctx.aof_pool {
+                                            let lsn = aof::AofWriterPool::issue_append_lsn(&ctx.repl_state, ctx.shard_id, bytes.len());
+                                            pool.try_send_append(ctx.shard_id, lsn, bytes.clone());
+                                        }
                                     }
                                 }
                                 responses.push(response);
@@ -1427,7 +1433,10 @@ pub(crate) async fn handle_connection_sharded_inner<
                             }
                             if let Some(bytes) = aof_bytes {
                                 if !matches!(response, Frame::Error(_)) {
-                                    if let Some(ref tx) = ctx.aof_tx { let _ = tx.try_send(AofMessage::Append(bytes)); }
+                                    if let Some(ref pool) = ctx.aof_pool {
+                                        let lsn = aof::AofWriterPool::issue_append_lsn(&ctx.repl_state, ctx.shard_id, bytes.len());
+                                        pool.try_send_append(ctx.shard_id, lsn, bytes);
+                                    }
                                 }
                             }
                             if conn.tracking_state.enabled && !matches!(response, Frame::Error(_)) {
@@ -1646,9 +1655,20 @@ pub(crate) async fn handle_connection_sharded_inner<
                     for (meta, target) in reply_futures {
                         let shard_responses = response_pool.future_for(target).await;
                         for ((resp_idx, aof_bytes, cmd_name), resp) in meta.into_iter().zip(shard_responses) {
+                            // AOF logging for successful remote writes.
+                            // Owner shard is `target` (NOT ctx.shard_id) — under PerShard
+                            // layout the write must land in the target shard's AOF file
+                            // since that shard owns the mutated data. This was the
+                            // pre-existing routing bug that motivated the per-shard AOF
+                            // RFC (Option B): under TopLevel a single writer absorbed
+                            // every cross-shard append, masking the wrong-owner write.
                             if let Some(bytes) = aof_bytes {
                                 if !matches!(resp, Frame::Error(_)) {
-                                    if let Some(ref tx) = ctx.aof_tx { let _ = tx.try_send(AofMessage::Append(bytes)); }
+                                    if let Some(ref pool) = ctx.aof_pool {
+                                        // Cross-shard: LSN sourced for `target`.
+                                        let lsn = aof::AofWriterPool::issue_append_lsn(&ctx.repl_state, target, bytes.len());
+                                        pool.try_send_append(target, lsn, bytes);
+                                    }
                                 }
                             }
                             responses[resp_idx] = apply_resp3_conversion(&cmd_name, resp, proto_ver);
diff --git a/src/server/conn/handler_single.rs b/src/server/conn/handler_single.rs
index 4a626cc7..6aa4f3ad 100644
--- a/src/server/conn/handler_single.rs
+++ b/src/server/conn/handler_single.rs
@@ -42,7 +42,10 @@ use crate::server::codec::RespCodec;
 /// When `requirepass` is set, clients must authenticate via AUTH before any other
 /// commands are accepted (except QUIT).
 ///
-/// When `aof_tx` is provided, write commands are logged to the AOF file.
+/// When `aof_pool` is provided, write commands are logged via the per-shard
+/// AOF writer pool. handler_single is single-shard mode by definition
+/// (num_shards = 1, shard_id = 0), so the pool is always a TopLevel layout
+/// wrapping a single writer sender — see `AofWriterPool::top_level`.
 /// When `change_counter` is provided, write commands increment the counter for auto-save.
 ///
 /// Supports Pub/Sub subscriber mode: when a client subscribes to channels/patterns,
@@ -61,7 +64,7 @@ pub async fn handle_connection(
     shutdown: CancellationToken,
     requirepass: Option<String>,
     config: Arc<ServerConfig>,
-    aof_tx: Option<channel::MpscSender<AofMessage>>,
+    aof_pool: Option<Arc<crate::persistence::aof::AofWriterPool>>,
     change_counter: Option<Arc<AtomicU64>>,
     pubsub_registry: Arc<Mutex<PubSubRegistry>>,
     runtime_config: Arc<parking_lot::RwLock<RuntimeConfig>>,
@@ -608,8 +611,8 @@ pub async fn handle_connection(
                         }
                         // BGREWRITEAOF
                         if cmd.eq_ignore_ascii_case(b"BGREWRITEAOF") {
-                            let response = if let Some(ref tx) = aof_tx {
-                                crate::command::persistence::bgrewriteaof_start(tx, db.clone())
+                            let response = if let Some(ref pool) = aof_pool {
+                                crate::command::persistence::bgrewriteaof_start(pool, db.clone())
                             } else {
                                 Frame::Error(Bytes::from_static(b"ERR AOF is not enabled"))
                             };
@@ -655,7 +658,11 @@ pub async fn handle_connection(
                                         // WAL must be durable BEFORE the swap (no rollback
                                         // path for SWAPDB). Try-send first; on failure return
                                         // an error and leave both DBs untouched.
-                                        let wal_ok = if let Some(ref tx) = aof_tx {
+                                        // Drop down to the pool sender so we can still observe
+                                        // try_send's Result (the fire-and-forget
+                                        // pool.try_send_append loses the SendFailed signal we
+                                        // need to abort the swap cleanly).
+                                        let wal_ok = if let Some(ref pool) = aof_pool {
                                             let mut a_buf = itoa::Buffer::new();
                                             let mut b_buf = itoa::Buffer::new();
                                             let wal_frame = Frame::Array(crate::framevec![
@@ -671,12 +678,16 @@ pub async fn handle_connection(
                                                 crate::persistence::aof::serialize_command(
                                                     &wal_frame,
                                                 );
-                                            tx.try_send(
-                                                crate::persistence::aof::AofMessage::Append(
-                                                    serialized,
-                                                ),
-                                            )
-                                            .is_ok()
+                                            // Single-shard mode — shard_id = 0.
+                                            let lsn = crate::persistence::aof::AofWriterPool::issue_append_lsn(&repl_state, 0, serialized.len());
+                                            pool.sender(0)
+                                                .try_send(
+                                                    crate::persistence::aof::AofMessage::Append {
+                                                        lsn,
+                                                        bytes: serialized,
+                                                    },
+                                                )
+                                                .is_ok()
                                         } else {
                                             true // persistence disabled — no durability requirement
                                         };
@@ -878,8 +889,16 @@ pub async fn handle_connection(
                             }
                             // Send AOF entries accumulated so far
                             for bytes in aof_entries.drain(..) {
-                                if let Some(ref tx) = aof_tx {
-                                    let _ = tx.send_async(AofMessage::Append(bytes)).await;
+                                if let Some(ref pool) = aof_pool {
+                                    // Single-shard mode (shard_id = 0). send_async
+                                    // preserves back-pressure semantics from the
+                                    // pre-pool code; the pool's TopLevel layout
+                                    // routes to the same single writer.
+                                    let lsn = crate::persistence::aof::AofWriterPool::issue_append_lsn(&repl_state, 0, bytes.len());
+                                    let _ = pool
+                                        .sender(0)
+                                        .send_async(AofMessage::Append { lsn, bytes })
+                                        .await;
                                 }
                                 if let Some(ref counter) = change_counter {
                                     counter.fetch_add(1, Ordering::Relaxed);
@@ -1510,8 +1529,14 @@ pub async fn handle_connection(
                                         (resp, records)
                                     };
                                     for record in wal_records {
-                                        if let Some(ref tx) = aof_tx {
-                                            let _ = tx.send_async(AofMessage::Append(bytes::Bytes::from(record))).await;
+                                        if let Some(ref pool) = aof_pool {
+                                            // Single-shard mode (shard_id = 0).
+                                            let bytes = bytes::Bytes::from(record);
+                                            let lsn = crate::persistence::aof::AofWriterPool::issue_append_lsn(&repl_state, 0, bytes.len());
+                                            let _ = pool
+                                                .sender(0)
+                                                .send_async(AofMessage::Append { lsn, bytes })
+                                                .await;
                                         }
                                         if let Some(ref counter) = change_counter {
                                             counter.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
@@ -1528,7 +1553,7 @@ pub async fn handle_connection(
                             let is_write = metadata::is_write(cmd);
 
                             // Serialize for AOF before dispatch
-                            let aof_bytes = if is_write && aof_tx.is_some() {
+                            let aof_bytes = if is_write && aof_pool.is_some() {
                                 let mut buf = BytesMut::new();
                                 crate::protocol::serialize::serialize(&frame, &mut buf);
                                 Some(buf.freeze())
@@ -2232,8 +2257,14 @@ pub async fn handle_connection(
 
                 // --- Send AOF entries OUTSIDE the lock ---
                 for bytes in aof_entries {
-                    if let Some(ref tx) = aof_tx {
-                        let _ = tx.send_async(AofMessage::Append(bytes)).await;
+                    if let Some(ref pool) = aof_pool {
+                        // Single-shard mode (shard_id = 0). send_async preserves
+                        // back-pressure semantics from the pre-pool code.
+                        let lsn = crate::persistence::aof::AofWriterPool::issue_append_lsn(&repl_state, 0, bytes.len());
+                        let _ = pool
+                            .sender(0)
+                            .send_async(AofMessage::Append { lsn, bytes })
+                            .await;
                     }
                     if let Some(ref counter) = change_counter {
                         counter.fetch_add(1, Ordering::Relaxed);
diff --git a/src/server/conn/tests.rs b/src/server/conn/tests.rs
index 2f6c74e1..8a175eaf 100644
--- a/src/server/conn/tests.rs
+++ b/src/server/conn/tests.rs
@@ -31,7 +31,7 @@ fn test_inline_get_hit() {
     }
     let mut read_buf = BytesMut::from(&b"*2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n"[..]);
     let mut write_buf = BytesMut::new();
-    let aof_tx: Option<channel::MpscSender<AofMessage>> = None;
+    let aof_pool: Option<std::sync::Arc<crate::persistence::aof::AofWriterPool>> = None;
     let rt_config = make_rt_config();
 
     let result = try_inline_dispatch(
@@ -40,7 +40,8 @@ fn test_inline_get_hit() {
         &dbs,
         0,
         0,
-        &aof_tx,
+        &aof_pool,
+        &None,
         0,
         1,
         false,
@@ -56,7 +57,7 @@ fn test_inline_get_miss() {
     let dbs = make_dbs();
     let mut read_buf = BytesMut::from(&b"*2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n"[..]);
     let mut write_buf = BytesMut::new();
-    let aof_tx: Option<channel::MpscSender<AofMessage>> = None;
+    let aof_pool: Option<std::sync::Arc<crate::persistence::aof::AofWriterPool>> = None;
     let rt_config = make_rt_config();
 
     let result = try_inline_dispatch(
@@ -65,7 +66,8 @@ fn test_inline_get_miss() {
         &dbs,
         0,
         0,
-        &aof_tx,
+        &aof_pool,
+        &None,
         0,
         1,
         false,
@@ -84,7 +86,7 @@ fn test_inline_set_falls_through_when_writes_disabled() {
     let mut read_buf = BytesMut::from(&cmd[..]);
     let original_len = read_buf.len();
     let mut write_buf = BytesMut::new();
-    let aof_tx: Option<channel::MpscSender<AofMessage>> = None;
+    let aof_pool: Option<std::sync::Arc<crate::persistence::aof::AofWriterPool>> = None;
     let rt_config = make_rt_config();
 
     let result = try_inline_dispatch(
@@ -93,7 +95,8 @@ fn test_inline_set_falls_through_when_writes_disabled() {
         &dbs,
         0,
         0,
-        &aof_tx,
+        &aof_pool,
+        &None,
         0,
         1,
         false,
@@ -111,7 +114,7 @@ fn test_inline_set_executes_when_writes_enabled() {
     let cmd = b"*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n";
     let mut read_buf = BytesMut::from(&cmd[..]);
     let mut write_buf = BytesMut::new();
-    let aof_tx: Option<channel::MpscSender<AofMessage>> = None;
+    let aof_pool: Option<std::sync::Arc<crate::persistence::aof::AofWriterPool>> = None;
     let rt_config = make_rt_config();
 
     let result = try_inline_dispatch(
@@ -120,7 +123,8 @@ fn test_inline_set_executes_when_writes_enabled() {
         &dbs,
         0,
         0,
-        &aof_tx,
+        &aof_pool,
+        &None,
         0,
         1,
         true,
@@ -150,7 +154,7 @@ fn test_inline_set_with_options_falls_through() {
     let mut read_buf = BytesMut::from(&cmd[..]);
     let original_len = read_buf.len();
     let mut write_buf = BytesMut::new();
-    let aof_tx: Option<channel::MpscSender<AofMessage>> = None;
+    let aof_pool: Option<std::sync::Arc<crate::persistence::aof::AofWriterPool>> = None;
     let rt_config = make_rt_config();
 
     let result = try_inline_dispatch(
@@ -159,7 +163,8 @@ fn test_inline_set_with_options_falls_through() {
         &dbs,
         0,
         0,
-        &aof_tx,
+        &aof_pool,
+        &None,
         0,
         1,
         true,
@@ -176,7 +181,7 @@ fn test_inline_fallthrough() {
     let mut read_buf = BytesMut::from(&ping_cmd[..]);
     let original_len = read_buf.len();
     let mut write_buf = BytesMut::new();
-    let aof_tx: Option<channel::MpscSender<AofMessage>> = None;
+    let aof_pool: Option<std::sync::Arc<crate::persistence::aof::AofWriterPool>> = None;
     let rt_config = make_rt_config();
 
     let result = try_inline_dispatch(
@@ -185,7 +190,8 @@ fn test_inline_fallthrough() {
         &dbs,
         0,
         0,
-        &aof_tx,
+        &aof_pool,
+        &None,
         0,
         1,
         false,
@@ -211,7 +217,7 @@ fn test_inline_mixed_batch() {
     read_buf.extend_from_slice(b"*2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n");
     read_buf.extend_from_slice(b"*1\r\n$4\r\nPING\r\n");
     let mut write_buf = BytesMut::new();
-    let aof_tx: Option<channel::MpscSender<AofMessage>> = None;
+    let aof_pool: Option<std::sync::Arc<crate::persistence::aof::AofWriterPool>> = None;
     let rt_config = make_rt_config();
 
     // Inline loop should process GET but leave PING
@@ -221,7 +227,8 @@ fn test_inline_mixed_batch() {
         &dbs,
         0,
         0,
-        &aof_tx,
+        &aof_pool,
+        &None,
         0,
         1,
         false,
@@ -244,7 +251,7 @@ fn test_inline_case_insensitive() {
     }
     let mut read_buf = BytesMut::from(&b"*2\r\n$3\r\nget\r\n$3\r\nfoo\r\n"[..]);
     let mut write_buf = BytesMut::new();
-    let aof_tx: Option<channel::MpscSender<AofMessage>> = None;
+    let aof_pool: Option<std::sync::Arc<crate::persistence::aof::AofWriterPool>> = None;
     let rt_config = make_rt_config();
 
     let result = try_inline_dispatch(
@@ -253,7 +260,8 @@ fn test_inline_case_insensitive() {
         &dbs,
         0,
         0,
-        &aof_tx,
+        &aof_pool,
+        &None,
         0,
         1,
         false,
@@ -271,7 +279,7 @@ fn test_inline_partial() {
     let mut read_buf = BytesMut::from(&b"*2\r\n$3\r\nGET\r\n$3\r\n"[..]);
     let original_len = read_buf.len();
     let mut write_buf = BytesMut::new();
-    let aof_tx: Option<channel::MpscSender<AofMessage>> = None;
+    let aof_pool: Option<std::sync::Arc<crate::persistence::aof::AofWriterPool>> = None;
     let rt_config = make_rt_config();
 
     let result = try_inline_dispatch(
@@ -280,7 +288,8 @@ fn test_inline_partial() {
         &dbs,
         0,
         0,
-        &aof_tx,
+        &aof_pool,
+        &None,
         0,
         1,
         false,
@@ -296,7 +305,8 @@ fn test_inline_set_with_aof_falls_through_when_writes_disabled() {
     // SET falls through when can_inline_writes=false even with AOF.
     let dbs = make_dbs();
     let (aof_sender, _aof_receiver) = channel::mpsc_bounded::<AofMessage>(16);
-    let aof_tx: Option<channel::MpscSender<AofMessage>> = Some(aof_sender);
+    let aof_pool: Option<std::sync::Arc<crate::persistence::aof::AofWriterPool>> =
+        Some(crate::persistence::aof::AofWriterPool::top_level(aof_sender));
     let cmd = b"*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n";
     let mut read_buf = BytesMut::from(&cmd[..]);
     let original_len = read_buf.len();
@@ -310,7 +320,8 @@ fn test_inline_set_with_aof_falls_through_when_writes_disabled() {
         &dbs,
         0,
         0,
-        &aof_tx,
+        &aof_pool,
+        &None,
         0,
         1,
         false,
@@ -342,7 +353,7 @@ fn test_inline_multiple_gets() {
     read_buf.extend_from_slice(b"*2\r\n$3\r\nGET\r\n$1\r\na\r\n");
     read_buf.extend_from_slice(b"*2\r\n$3\r\nGET\r\n$1\r\nb\r\n");
     let mut write_buf = BytesMut::new();
-    let aof_tx: Option<channel::MpscSender<AofMessage>> = None;
+    let aof_pool: Option<std::sync::Arc<crate::persistence::aof::AofWriterPool>> = None;
     let rt_config = make_rt_config();
 
     let total = try_inline_dispatch_loop(
@@ -351,7 +362,8 @@ fn test_inline_multiple_gets() {
         &dbs,
         0,
         0,
-        &aof_tx,
+        &aof_pool,
+        &None,
         0,
         1,
         false,
diff --git a/src/server/conn_state.rs b/src/server/conn_state.rs
index 674857db..d2946f36 100644
--- a/src/server/conn_state.rs
+++ b/src/server/conn_state.rs
@@ -8,7 +8,7 @@ use ringbuf::HeapProd;
 use crate::blocking::BlockingRegistry;
 use crate::cluster::ClusterState;
 use crate::config::{RuntimeConfig, ServerConfig};
-use crate::persistence::aof::AofMessage;
+use crate::persistence::aof::AofWriterPool;
 use crate::protocol::Frame;
 use crate::pubsub::PubSubRegistry;
 use crate::replication::state::ReplicationState;
@@ -38,7 +38,12 @@ pub struct ConnectionContext {
     pub blocking_registry: Rc<RefCell<BlockingRegistry>>,
     pub shutdown: CancellationToken,
     pub requirepass: Option<String>,
-    pub aof_tx: Option<channel::MpscSender<AofMessage>>,
+    /// Per-shard AOF writer pool. **Sole AOF interface** after the
+    /// 2d/2e migration sequence. Built by spawn sites in `shard/conn_accept.rs`
+    /// from the manifest layout — TopLevel wraps a single sender,
+    /// PerShard owns one sender per shard. Step 2f flips spawn sites
+    /// to construct PerShard pools when the on-disk manifest demands it.
+    pub aof_pool: Option<Arc<AofWriterPool>>,
     pub tracking_table: Rc<RefCell<TrackingTable>>,
     pub repl_state: Option<Arc<RwLock<ReplicationState>>>,
     pub cluster_state: Option<Arc<RwLock<ClusterState>>>,
diff --git a/src/server/embedded.rs b/src/server/embedded.rs
index 55a2b3fd..e1620fa3 100644
--- a/src/server/embedded.rs
+++ b/src/server/embedded.rs
@@ -48,7 +48,7 @@ use parking_lot::RwLock;
 use tracing::info;
 
 use crate::config::ServerConfig;
-use crate::persistence::aof::{self, AofMessage, FsyncPolicy};
+use crate::persistence::aof::{self, AofMessage, AofWriterPool, FsyncPolicy};
 use crate::runtime::cancel::CancellationToken;
 use crate::runtime::channel;
 use crate::runtime::{RuntimeFactoryImpl, traits::RuntimeFactory};
@@ -113,9 +113,12 @@ pub async fn run_embedded(
     // AOF writer: dedicated std::thread (matches main.rs lifetime model).
     // We retain the JoinHandle so shutdown can wait for the writer to finish
     // flushing — dropping it would race the process exit and risk losing the
-    // final fsync (CodeRabbit #1).
-    let (aof_tx, aof_join): (
-        Option<channel::MpscSender<AofMessage>>,
+    // final fsync (CodeRabbit #1). Step 2f-α: wrap the sender in a TopLevel
+    // `AofWriterPool` so every shard receives an Arc clone with a uniform
+    // API. Channel close still drives writer termination — dropping the last
+    // Arc drops the pool, which drops the underlying senders.
+    let (aof_pool, aof_join): (
+        Option<Arc<AofWriterPool>>,
         Option<std::thread::JoinHandle<()>>,
     ) = if config.appendonly == "yes" {
         let (tx, rx) = channel::mpsc_bounded::<AofMessage>(10_000);
@@ -132,7 +135,7 @@ pub async fn run_embedded(
             })
             .context("embedded moon: failed to spawn AOF writer thread")?;
         info!("embedded moon: AOF enabled (fsync: {:?})", fsync);
-        (Some(tx), Some(handle))
+        (Some(AofWriterPool::top_level(tx)), Some(handle))
     } else {
         (None, None)
     };
@@ -263,7 +266,7 @@ pub async fn run_embedded(
         let consumers = mesh.take_consumers(id);
         let conn_rx = mesh.take_conn_rx(id);
         let shard_cancel = cancel.clone();
-        let shard_aof_tx = aof_tx.clone();
+        let shard_aof_pool = aof_pool.clone();
         let shard_bind_addr = bind_addr.clone();
         let shard_persistence_dir = persistence_dir.clone();
         let shard_snap_rx = snap_rx.clone();
@@ -293,7 +296,7 @@ pub async fn run_embedded(
                                 consumers,
                                 producers,
                                 shard_cancel,
-                                shard_aof_tx,
+                                shard_aof_pool,
                                 Some(shard_bind_addr),
                                 shard_persistence_dir,
                                 shard_snap_rx,
@@ -354,10 +357,12 @@ pub async fn run_embedded(
 
     // Listener exited (cancel fired or fatal error). Shutdown ordering:
     //   1. cancel.cancel() — stops shard accept loops + producers.
-    //   2. Join shard threads — drops every shard-held `aof_tx` clone.
-    //   3. Drop our outer `aof_tx` — last sender goes away, the AOF writer's
-    //      `recv_async()` returns `Err(_)` and the task flushes + fsyncs
-    //      before exiting (see `aof::aof_writer_task` Err arm).
+    //   2. Join shard threads — drops every shard-held `Arc<AofWriterPool>`
+    //      clone (each `ConnectionContext` and each `shard_aof_pool` capture).
+    //   3. Drop our outer `aof_pool` Arc — the last reference goes away, the
+    //      pool's `Drop` runs, dropping the underlying `Vec<MpscSender>`. The
+    //      AOF writer's `recv_async()` returns `Err(_)` and the task flushes
+    //      + fsyncs before exiting (see `aof::aof_writer_task` Err arm).
     //   4. Join the AOF thread.
     //
     // This sequencing fixes Qodo bug #5: sending `AofMessage::Shutdown` before
@@ -381,8 +386,9 @@ pub async fn run_embedded(
             }
         }
 
-        // Drop the last AOF sender so the writer's recv loop sees channel close.
-        drop(aof_tx);
+        // Drop the last `Arc<AofWriterPool>` so the underlying senders close,
+        // triggering the writer's recv loop to drain + fsync + exit.
+        drop(aof_pool);
 
         let aof_panic = if let Some(handle) = aof_join {
             match handle.join() {
diff --git a/src/server/listener.rs b/src/server/listener.rs
index 607aef54..37300cf2 100644
--- a/src/server/listener.rs
+++ b/src/server/listener.rs
@@ -13,7 +13,7 @@ use tracing::{debug, error, info};
 use crate::command::connection as conn_cmd;
 use crate::config::ServerConfig;
 #[cfg(feature = "runtime-tokio")]
-use crate::persistence::aof::{self, AofMessage, FsyncPolicy};
+use crate::persistence::aof::{self, AofMessage, AofWriterPool, FsyncPolicy};
 #[cfg(feature = "runtime-tokio")]
 use crate::persistence::rdb;
 #[cfg(feature = "runtime-tokio")]
@@ -115,15 +115,17 @@ pub async fn run_with_shutdown(
 
     let config = Arc::new(config);
 
-    // Set up AOF writer task if appendonly is enabled
-    let aof_tx: Option<channel::MpscSender<AofMessage>> = if config.appendonly == "yes" {
+    // Set up AOF writer task + wrap the sender as a TopLevel `AofWriterPool`
+    // (step 2f-α). Single-shard tokio path: `handler_single` receives the pool
+    // directly; per-shard layout is unreachable from this listener.
+    let aof_pool: Option<Arc<AofWriterPool>> = if config.appendonly == "yes" {
         let (tx, rx) = channel::mpsc_bounded::<AofMessage>(10_000);
         let aof_token = token.child_token();
         let fsync = FsyncPolicy::from_str(&config.appendfsync);
         let aof_file_path = PathBuf::from(&config.dir).join(&config.appendfilename);
         tokio::spawn(aof::aof_writer_task(rx, aof_file_path, fsync, aof_token));
         info!("AOF enabled with fsync policy: {:?}", fsync);
-        Some(tx)
+        Some(AofWriterPool::top_level(tx))
     } else {
         None
     };
@@ -235,7 +237,7 @@ pub async fn run_with_shutdown(
                         let conn_token = token.child_token();
                         let requirepass = config.requirepass.clone();
                         let config = config.clone();
-                        let aof_tx = aof_tx.clone();
+                        let aof_pool_conn = aof_pool.clone();
                         let change_counter = Some(change_counter.clone());
                         let pubsub = pubsub_registry.clone();
                         let rt_config = runtime_config.clone();
@@ -249,7 +251,7 @@ pub async fn run_with_shutdown(
                         let gs = graph_store.clone();
                         tokio::spawn(connection::handle_connection(
                             stream, db, conn_token, requirepass, config,
-                            aof_tx, change_counter, pubsub, rt_config,
+                            aof_pool_conn, change_counter, pubsub, rt_config,
                             tracking, cid, Some(rs), acl, Some(vs),
                             Some(ts),
                             #[cfg(feature = "graph")]
@@ -263,9 +265,12 @@ pub async fn run_with_shutdown(
             }
             _ = token.cancelled() => {
                 info!("Server shutting down");
-                // Send shutdown to AOF writer
-                if let Some(ref tx) = aof_tx {
-                    let _ = tx.send_async(AofMessage::Shutdown).await;
+                // Fan out shutdown to every AOF writer (single writer under TopLevel,
+                // one-per-shard under PerShard — step 2f-β). `broadcast_shutdown`
+                // is `try_send`-based so the writer must be draining; under tokio
+                // listener it always is.
+                if let Some(ref pool) = aof_pool {
+                    pool.broadcast_shutdown();
                 }
                 break;
             }
diff --git a/src/shard/conn_accept.rs b/src/shard/conn_accept.rs
index 432d65aa..006fc885 100644
--- a/src/shard/conn_accept.rs
+++ b/src/shard/conn_accept.rs
@@ -104,7 +104,7 @@ pub(crate) fn spawn_tokio_connection(
     pubsub_arc: &Arc<parking_lot::RwLock<PubSubRegistry>>,
     blocking_rc: &Rc<RefCell<BlockingRegistry>>,
     shutdown: &CancellationToken,
-    aof_tx: &Option<channel::MpscSender<crate::persistence::aof::AofMessage>>,
+    aof_pool: &Option<Arc<crate::persistence::aof::AofWriterPool>>,
     tracking_rc: &Rc<RefCell<TrackingTable>>,
     lua_rc: &Rc<RefCell<Option<Rc<mlua::Lua>>>>,
     script_cache_rc: &Rc<RefCell<crate::scripting::ScriptCache>>,
@@ -134,7 +134,6 @@ pub(crate) fn spawn_tokio_connection(
     let psr = pubsub_arc.clone();
     let blk = blocking_rc.clone();
     let sd = shutdown.clone();
-    let aof = aof_tx.clone();
     let trk = tracking_rc.clone();
     let cid = conn_cmd::next_client_id();
     let rs = repl_state.clone();
@@ -170,7 +169,10 @@ pub(crate) fn spawn_tokio_connection(
         set_tcp_keepalive(tcp_stream.as_raw_fd(), tcp_keepalive_secs);
     }
 
-    // Construct ConnectionContext from cloned shared state
+    // Construct ConnectionContext from cloned shared state. The pool is
+    // already built by the spawn site (main.rs / listener.rs / embedded.rs)
+    // and threaded through `Shard::run` — we just clone the Arc once here.
+    let pool_for_ctx = aof_pool.as_ref().map(Arc::clone);
     let conn_ctx = crate::server::conn::ConnectionContext::new(
         sdbs,
         shard_id,
@@ -178,7 +180,7 @@ pub(crate) fn spawn_tokio_connection(
         psr,
         blk,
         reqpass,
-        aof,
+        pool_for_ctx,
         trk,
         rs,
         cs,
@@ -272,7 +274,7 @@ pub(crate) fn spawn_migrated_tokio_connection(
     pubsub_arc: &Arc<parking_lot::RwLock<PubSubRegistry>>,
     blocking_rc: &Rc<RefCell<BlockingRegistry>>,
     shutdown: &CancellationToken,
-    aof_tx: &Option<channel::MpscSender<crate::persistence::aof::AofMessage>>,
+    aof_pool: &Option<Arc<crate::persistence::aof::AofWriterPool>>,
     tracking_rc: &Rc<RefCell<TrackingTable>>,
     lua_rc: &Rc<RefCell<Option<Rc<mlua::Lua>>>>,
     script_cache_rc: &Rc<RefCell<crate::scripting::ScriptCache>>,
@@ -326,7 +328,6 @@ pub(crate) fn spawn_migrated_tokio_connection(
             let psr = pubsub_arc.clone();
             let blk = blocking_rc.clone();
             let sd = shutdown.clone();
-            let aof = aof_tx.clone();
             let trk = tracking_rc.clone();
             let cid = state.client_id;
             let rs = repl_state.clone();
@@ -354,6 +355,8 @@ pub(crate) fn spawn_migrated_tokio_connection(
 
             let migration_buf = take_migration_read_buf(&mut state);
 
+            // Pool is built by the spawn site and threaded through here.
+            let pool_for_ctx = aof_pool.as_ref().map(Arc::clone);
             let conn_ctx = crate::server::conn::ConnectionContext::new(
                 sdbs,
                 shard_id,
@@ -361,7 +364,7 @@ pub(crate) fn spawn_migrated_tokio_connection(
                 psr,
                 blk,
                 None, // requirepass: None = pre-authenticated
-                aof,
+                pool_for_ctx,
                 trk,
                 rs,
                 cs,
@@ -421,7 +424,7 @@ pub(crate) fn spawn_monoio_connection(
     pubsub_arc: &Arc<parking_lot::RwLock<PubSubRegistry>>,
     blocking_rc: &Rc<RefCell<BlockingRegistry>>,
     shutdown: &CancellationToken,
-    aof_tx: &Option<channel::MpscSender<crate::persistence::aof::AofMessage>>,
+    aof_pool: &Option<Arc<crate::persistence::aof::AofWriterPool>>,
     tracking_rc: &Rc<RefCell<TrackingTable>>,
     lua_rc: &Rc<RefCell<Option<Rc<mlua::Lua>>>>,
     script_cache_rc: &Rc<RefCell<crate::scripting::ScriptCache>>,
@@ -464,7 +467,6 @@ pub(crate) fn spawn_monoio_connection(
             let psr = pubsub_arc.clone();
             let blk = blocking_rc.clone();
             let sd = shutdown.clone();
-            let aof = aof_tx.clone();
             let trk = tracking_rc.clone();
             let cid = conn_cmd::next_client_id();
             let rs = repl_state.clone();
@@ -500,12 +502,14 @@ pub(crate) fn spawn_monoio_connection(
                 .map(|a| a.to_string())
                 .unwrap_or_else(|_| "unknown".to_string());
 
-            // Construct ConnectionContext from cloned shared state
+            // Construct ConnectionContext from cloned shared state. Pool is
+            // built by the spawn site and threaded through here.
             let reqpass = rtcfg.read().requirepass.clone();
+            let pool_for_ctx = aof_pool.as_ref().map(Arc::clone);
             let conn_ctx = crate::server::conn::ConnectionContext::new(
-                sdbs, shard_id, num_shards, psr, blk, reqpass, aof, trk, rs, cs, lua, sc, cp, acl,
-                rtcfg, scfg, dtx, notifiers, snap_tx, clk, rsm, all_regs, all_rsm, aff, spill_tx,
-                spill_fid, do_dir,
+                sdbs, shard_id, num_shards, psr, blk, reqpass, pool_for_ctx, trk, rs, cs, lua,
+                sc, cp, acl, rtcfg, scfg, dtx, notifiers, snap_tx, clk, rsm, all_regs, all_rsm,
+                aff, spill_tx, spill_fid, do_dir,
             );
 
             let maxclients = conn_ctx.runtime_config.read().maxclients;
@@ -713,7 +717,7 @@ pub(crate) fn spawn_migrated_monoio_connection(
     pubsub_arc: &Arc<parking_lot::RwLock<PubSubRegistry>>,
     blocking_rc: &Rc<RefCell<BlockingRegistry>>,
     shutdown: &CancellationToken,
-    aof_tx: &Option<channel::MpscSender<crate::persistence::aof::AofMessage>>,
+    aof_pool: &Option<Arc<crate::persistence::aof::AofWriterPool>>,
     tracking_rc: &Rc<RefCell<TrackingTable>>,
     lua_rc: &Rc<RefCell<Option<Rc<mlua::Lua>>>>,
     script_cache_rc: &Rc<RefCell<crate::scripting::ScriptCache>>,
@@ -766,7 +770,6 @@ pub(crate) fn spawn_migrated_monoio_connection(
             let psr = pubsub_arc.clone();
             let blk = blocking_rc.clone();
             let sd = shutdown.clone();
-            let aof = aof_tx.clone();
             let trk = tracking_rc.clone();
             let cid = state.client_id;
             let rs = repl_state.clone();
@@ -802,11 +805,13 @@ pub(crate) fn spawn_migrated_monoio_connection(
 
             let migration_buf = take_migration_read_buf(&mut state);
 
+            // Pool is built by the spawn site and threaded through here.
+            let pool_for_ctx = aof_pool.as_ref().map(Arc::clone);
             let conn_ctx = crate::server::conn::ConnectionContext::new(
                 sdbs, shard_id, num_shards, psr, blk,
                 None, // requirepass: None = pre-authenticated
-                aof, trk, rs, cs, lua, sc, cp, acl, rtcfg, scfg, dtx, notifiers, snap_tx, clk, rsm,
-                all_regs, all_rsm, aff, spill_tx, spill_fid, do_dir,
+                pool_for_ctx, trk, rs, cs, lua, sc, cp, acl, rtcfg, scfg, dtx, notifiers,
+                snap_tx, clk, rsm, all_regs, all_rsm, aff, spill_tx, spill_fid, do_dir,
             );
 
             monoio::spawn(async move {
diff --git a/src/shard/event_loop.rs b/src/shard/event_loop.rs
index 2b9de27d..5198d930 100644
--- a/src/shard/event_loop.rs
+++ b/src/shard/event_loop.rs
@@ -56,7 +56,7 @@ impl super::Shard {
         mut consumers: Vec<HeapCons<ShardMessage>>,
         producers: Vec<HeapProd<ShardMessage>>,
         shutdown: CancellationToken,
-        aof_tx: Option<channel::MpscSender<crate::persistence::aof::AofMessage>>,
+        aof_pool: Option<Arc<crate::persistence::aof::AofWriterPool>>,
         bind_addr: Option<String>,
         persistence_dir: Option<String>,
         snapshot_trigger_rx: channel::WatchReceiver<u64>,
@@ -1047,7 +1047,7 @@ impl super::Shard {
                             conn_accept::spawn_tokio_connection(
                                 tcp_stream, false, &tls_config,
                                 &shard_databases, &dispatch_tx, &pubsub_arc, &blocking_rc,
-                                &shutdown, &aof_tx, &tracking_rc, &lua_rc, &script_cache_rc,
+                                &shutdown, &aof_pool, &tracking_rc, &lua_rc, &script_cache_rc,
                                 &acl_table, &runtime_config, &server_config, &all_notifiers,
                                 &snapshot_trigger_tx, &repl_state, &cluster_state,
                                 &cached_clock, &remote_sub_map_arc, &all_pubsub_registries,
@@ -1097,7 +1097,7 @@ impl super::Shard {
                             conn_accept::spawn_tokio_connection(
                                 tcp_stream, is_tls, &tls_config,
                                 &shard_databases, &dispatch_tx, &pubsub_arc, &blocking_rc,
-                                &shutdown, &aof_tx, &tracking_rc, &lua_rc, &script_cache_rc,
+                                &shutdown, &aof_pool, &tracking_rc, &lua_rc, &script_cache_rc,
                                 &acl_table, &runtime_config, &server_config, &all_notifiers,
                                 &snapshot_trigger_tx, &repl_state, &cluster_state,
                                 &cached_clock, &remote_sub_map_arc, &all_pubsub_registries,
@@ -1180,7 +1180,7 @@ impl super::Shard {
                             conn_accept::spawn_migrated_tokio_connection(
                                 fd, state,
                                 &shard_databases, &dispatch_tx, &pubsub_arc, &blocking_rc,
-                                &shutdown, &aof_tx, &tracking_rc, &lua_rc, &script_cache_rc,
+                                &shutdown, &aof_pool, &tracking_rc, &lua_rc, &script_cache_rc,
                                 &acl_table, &runtime_config, &server_config, &all_notifiers,
                                 &snapshot_trigger_tx, &repl_state, &cluster_state,
                                 &cached_clock, &remote_sub_map_arc, &all_pubsub_registries,
@@ -1193,7 +1193,7 @@ impl super::Shard {
                             conn_accept::spawn_migrated_monoio_connection(
                                 fd, state,
                                 &shard_databases, &dispatch_tx, &pubsub_arc, &blocking_rc,
-                                &shutdown, &aof_tx, &tracking_rc, &lua_rc, &script_cache_rc,
+                                &shutdown, &aof_pool, &tracking_rc, &lua_rc, &script_cache_rc,
                                 &acl_table, &runtime_config, &server_config, &all_notifiers,
                                 &snapshot_trigger_tx, &repl_state, &cluster_state,
                                 &cached_clock, &remote_sub_map_arc, &all_pubsub_registries,
@@ -1277,7 +1277,7 @@ impl super::Shard {
                             conn_accept::spawn_migrated_tokio_connection(
                                 fd, state,
                                 &shard_databases, &dispatch_tx, &pubsub_arc, &blocking_rc,
-                                &shutdown, &aof_tx, &tracking_rc, &lua_rc, &script_cache_rc,
+                                &shutdown, &aof_pool, &tracking_rc, &lua_rc, &script_cache_rc,
                                 &acl_table, &runtime_config, &server_config, &all_notifiers,
                                 &snapshot_trigger_tx, &repl_state, &cluster_state,
                                 &cached_clock, &remote_sub_map_arc, &all_pubsub_registries,
@@ -1290,7 +1290,7 @@ impl super::Shard {
                             conn_accept::spawn_migrated_monoio_connection(
                                 fd, state,
                                 &shard_databases, &dispatch_tx, &pubsub_arc, &blocking_rc,
-                                &shutdown, &aof_tx, &tracking_rc, &lua_rc, &script_cache_rc,
+                                &shutdown, &aof_pool, &tracking_rc, &lua_rc, &script_cache_rc,
                                 &acl_table, &runtime_config, &server_config, &all_notifiers,
                                 &snapshot_trigger_tx, &repl_state, &cluster_state,
                                 &cached_clock, &remote_sub_map_arc, &all_pubsub_registries,
@@ -1697,7 +1697,7 @@ impl super::Shard {
                         &pubsub_arc,
                         &blocking_rc,
                         &shutdown,
-                        &aof_tx,
+                        &aof_pool,
                         &tracking_rc,
                         &lua_rc,
                         &script_cache_rc,
@@ -1739,7 +1739,7 @@ impl super::Shard {
                     &pubsub_arc,
                     &blocking_rc,
                     &shutdown,
-                    &aof_tx,
+                    &aof_pool,
                     &tracking_rc,
                     &lua_rc,
                     &script_cache_rc,
@@ -1921,7 +1921,7 @@ impl super::Shard {
                         &pubsub_arc,
                         &blocking_rc,
                         &shutdown,
-                        &aof_tx,
+                        &aof_pool,
                         &tracking_rc,
                         &lua_rc,
                         &script_cache_rc,
diff --git a/tests/ft_search_multi_shard_as_of.rs b/tests/ft_search_multi_shard_as_of.rs
index c8f8fef1..c2f93f11 100644
--- a/tests/ft_search_multi_shard_as_of.rs
+++ b/tests/ft_search_multi_shard_as_of.rs
@@ -46,6 +46,7 @@ fn build_config(port: u16, num_shards: usize) -> ServerConfig {
         databases: 16,
         requirepass: None,
         appendonly: "no".to_string(),
+        unsafe_multishard_aof: false,
         appendfsync: "everysec".to_string(),
         save: None,
         dir: ".".to_string(),
diff --git a/tests/ft_search_temporal_parity.rs b/tests/ft_search_temporal_parity.rs
index 065de5f9..1165f81a 100644
--- a/tests/ft_search_temporal_parity.rs
+++ b/tests/ft_search_temporal_parity.rs
@@ -58,6 +58,7 @@ fn build_config(port: u16, num_shards: usize) -> ServerConfig {
         databases: 16,
         requirepass: None,
         appendonly: "no".to_string(),
+        unsafe_multishard_aof: false,
         appendfsync: "everysec".to_string(),
         save: None,
         dir: ".".to_string(),
diff --git a/tests/integration.rs b/tests/integration.rs
index 1e6b51c8..d3c920ee 100644
--- a/tests/integration.rs
+++ b/tests/integration.rs
@@ -32,6 +32,7 @@ async fn start_server() -> (u16, CancellationToken) {
         databases: 16,
         requirepass: None,
         appendonly: "no".to_string(),
+        unsafe_multishard_aof: false,
         appendfsync: "everysec".to_string(),
         save: None,
         dir: ".".to_string(),
@@ -131,6 +132,7 @@ async fn start_server_with_pass(password: &str) -> (u16, CancellationToken) {
         databases: 16,
         requirepass: Some(password.to_string()),
         appendonly: "no".to_string(),
+        unsafe_multishard_aof: false,
         appendfsync: "everysec".to_string(),
         save: None,
         dir: ".".to_string(),
@@ -1302,6 +1304,7 @@ async fn start_server_with_persistence(
         databases: 16,
         requirepass: None,
         appendonly: appendonly.to_string(),
+        unsafe_multishard_aof: false,
         appendfsync: appendfsync.to_string(),
         save: None,
         dir: dir.to_string_lossy().to_string(),
@@ -2185,6 +2188,7 @@ async fn start_server_with_maxmemory(maxmemory: usize, policy: &str) -> (u16, Ca
         databases: 16,
         requirepass: None,
         appendonly: "no".to_string(),
+        unsafe_multishard_aof: false,
         appendfsync: "everysec".to_string(),
         save: None,
         dir: ".".to_string(),
@@ -2595,6 +2599,7 @@ async fn start_sharded_server(num_shards: usize) -> (u16, CancellationToken) {
         databases: 16,
         requirepass: None,
         appendonly: "no".to_string(),
+        unsafe_multishard_aof: false,
         appendfsync: "everysec".to_string(),
         save: None,
         dir: ".".to_string(),
@@ -3785,6 +3790,7 @@ async fn start_cluster_server() -> (u16, CancellationToken) {
         databases: 16,
         requirepass: None,
         appendonly: "no".to_string(),
+        unsafe_multishard_aof: false,
         appendfsync: "everysec".to_string(),
         save: None,
         dir: ".".to_string(),
@@ -4446,6 +4452,7 @@ async fn start_server_with_aclfile(acl_path: &str) -> (u16, CancellationToken) {
         databases: 16,
         requirepass: None,
         appendonly: "no".to_string(),
+        unsafe_multishard_aof: false,
         appendfsync: "everysec".to_string(),
         save: None,
         dir: ".".to_string(),
diff --git a/tests/kill_snapshot.rs b/tests/kill_snapshot.rs
index 0a96c81f..4141b13c 100644
--- a/tests/kill_snapshot.rs
+++ b/tests/kill_snapshot.rs
@@ -28,6 +28,7 @@ fn base_config(port: u16, num_shards: usize) -> ServerConfig {
         databases: 1,
         requirepass: None,
         appendonly: "no".to_string(),
+        unsafe_multishard_aof: false,
         appendfsync: "no".to_string(),
         save: None,
         dir: ".".to_string(),
diff --git a/tests/mq_integration.rs b/tests/mq_integration.rs
index 6423a8e8..0ed42b5d 100644
--- a/tests/mq_integration.rs
+++ b/tests/mq_integration.rs
@@ -44,6 +44,7 @@ async fn start_mq_server(num_shards: usize) -> (u16, CancellationToken) {
         databases: 16,
         requirepass: None,
         appendonly: "no".to_string(),
+        unsafe_multishard_aof: false,
         appendfsync: "everysec".to_string(),
         save: None,
         dir: ".".to_string(),
diff --git a/tests/replication_test.rs b/tests/replication_test.rs
index 3e36686c..634ee733 100644
--- a/tests/replication_test.rs
+++ b/tests/replication_test.rs
@@ -30,6 +30,7 @@ async fn start_server() -> (u16, CancellationToken) {
         databases: 16,
         requirepass: None,
         appendonly: "no".to_string(),
+        unsafe_multishard_aof: false,
         appendfsync: "everysec".to_string(),
         save: None,
         dir: dir_path,
diff --git a/tests/txn_ft_search_snapshot.rs b/tests/txn_ft_search_snapshot.rs
index f442afe1..edbdc710 100644
--- a/tests/txn_ft_search_snapshot.rs
+++ b/tests/txn_ft_search_snapshot.rs
@@ -46,6 +46,7 @@ fn build_config(port: u16, num_shards: usize) -> ServerConfig {
         databases: 16,
         requirepass: None,
         appendonly: "no".to_string(),
+        unsafe_multishard_aof: false,
         appendfsync: "everysec".to_string(),
         save: None,
         dir: ".".to_string(),
diff --git a/tests/txn_kv_wiring.rs b/tests/txn_kv_wiring.rs
index 4a7ec97c..00921f8e 100644
--- a/tests/txn_kv_wiring.rs
+++ b/tests/txn_kv_wiring.rs
@@ -49,6 +49,7 @@ async fn start_txn_server(num_shards: usize, persistence_dir: &str) -> (u16, Can
         databases: 16,
         requirepass: None,
         appendonly,
+        unsafe_multishard_aof: false,
         appendfsync: "everysec".to_string(),
         save: None,
         dir,
diff --git a/tests/vacuum_commands.rs b/tests/vacuum_commands.rs
index 4b37fc59..2d861220 100644
--- a/tests/vacuum_commands.rs
+++ b/tests/vacuum_commands.rs
@@ -35,6 +35,7 @@ fn base_config(port: u16) -> ServerConfig {
         databases: 1,
         requirepass: None,
         appendonly: "no".to_string(),
+        unsafe_multishard_aof: false,
         appendfsync: "no".to_string(),
         save: None,
         dir: ".".to_string(),
diff --git a/tests/workspace_integration.rs b/tests/workspace_integration.rs
index fb833edd..08e1e0c2 100644
--- a/tests/workspace_integration.rs
+++ b/tests/workspace_integration.rs
@@ -37,6 +37,7 @@ async fn start_workspace_server(num_shards: usize) -> (u16, CancellationToken) {
         databases: 16,
         requirepass: None,
         appendonly: "no".to_string(),
+        unsafe_multishard_aof: false,
         appendfsync: "everysec".to_string(),
         save: None,
         dir: ".".to_string(),
@@ -254,6 +255,7 @@ async fn start_workspace_server_with_auth(
         databases: 16,
         requirepass: Some(password),
         appendonly: "no".to_string(),
+        unsafe_multishard_aof: false,
         appendfsync: "everysec".to_string(),
         save: None,
         dir: ".".to_string(),