diff --git a/CHANGELOG.md b/CHANGELOG.md
index 1be8c9d0..9a6e1798 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -6,10 +6,36 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## [0.2.0-alpha] — Unreleased
-First slice of the v0.2 "Option C" beachhead: **Point-in-Time Recovery (PITR)**
-and **Change Data Capture (CDC)**, built additively on top of the existing
-per-shard WAL v3 + dual-root manifest. No changes to the KV hot path, MVCC,
-page format, or transaction layer.
+The v0.2 enterprise beachhead. Built additively on per-shard WAL v3 + the
+dual-root manifest; no changes to the KV hot path, MVCC, page format, or
+transaction layer.
+
+**Headline capabilities landed in alpha:**
+
+- **Point-in-Time Recovery (PITR)** — `--recovery-target-lsn` /
+ `--recovery-target-time` restore to any LSN or wall-clock boundary
+ inside the WAL retention window.
+- **Change Data Capture (CDC)** — `CDC.READ` polling command with
+ Debezium-compatible JSON envelopes, resumable cursors, segment-rotation
+ safety.
+- **Hash-field TTL** — full Valkey 9.0 / 9.1 surface (`HEXPIRE` /
+ `HPEXPIRE` / `HEXPIREAT` / `HPEXPIREAT` / `HEXPIRETIME` / `HPEXPIRETIME`
+ / `HTTL` / `HPTTL` / `HPERSIST` / `HGETDEL` / `HGETEX`) with O(1) HGET +
+ HLEN fast path. Three-way benchmark vs Redis 8.0.2 / Valkey 9.1.0
+ ships in `docs/perf/2026-05-27-hash-ttl-3way-bench.md`.
+- **Tier 2 Lane A** — `SWAPDB`, `MOVE`, `COPY ... DB n`,
+ `CLUSTER REPLICAS` / `SLAVES`, `CLUSTER COUNT-FAILURE-REPORTS`. All
+ WAL-durable with cross-shard atomic semantics.
+- **Storage format v1 commitment** — RDB v2 + WAL v3 + multi-part AOF
+ manifest grouped under a single `--storage-format v1` umbrella with
+ ≥18-month LTS forward-read guarantees.
+- **Embedded sharded server** — `server::embedded::run_embedded(config,
+ cancel)` exposes the full sharded handler (with `TXN.*`) to in-process
+ embedders.
+
+**What is not yet in alpha:** PITR live-snapshot LSN wiring (P3c),
+`CDC.SUBSCRIBE` push channel (C3b), and the multi-shard master PSYNC
+deferred from v0.1.10. Tracked in `.planning/rfcs/v02-enterprise-architecture.md`.
### Docs — Hash-field TTL three-way benchmark suite (PR #127)
@@ -427,6 +453,96 @@ See `docs/guides/cdc.md` for consumer integration.
and the benchmark gates (PITR restart ±10%, CDC ≥100K events/s/shard,
write p99 ±5%).
+## [0.1.12] — 2026-05-12
+
+Performance & memory observability release. 50 commits since v0.1.11, no
+public API breaks, no on-disk format change. Validated on OrbStack `moon-dev`
+(2026-05-12) and locally green for both `runtime-monoio` and
+`runtime-tokio,jemalloc`.
+
+### Performance — DashTable hot-path (Phase 189, PERF-07 + PERF-09)
+
+- **Pre-sized DashTable.** `DashTable::with_capacity()` plus the new
+ `--initial-keyspace-hint ` flag size the segment array up front so
+ steady-state operation hits zero `split_segment` calls. Pre-size
+ invariant test confirms zero splits at 1 M keys. The 27 % CPU spent
+ in `split_segment` during SET p=16 (PERF-07) is fully eliminated.
+- **`Database::set` rewrite.** New `DashTable::insert_or_update` /
+ `Segment::insert_or_update_at` single-probe helpers replace the
+ previous `find + remove + insert` triple-probe pattern.
+- **`Segment::find` fallback elimination + force-inlined SIMD.** The
+ cold "key spilled to non-home group" fallback path is removed once
+ `has_non_home_keys` is invariant-tracked on insert (the
+ `insert_or_update_at` change above already maintains the flag);
+ the SIMD probe helpers are `#[inline(always)]`. PERF-09
+ attributed 12.65 % of `Segment::find` self-time to the fallback;
+ remaining cost is the irreducible per-hit `memcmp` confirm
+ (threshold amended to <3 %). 1 M-key correctness gate validates
+ zero false positives/negatives.
+
+### Performance — Memory observability (Phase 190)
+
+- **`moon_memory_bytes{kind=…}` Prometheus gauge.** Seven subsystem
+ labels — `dashtable`, `hnsw`, `csr`, `wal`, `sealed_replication_backlog`,
+ `allocator_overhead`, and the rolled-up `total`. Updated every
+ scrape via a single hook so the sum reconciles to `RSS` within the
+ CI tolerance window.
+- **`MEMORY DOCTOR` full schema.** Multi-line RESP response covering
+ every subsystem, the rolled-up total, and a derived `allocator_overhead`
+ pseudo-kind (RSS − Σ subsystems). Adds operator triage signal beyond
+ the legacy single-line summary.
+- **`resident_bytes()` trait** implemented across `Database`,
+ `DashTable`, `VectorStore` (HNSW + IVF), `GraphStore` (CSR + SlotMap),
+ `WalWriter`, `ReplicationBacklog` (sealed-segment side), and
+ `AllocatorOverhead`. Zero-allocation, on-demand poll.
+- **Memory steady-state CI job.** `scripts/bench-memory-steady-state.sh`
+ + baseline fixture; gate widened to `±10 %` on RSS / Σ ratio after a
+ Linux-CI tolerance pass.
+
+### Changed — Allocator UX (Phase 191)
+
+- **jemalloc `narenas:8` cap** with `--memory-arenas-cap ` CLI
+ override. Caps the per-CPU arena explosion that inflates VSZ on
+ high-core hosts; mostly a cosmetic fix on Linux containers but
+ produces a meaningfully tighter `top`/`ps` reading for operators.
+- **Tri-state allocator selection.** New `mimalloc-alt` cargo
+ feature alongside the existing `jemalloc` / `mimalloc` (fallback)
+ paths; mutually exclusive at compile time. A/B benchmark script
+ `scripts/bench-allocator-ab.sh` ships with the release.
+- **`docs/OPERATOR-GUIDE.md` — Memory Accounting section.** Documents
+ the VSZ-vs-RSS distinction, MEMORY DOCTOR field-by-field, and the
+ `--memory-arenas-cap` / `mimalloc-alt` tuning knobs.
+
+### Added — Dispatch Observability (Phase 177)
+
+- **`moon_dispatch_path_total{path=...}` Prometheus counter**: four-way classification of every command by shard-routing decision — `local_inline` (SIMD fast path), `local` (standard local branch), `cross_read_fast` (RwLock shared-read bypass of SPSC), `cross_spsc` (deferred cross-shard write via `PipelineBatchSlotted`). Ratio `cross_spsc / Σ` is the ground-truth signal for dispatch-layer optimization work. Zero-allocation hot-path overhead (`&'static str` labels, `#[inline]` with early-return on `!METRICS_INITIALIZED`). Verified on macOS + Linux: counter sums close exactly to driven traffic, no overcount.
+
+### Changed
+
+- **`text-index` is now a default feature.** BM25 full-text search (`FT.SEARCH` BM25 mode), `FT.AGGREGATE`, and three-way RRF hybrid fusion are included in all standard builds. No longer requires `--features text-index`. To exclude it (e.g. minimal embedded builds): `--no-default-features --features runtime-monoio,jemalloc,graph`.
+
+### Added — SDK Validation
+
+- **Python SDK `sdk/python/examples/validate.py`**: End-to-end live validator for all SDK sub-clients: ping, strings, counter, hash, list, set, zset, vector index lifecycle, graph engine, session search, semantic cache, text search (BM25 + aggregate + hybrid), and server info. Result against Moon with `text-index`: **114 PASS / 0 FAIL / 0 SKIP**. Gracefully skips text sections when server built without `text-index`.
+- **Rust SDK `sdk/rust/examples/validate.rs`**: Re-validated against `text-index` build — **85 PASS / 0 FAIL**.
+
+### Fixed — Python SDK
+
+- **`moondb.graph._parse_neighbors`**: server returns alternating `[edge_map, node_map, ...]` as flat key-value arrays (`b'id'`, `int`, `b'src'`, `int`, `b'dst'`, `int`, …). Previous parser expected positional `[node_id, label, props]` — caused `int() on b'id'` crash. Now correctly identifies node entries by `labels` key and parses them from the flat kv format.
+
+### Fixed — CI Hygiene
+
+- **`tests/pipeline_auto_index.rs`**: tighten outer cfg from `runtime-tokio` to `all(runtime-tokio, text-index)` so the file compiles to zero tests when text-index is disabled. Previously the file compiled but the FT.SEARCH text fast path was `#[cfg]`-ed out, causing `@name:corpus` queries to fall through to the KNN-only parser and panic with "invalid KNN query syntax".
+- **4 FT unwraps**: add inline `#[allow(clippy::unwrap_used)]` with invariant justifications in `vector_search/ft_text_search.rs` (3 sites inside `apply_post_processing` where `do_summarize` / `do_highlight` implies the Option is Some) and `handler_monoio/ft.rs:165` (`is_text` was derived from `query_bytes.as_ref().map_or(false, _)`). Restores the audit-unwrap baseline to 0.
+
+### Compatibility
+
+- **Wire protocol**: unchanged. Drop-in replacement for v0.1.11.
+- **Persistence on-disk format**: unchanged.
+- **Default feature set**: `text-index` is now on by default. Minimal
+ embedded builds need an explicit `--no-default-features --features
+ runtime-monoio,jemalloc,graph`.
+
## [0.1.11] — 2026-04-27
Hot-path perf release — eliminates two atomic-CAS hot paths in the write
@@ -537,29 +653,47 @@ All 2450 lib tests passing locally (`cargo test --no-default-features
- **Persistence on-disk format**: unchanged.
- **Replication wire format**: unchanged.
-## [Unreleased]
-
-### Added — Dispatch Observability (Phase 177)
-
-- **`moon_dispatch_path_total{path=...}` Prometheus counter**: four-way classification of every command by shard-routing decision — `local_inline` (SIMD fast path), `local` (standard local branch), `cross_read_fast` (RwLock shared-read bypass of SPSC), `cross_spsc` (deferred cross-shard write via `PipelineBatchSlotted`). Ratio `cross_spsc / Σ` is the ground-truth signal for dispatch-layer optimization work. Zero-allocation hot-path overhead (`&'static str` labels, `#[inline]` with early-return on `!METRICS_INITIALIZED`). Verified on macOS + Linux: counter sums close exactly to driven traffic, no overcount.
-
-### Fixed — CI Hygiene
-
-- **`tests/pipeline_auto_index.rs`**: tighten outer cfg from `runtime-tokio` to `all(runtime-tokio, text-index)` so the file compiles to zero tests when text-index is disabled. Previously the file compiled but the FT.SEARCH text fast path was `#[cfg]`-ed out, causing `@name:corpus` queries to fall through to the KNN-only parser and panic with "invalid KNN query syntax".
-- **4 FT unwraps**: add inline `#[allow(clippy::unwrap_used)]` with invariant justifications in `vector_search/ft_text_search.rs` (3 sites inside `apply_post_processing` where `do_summarize` / `do_highlight` implies the Option is Some) and `handler_monoio/ft.rs:165` (`is_text` was derived from `query_bytes.as_ref().map_or(false, _)`). Restores the audit-unwrap baseline to 0.
-
-### Changed
-
-- **`text-index` is now a default feature.** BM25 full-text search (`FT.SEARCH` BM25 mode), `FT.AGGREGATE`, and three-way RRF hybrid fusion are included in all standard builds. No longer requires `--features text-index`. To exclude it (e.g. minimal embedded builds): `--no-default-features --features runtime-monoio,jemalloc,graph`.
-
-### Added — SDK Validation
-
-- **Python SDK `sdk/python/examples/validate.py`**: End-to-end live validator for all SDK sub-clients: ping, strings, counter, hash, list, set, zset, vector index lifecycle, graph engine, session search, semantic cache, text search (BM25 + aggregate + hybrid), and server info. Result against Moon with `text-index`: **114 PASS / 0 FAIL / 0 SKIP**. Gracefully skips text sections when server built without `text-index`.
-- **Rust SDK `sdk/rust/examples/validate.rs`**: Re-validated against `text-index` build — **85 PASS / 0 FAIL**.
-
-### Fixed — Python SDK
-
-- **`moondb.graph._parse_neighbors`**: Server returns alternating `[edge_map, node_map, ...]` as flat key-value arrays (`b'id'`, `int`, `b'src'`, `int`, `b'dst'`, `int`, …). Previous parser expected positional `[node_id, label, props]` — caused `int() on b'id'` crash. Now correctly identifies node entries by `labels` key and parses them from the flat kv format.
+## [0.1.10] — 2026-04-23
+
+Stable replication marker. **Single-shard PSYNC2 wired end-to-end and
+production-ready** for `--shards 1` master with any `--shards N` replica
+topology. Multi-shard master PSYNC is scheduled for v0.2 (see
+`.planning/rfcs/multi-shard-replication-design.md`).
+
+- **Replication** (`081c43b`): single-shard master PSYNC2 end-to-end wired,
+ REPLCONF validated, `master_link_status` reports the actual handshake
+ state instead of the legacy `up` stub.
+- **Performance**: batch-level eviction gate; `try_handle_*` paths
+ `#[inline]`-ed; DashTable carries through the v0.1.10 pre-size
+ groundwork (capacity hint + headroom).
+- **Docs**: BENCHMARK.md §2.7 updated with the 2026-04-22 GCloud
+ re-measurement; v0.1.x replication scope documented under
+ `docs/guides/clustering.mdx#replication`.
+
+## [0.1.9] — 2026-04-19
+
+**Lunaris Retriever Gap Closure.** Every v0.1.8 client-side fallback in
+the Lunaris SDK is now closed so `HybridRRFRetriever` (dense path),
+`GraphFirstRetriever`, and `PathReasoningRetriever` run Moon-native.
+
+- **Phase 167 CYP-01/02**: Cypher `CREATE` / `MERGE` writes participate
+ in `CrossStoreTxn` via `record_graph()`; `TXN.ABORT` rolls them back.
+- **Phase 168 CYP-03/06**: `coalesce()` built-in + single-hop edge-var
+ binding in variable-length `EXPAND`.
+- **Phase 169 CYP-04/05**: `shortestPath()` parser + Dijkstra executor
+ bridge with path-variable binding.
+- **Phase 170 HYB-01/02/04**: `FT.SEARCH HYBRID` dense stream honours
+ `as_of_lsn`.
+- **Phase 171 SCAT-01/02/03**: `ShardMessage::VectorSearch` +
+ `FtHybridPayload` carry `as_of_lsn` for multi-shard `AS_OF` correctness.
+- **Phase 172 PIPE-01/02/03**: pipeline-aware HSET auto-indexing
+ regression guard (3-test suite).
+
+Audit status: **PASSED_WITH_DOCUMENTED_DEFERRALS**. 15 / 20 requirements
+fully satisfied; HYB-03 BM25 MVCC deferred and closed in v0.1.10
+follow-up (G-1); Phase 173 hygiene HYG-02 handler split RFC'd.
+
+Stats: 6 phases shipped, 17 plans, 27 files changed, +2924 / −376 LOC.
## [0.1.8] — 2026-04-18
@@ -837,7 +971,14 @@ All 2450 lib tests passing locally (`cargo test --no-default-features
- **Security hardening:** `deny.toml` (cargo-deny), `SECURITY.md`, `docs/THREAT-MODEL.md`, `docs/security/lua-sandbox.md`, TLS cipher suite freeze
- **Release engineering:** `docs/versioning.md`, 6 operator runbooks, CHANGELOG CI gate, user docs (getting-started, configuration, monitoring), release pipeline SHA256 checksums + SBOM + cosign
-## [Earlier Unreleased] - Dispatch Hot-Path Recovery (2026-04-08)
+## [0.1.3] — 2026-04-10
+
+Production-readiness foundation: dispatch hot-path recovery, vector-search
+4× QPS + correctness fixes, and the tiered disk-offload landing with 100 %
+crash recovery across 7 persistence configurations. Bundles three work
+streams originally tracked as separate Unreleased blocks (Apr 7–8).
+
+### Dispatch Hot-Path Recovery (2026-04-08)
**Pipelined SET +37%, pipelined GET +68% at p=16 after PR #43 regression recovery.**
@@ -934,9 +1075,7 @@ captured as todo in `.planning/todos/pending/`.
---
-## [Unreleased] - Vector Search 4x QPS + Correctness
-
-### Vector Search Performance & Correctness (2026-04-07)
+### Vector Search 4× QPS + Correctness (2026-04-07)
**4x search QPS, 4.1x lower latency, 2.56x faster than Qdrant on real MiniLM data.**
@@ -991,9 +1130,9 @@ embeddings (clustered) achieve 0.92-0.97 recall with the same code.
---
-## [Earlier Unreleased] - Disk Offload & x86_64 Performance
+### Disk Offload & x86_64 Performance (2026-04-06)
-Tiered storage, crash recovery, and 2x Redis on x86_64 (Intel Xeon, io_uring).
+Tiered storage, crash recovery, and 2× Redis on x86_64 (Intel Xeon, io_uring).
### Added
diff --git a/README.md b/README.md
index 5fcfbbe7..350c8861 100644
--- a/README.md
+++ b/README.md
@@ -7,11 +7,12 @@
-
+
-
+
+
@@ -20,17 +21,38 @@
Quick start •
Why Moon •
Benchmarks •
+ Moon vs Redis vs Valkey •
+ Production readiness •
Docs •
Changelog
---
-> **⚠ Experimental.** Moon is under active development and **not** recommended for production. Storage formats, APIs, and config flags may change between releases. Please [open an issue](https://github.com/pilotspace/moon/issues) if something breaks.
+> **Production-grade architecture, pre-1.0 maturity.** Single-node Moon
+> (`--shards N` master, `--shards 1` for replication-eligible workloads)
+> is recommended for production caching, vector / graph / feature-store
+> workloads, and Redis-compatible OLTP. Multi-node clustering and
+> multi-shard master PSYNC are **alpha in v0.2** — see
+> [Production readiness](#production-readiness) for the honest matrix of
+> what is and isn't yet GA. Wire protocol and on-disk format are LTS as of
+> v0.2 (`docs/STORAGE-FORMAT-V1.md`); CLI flags may still evolve until
+> v1.0. [Open an issue](https://github.com/pilotspace/moon/issues) if
+> something breaks.
---
-Moon speaks the Redis wire protocol (RESP2/RESP3) and implements 230+ commands. It runs on **Linux** (io_uring via monoio) and **macOS** (kqueue via monoio) with a thread-per-core, shared-nothing architecture, per-shard WAL, tiered disk offload, an in-process vector search engine with BM25 full-text search, a property graph engine with Cypher subset, cross-store ACID transactions, workspace partitioning, durable message queues, bi-temporal MVCC, and an embedded web console. Any Redis client connects out of the box.
+Moon is a clean-room Rust rewrite of a Redis-compatible in-memory data
+store with first-class AI primitives. It speaks the Redis wire protocol
+(RESP2/RESP3) and implements 230+ commands — every standard Redis data
+type plus native `FT.*` vector + BM25 search, `GRAPH.*` Cypher, `TXN.*`
+cross-store ACID, workspaces, durable message queues, bi-temporal MVCC,
+and an embedded web console. Primary target is **Linux** with io_uring
+(`monoio`); a `tokio` runtime is available for portability. **macOS** is a
+supported development platform (kqueue via monoio); production
+deployments should target Linux (see
+[`docs/PRODUCTION-CONTRACT.md`](docs/PRODUCTION-CONTRACT.md) Tier 1/2).
+Any Redis client connects out of the box.
## Why Moon
@@ -61,18 +83,30 @@ Moon speaks the Redis wire protocol (RESP2/RESP3) and implements 230+ commands.
## Benchmarks
-Measured vs Redis 8.6.1, co-located client and server, pipeline depth tuned per row. Full methodology and reproduction steps in [BENCHMARK.md](BENCHMARK.md) and [docs/benchmarks.mdx](docs/benchmarks.mdx).
+Measured vs Redis 8.6.1 (peak throughput) and Redis 8.0.2 + Valkey 9.1.0
+(hash-TTL surface, the only workload where all three were head-to-head
+benchmarked). Co-located client and server, pipeline depth tuned per
+row. Full methodology and reproduction steps in
+[BENCHMARK.md](BENCHMARK.md) and [docs/benchmarks.mdx](docs/benchmarks.mdx).
+Valkey peak-throughput columns are intentionally blank — a head-to-head
+peak-RPS bench on identical hardware has not yet been run; the
+[Moon vs Redis vs Valkey](#moon-vs-redis-vs-valkey) section quotes
+Valkey's vendor-published 2.1M RPS (9 I/O threads, p=10) for context.
### Peak throughput (GCloud c3-standard-8, x86_64, monoio io_uring)
-| Workload | Moon | Redis | Ratio |
-|----------------------------------|-------:|------:|:------:|
-| Peak GET (c=50, p=64) | 5.11M | 2.98M | **1.72×** |
-| Peak SET (c=50, p=64) | 3.50M | 1.82M | **1.92×** |
-| GET, production defaults (AOF+disk-offload) | 4.76M | 2.46M | **1.93×** |
-| GET, max durability (fsync always)| 4.85M | 2.45M | **1.98×** |
-| Memory, values ≥ 1 KB | — | — | **27–35% less** |
-| Crash recovery (SIGKILL, 5K keys)| 100% | 100% | parity |
+| Workload | Moon | Redis 8.6.1 | Valkey 9.1.0 |
+|------------------------------------------------|-------:|------------:|-------------:|
+| Peak GET (c=50, p=64) | 5.11M | 2.98M (1.72×) | not yet benched |
+| Peak SET (c=50, p=64) | 3.50M | 1.82M (1.92×) | not yet benched |
+| GET, production defaults (AOF + disk-offload) | 4.76M | 2.46M (1.93×) | not yet benched |
+| GET, max durability (`fsync=always`) | 4.85M | 2.45M (1.98×) | not yet benched |
+| Memory, values ≥ 1 KB | — | **27–35 % less** | not yet benched* |
+| Crash recovery (SIGKILL, 5K keys) | 100 % | 100 % (parity)| 100 % (parity, vendor-claimed) |
+
+*Valkey 9.1 raised the embstr threshold to 128 B; below ~64 B Valkey
+9.1 may be tighter than Moon. A head-to-head re-bench across the value
+size curve is on the v0.2 roadmap.
### ARM64 (GCloud t2a-standard-8, Neoverse-N1)
@@ -98,9 +132,13 @@ Measured vs Redis 8.6.1, co-located client and server, pipeline depth tuned per
| Native API QPS | **19×** | N/A |
| Bulk insert | **23×** | 1× |
-### Hash-field TTL — Valkey 9.0/9.1 parity (OrbStack moon-dev, n=200K c=50, median of 3)
+### Hash-field TTL — Valkey 9.0/9.1 parity (v0.2.0-alpha preview)
-Three-way comparison on the per-field TTL surface added in v0.2.0. Full methodology + 26 scenarios in [docs/perf/2026-05-27-hash-ttl-3way-bench.md](docs/perf/2026-05-27-hash-ttl-3way-bench.md); reproducible via [scripts/bench-hash-ttl-3way.sh](scripts/bench-hash-ttl-3way.sh).
+> **Status:** ships in **v0.2.0-alpha** (currently on `main`, no tagged
+> release yet). Not present in the latest tagged build `v0.1.12`. Build
+> from `main` to reproduce these numbers.
+
+OrbStack moon-dev, n=200K c=50, median of 3. Three-way comparison on the per-field TTL surface added in v0.2.0. Full methodology + 26 scenarios in [docs/perf/2026-05-27-hash-ttl-3way-bench.md](docs/perf/2026-05-27-hash-ttl-3way-bench.md); reproducible via [scripts/bench-hash-ttl-3way.sh](scripts/bench-hash-ttl-3way.sh).
| Command | Pipeline | Moon | Redis 8.0.2 | Valkey 9.1.0 |
|------------------|----------|------:|------------:|-------------:|
@@ -111,7 +149,64 @@ Three-way comparison on the per-field TTL surface added in v0.2.0. Full methodol
| `HGETEX EX` | p=1 | 250K | N/A | 251K |
| `HGETEX` no-mode | p=1 | 250K | N/A | 253K |
-Plain Hash HGET p=16 ties Redis (1.01×) and Valkey (1.00×). HEXPIRE-family Moon vs Valkey: 0.90–0.99× across the surface (Valkey leads HEXPIRE p=16 by 7%, HTTL p=16 by 10%; HGETEX hits 0.99× parity). Redis 8.x has no HEXPIRE-family — Moon is the only Redis-compatible alternative aside from Valkey. The internal `HashWithTtl` HGET / HLEN paths use a cached `min_expiry_ms` for an O(1) fast path that brings them to 1.03× of plain `Hash` (was 80× slower pre-fix; see PR #126).
+Plain Hash HGET p=16 ties Redis (1.01×) and Valkey (1.00×). HEXPIRE-family Moon vs Valkey: 0.90–0.99× across the surface (Valkey leads HEXPIRE p=16 by 7 %, HTTL p=16 by 10 %; HGETEX hits 0.99× parity). Redis 8.x has no HEXPIRE-family — Moon is the only Redis-compatible alternative aside from Valkey. The internal `HashWithTtl` HGET / HLEN paths use a cached `min_expiry_ms` for an O(1) fast path that brings them to 1.03× of plain `Hash` (was 80× slower pre-fix; see PR #126).
+
+## Moon vs Redis vs Valkey
+
+Three Redis-protocol-compatible servers, three different bets. Moon
+competes on a **vertical moat** — thread-per-core architecture and an
+AI-native in-core surface. Valkey competes on a **horizontal moat** —
+Linux Foundation governance, every major cloud, drop-in compatibility.
+Redis OSS continues as the upstream reference but ships under SSPL since
+March 2024. The deep architectural review is in
+[`docs/comparison-valkey.md`](docs/comparison-valkey.md) (~22 KB,
+traced to source).
+
+| Dimension | **Moon v0.1.12 / 0.2.0-alpha** | **Valkey 9.1.0** | **Redis 8.6.1 (OSS)** |
+|--------------------------------------|--------------------------------|-----------------------------|----------------------------|
+| Language / license | Rust 2024 / Apache-2.0 | C99 / BSD-3-Clause (LF TSC) | C99 / **SSPL** since 2024 |
+| Threading model | Thread-per-core, shared-nothing | Single main thread + ≤9 I/O threads | Single-threaded core (+ I/O threads) |
+| I/O driver (Linux) | io_uring (`monoio`) | epoll only | epoll only |
+| Snapshot | **Forkless** (segment-level COW) | `fork()` + COW | `fork()` + COW |
+| AOF / WAL | Per-shard WAL v3 + multi-part AOF | Single global AOF | Single global AOF |
+| Tiered NVMe disk offload | **Yes** (under `maxmemory`) | No (OSS) | No (OSS — Redis Flash is Enterprise) |
+| Vector search | **In-core** HNSW + TurboQuant 1-8 bit | `valkey-search` module, FP32 only | RediSearch module |
+| Full-text BM25 | **In-core** | `valkey-search` module | RediSearch module |
+| Property graph (Cypher) | **In-core** 14 `GRAPH.*` cmds | None | None (FalkorDB separate) |
+| Cross-store ACID | `TXN.BEGIN/COMMIT/ABORT` | None | None |
+| Hash-field TTL (`HEXPIRE`-family) | **Yes** (Valkey-parity, v0.2-alpha) | **Yes** (9.0+) | No |
+| PITR + CDC | `--recovery-target-lsn` + `CDC.READ` (v0.2-alpha) | None | None |
+| Embedded web console | **Yes** (7-view React, in-binary) | Valkey Admin GUI 1.0 (separate) | Redis Insight (separate) |
+| Managed cloud offerings | None (yet) | AWS, GCP, Oracle, Aiven, … | Redis Cloud (vendor) |
+| Multi-node cluster, soak-tested | **v0.2 alpha** (single-node GA today) | **Production** | **Production** |
+| Atomic slot migration | Planned (v0.2) | Yes (9.0) | No |
+| Peak single-server throughput | **5.11M GET/s** (c3-8 x86_64) | 2.1M RPS (9 I/O threads, p=10, vendor) | 2.98M GET/s (c3-8, same harness as Moon) |
+
+### When to choose Moon
+
+- Single-node Redis-compatible workloads where peak throughput,
+ memory efficiency at ≥1 KB values, or **forkless snapshots** matter.
+- AI-native applications: vector search, GraphRAG, semantic cache,
+ hybrid BM25 + dense + sparse retrieval — all in one binary, no module
+ loader, with cross-store ACID across KV / vector / graph.
+- Workloads that benefit from **tiered NVMe offload** under `maxmemory`
+ instead of LRU-eviction-then-rebuild.
+
+### When to choose Valkey
+
+- Multi-node clusters with proven 1000+ node operational mileage.
+- Managed-cloud-only deployments (every major cloud offers Valkey).
+- Strict drop-in compatibility with the Redis 7.2 module ecosystem
+ (`valkey-json`, `valkey-bloom`, `valkey-search`, `valkey-ldap`).
+- Risk-averse environments where Linux Foundation governance is a
+ procurement requirement.
+
+### When to stay on Redis OSS
+
+- Existing investments in RediSearch / RedisJSON / RedisBloom under
+ RLEC, or pre-SSPL-tolerance OSS Redis.
+- Specific Redis Enterprise features (CRDT active-active, Redis Flash)
+ that Moon and Valkey OSS do not match.
## Quick start
@@ -148,9 +243,9 @@ OK
"world"
127.0.0.1:6379> HSET user:1 name Alice age 30
(integer) 2
-127.0.0.1:6379> HEXPIRE user:1 3600 FIELDS 1 age # Valkey 9.0 per-field TTL
+127.0.0.1:6379> HEXPIRE user:1 3600 FIELDS 1 age # Valkey 9.0 per-field TTL (v0.2.0-alpha; build from `main`)
1) (integer) 1
-127.0.0.1:6379> HTTL user:1 FIELDS 1 age
+127.0.0.1:6379> HTTL user:1 FIELDS 1 age # v0.2.0-alpha
1) (integer) 3600
127.0.0.1:6379> FT.CREATE idx ON HASH PREFIX 1 doc: SCHEMA emb VECTOR HNSW 6 DIM 384 TYPE FLOAT32 DISTANCE_METRIC COSINE
OK
@@ -326,34 +421,123 @@ cargo flamegraph --bin moon -- --port 6399 --shards 1
Contribution guide and coding rules (unsafe policy, hot-path allocation rules, lock discipline) are in [CLAUDE.md](CLAUDE.md) and [UNSAFE_POLICY.md](UNSAFE_POLICY.md).
-## Roadmap
-
-Moon is pre-1.0 and **experimental**. Current focus:
-
-- Correctness parity with Redis 8.x across the full command surface
-- AI-native primitives: session dedup, hybrid vector+sparse search, agentic caching
-- Multi-node clustering with gossip, slot migration, and PSYNC2 replication
-- GPU-accelerated vector search (CUDA, feature-gated)
-- Production hardening and SLO validation (see [docs/PRODUCTION-CONTRACT.md](docs/PRODUCTION-CONTRACT.md))
-
-Completed in v0.1.0–v0.1.8:
-- Tiered disk offload (RAM → NVMe) with 100% crash recovery
-- In-process vector search (HNSW + TurboQuant 4/8-bit) with `FT.*` API
-- BM25 full-text search with three-way hybrid fusion (BM25 + dense + sparse)
-- Property graph engine with Cypher subset (14 `GRAPH.*` commands)
-- Cross-store ACID transactions (`TXN.BEGIN`/`COMMIT`/`ABORT`) across KV, vector, and graph
-- Workspace partitioning for multi-tenant namespace isolation
-- Durable message queues with dead-letter and debounced triggers
-- Bi-temporal MVCC for point-in-time KV and graph queries
-- Web console (7 views, embedded in binary)
-- macOS support (aarch64 + x86_64, both runtimes)
-- Thread-per-core dispatch optimization (5.11M GET/s on x86_64)
-
-Production readiness is **not** a v0.1 goal. Storage formats, APIs, and config flags may change between releases.
-
-## Production Readiness
-
-Moon's v1.0 promises — SLOs, durability modes, supported platforms, and a machine-checkable GA exit-criteria checklist — live in **[docs/PRODUCTION-CONTRACT.md](docs/PRODUCTION-CONTRACT.md)**. The contract is the single source of truth every v0.1.3 hardening phase tests against.
+## Production readiness
+
+Honest matrix of where Moon is today (v0.1.12 GA + v0.2.0-alpha
+unreleased). Read alongside
+[`docs/PRODUCTION-CONTRACT.md`](docs/PRODUCTION-CONTRACT.md) (the
+machine-checkable GA exit criteria) and
+[`docs/OPERATOR-GUIDE.md`](docs/OPERATOR-GUIDE.md) (memory accounting,
+sizing, runbooks).
+
+### Recommended for production today
+
+- **Single-node deployments** — Linux aarch64 (Tier 1) or Linux x86_64
+ (Tier 2). `--shards N` master, one process, one node.
+- **Read replication** — `--shards 1` master with any `--shards N`
+ replica topology. Single-shard PSYNC2 is wired end-to-end since
+ v0.1.10.
+- **AI workloads** — vector search (HNSW + TurboQuant), BM25 full-text,
+ GraphRAG, semantic caching, hybrid retrieval. All in-core, all
+ RDB / WAL durable, all crash-recovery validated.
+- **Cache + feature store** — durability modes are honest
+ (`always` / `everysec` / `no` with documented recovery bounds), forkless
+ snapshots remove the Redis fork-COW RSS spike, tiered NVMe offload
+ under `maxmemory` keeps working sets larger than RAM.
+- **Crash recovery** — 100 % survived across 7 persistence
+ configurations and 5 K-key SIGKILL workloads. RDB v2 + WAL v3 +
+ multi-part AOF + tiered cold tier all participate.
+
+### Not yet GA — avoid for production
+
+- **Multi-node clustering** (16 K-slot gossip, MOVED/ASK, failover) —
+ protocol-compatible code exists but **PSYNC2 atomic slot migration
+ is not soak-tested**. Valkey 9.0 shipped this; Moon has not.
+ Scheduled for v0.2.
+- **Multi-shard master PSYNC** — single-shard only today; multi-shard
+ master replication is RFC'd in
+ [`.planning/rfcs/multi-shard-replication-design.md`](.planning/rfcs/multi-shard-replication-design.md).
+- **PITR live-snapshot LSN wiring** (P3c) and **`CDC.SUBSCRIBE` push
+ channel** (C3b) — `CDC.READ` polling is alpha-ready; push and
+ zero-snapshot PITR are deferred to v0.2 follow-ups.
+- **GPU vector acceleration** (`gpu-cuda` feature) — kernel scaffold
+ exists; production kernels not yet shipping.
+- **macOS native** — first-class development platform with full feature
+ set minus io_uring, but production deployments should target Linux
+ per
+ [`docs/PRODUCTION-CONTRACT.md`](docs/PRODUCTION-CONTRACT.md) tiers.
+- **Performance SLO numbers in
+ [`docs/PRODUCTION-CONTRACT.md`](docs/PRODUCTION-CONTRACT.md)** — marked
+ `[provisional]` until the Phase 97 24-h HDR-histogram rig validates
+ them on reference hardware. Use the benchmarks above as point-in-time
+ measurements, not committed SLOs.
+
+### Operator gotchas worth knowing before you deploy
+
+- **Multi-shard scaling needs load.** Aim for `clients ≥ 25 × shards`
+ on pipeline ≥ 16 workloads; below that, multiple shards under-subscribe
+ the dispatch loop and a single shard wins. Random keyspaces with
+ `p=1` benefit less than `{tag}`-co-located keys (see
+ [`CLAUDE.md`](CLAUDE.md) "Gotchas" + the v0.1.12 multi-shard memo).
+- **Fairness flags for benchmarking against Redis / Valkey** —
+ `--disk-offload disable --appendonly no` removes Moon's durability
+ overhead (~26 % on SET p=64) so comparisons are apples-to-apples.
+- **Memory accounting** — bill on RSS, not VSZ. `MEMORY DOCTOR`,
+ `moon_memory_bytes{kind=…}` Prometheus gauge, and
+ [`docs/OPERATOR-GUIDE.md#memory-accounting`](docs/OPERATOR-GUIDE.md#memory-accounting)
+ cover the full VSZ-vs-RSS guide and tuning knobs
+ (`--memory-arenas-cap`, `mimalloc-alt`).
+- **RDB hash-field TTL trailer** is RDB v2 only — older v1 readers stop
+ after the hash body and silently drop per-field TTLs. Pin storage
+ format to `v1` (the umbrella covering v2/v3 sub-formats) per
+ [`docs/STORAGE-FORMAT-V1.md`](docs/STORAGE-FORMAT-V1.md).
+
+### Roadmap
+
+| Milestone | Focus | Status |
+|------------------|---------------------------------------------------------------------------------------------|-------------|
+| **v0.2.0** (next) | Multi-node clustering soak (PSYNC2 + atomic slot migration); PITR P3c; CDC push (`SUBSCRIBE`) | alpha |
+| **v0.2.x** | GPU vector acceleration (`gpu-cuda`); operator runbooks; full SLO lock-in (`PERF-01..05`) | planned |
+| **v1.0** | Every [`PRODUCTION-CONTRACT.md`](docs/PRODUCTION-CONTRACT.md) GA exit-criteria box ticked | gate |
+
+What's already in `main` (v0.1.0 → v0.2.0-alpha, 14 months of work):
+
+> Hash-field TTL and PITR + CDC ship in **v0.2.0** — currently alpha on
+> `main`, no tagged release yet. **v0.1.12** is the latest tag and does
+> NOT include them. Everything else below is in `v0.1.12` GA.
+
+**Shipped in v0.1.12 (latest tag, single-node production-grade):**
+
+- Forkless persistence (RDB v2 + per-shard WAL v3 + multi-part AOF).
+- Tiered disk offload (RAM → NVMe) with 100 % crash recovery.
+- In-process vector search (HNSW + TurboQuant 1–8-bit) — `FT.*` surface.
+- BM25 full-text search + three-way RRF hybrid fusion (BM25 + dense + sparse).
+- Property graph engine with Cypher subset (14 `GRAPH.*` commands).
+- Cross-store ACID (`TXN.BEGIN` / `COMMIT` / `ABORT`) across KV, vector,
+ graph.
+- Workspaces, durable message queues, bi-temporal MVCC.
+- Web console (7-view React app, embedded in binary).
+- Thread-per-core dispatch optimization (5.11M GET/s on x86_64).
+
+**Added in v0.2.0-alpha (on `main`, untagged):**
+
+- Hash-field TTL — Valkey 9.0 / 9.1 parity, O(1) HGET / HLEN fast path
+ (`HEXPIRE`, `HEXPIREAT`, `HPEXPIRE`, `HPEXPIREAT`, `HEXPIRETIME`,
+ `HPEXPIRETIME`, `HTTL`, `HPTTL`, `HPERSIST`, `HGETEX`, `HGETDEL`).
+- PITR — `--recovery-target-lsn` deterministic WAL replay-to-LSN.
+- CDC — `CDC.READ` pull-mode change stream (push-mode planned for v0.2.0
+ GA).
+- Multi-node clustering soak (PSYNC2 + atomic slot migration).
+
+## Production Readiness Contract
+
+Moon's v1.0 promises — per-command-class SLOs, durability modes, supported
+platforms, security guarantees, and a machine-checkable GA exit-criteria
+checklist — live in
+**[`docs/PRODUCTION-CONTRACT.md`](docs/PRODUCTION-CONTRACT.md)**. Every
+v0.1.3+ hardening phase ticks off items on that checklist; nothing
+promotes to `v1.0-rc1` until every box is green. The contract is the
+single source of truth for what Moon owes you in production.
## Credits
diff --git a/docs/runbooks/multi-shard-aof-rewrite.md b/docs/runbooks/multi-shard-aof-rewrite.md
new file mode 100644
index 00000000..9c33b609
--- /dev/null
+++ b/docs/runbooks/multi-shard-aof-rewrite.md
@@ -0,0 +1,97 @@
+# Runbook — multi-shard + AOF refused at startup
+
+**Status:** active (Moon ≥ v0.1.13). Lifted in v2.0 once multi-shard
+AOF replay walks every shard's segment manifest on recovery.
+
+## What you saw
+
+### At startup
+
+```
+REFUSING TO START: --shards 2 + --appendonly yes has a known data-loss
+bug on SIGKILL (~50 % loss verified 2026-05-26). Fix: use --shards 1,
+or pass --appendonly no for cache-only deployments, or pass
+--unsafe-multishard-aof to acknowledge the risk and start anyway. See
+docs/runbooks/multi-shard-aof-rewrite.md.
+```
+
+### Or at command time (defence-in-depth, fires under any escape-hatch)
+
+```
+> BGREWRITEAOF
+(error) ERR BGREWRITEAOF is unsafe with --shards >= 2 + --disk-offload enable
+ + --appendonly yes (known data-loss bug; see
+ docs/runbooks/multi-shard-aof-rewrite.md). Use --shards 1, set
+ --disk-offload disable, or wait for v2.0 multi-part AOF replay.
+```
+
+## Why this gate exists
+
+Verified on `main` at commit `6e49050` (2026-05-26), reproducers in
+[`tmp/p0-no-rewrite.sh`](../../tmp/p0-no-rewrite.sh),
+[`tmp/p0-always.sh`](../../tmp/p0-always.sh),
+[`tmp/p0-multishard-no-offload.sh`](../../tmp/p0-multishard-no-offload.sh):
+
+| Configuration | Result |
+|----------------------------------------------------------------------------|------------------------------|
+| `--shards 1 --appendonly yes --appendfsync always` (control) | ✅ Recovers 5000 / 5000 |
+| `--shards 1 --disk-offload enable --appendonly yes` (control) | ✅ Recovers 12 714 / 12 714 |
+| `--shards 2 --disk-offload enable --appendonly yes --appendfsync everysec` | ❌ Loses 38 % (12 662 → 7 892) |
+| `--shards 2 --disk-offload enable --appendonly yes --appendfsync always` | ❌ Loses 50 % (5000 → 2474) |
+| `--shards 2 --disk-offload disable --appendonly yes --appendfsync always` | ❌ Loses 50 % (5000 → 2453) |
+
+**The bug is in the multi-shard AOF durability path itself**, not the
+rewrite path and not the disk-offload tier. `--appendfsync always` and
+`--disk-offload disable` do not save you — only `--shards 1` does.
+
+The rewrite-specific gate (the `BGREWRITEAOF` error above) is still
+shipped as defence-in-depth for anyone who passes
+`--unsafe-multishard-aof`.
+
+## How to recover from a triggered loss (if you hit this on < v0.1.13)
+
+1. If a recent RDB snapshot exists in `--dir`, stop the server, move
+ `appendonlydir/` aside, and let recovery rebuild from RDB +
+ surviving per-shard WAL only. RPO equals the time since the RDB
+ snapshot.
+2. If replication was running, promote a non-affected replica
+ (`REPLICAOF NO ONE`) and re-sync the affected node.
+3. If neither: data is lost. File a `P0` with the AOF manifest
+ contents (`appendonlydir/moon.aof.manifest`) and per-shard WAL
+ sizes.
+
+## How to avoid the gate
+
+Pick whichever matches your deployment:
+
+| Option | Trade-off |
+|----------------------------------------------------|------------------------------------------------------------------------------------------|
+| `--shards 1` | **Recommended.** Best throughput on non-pipelined workloads; gives up multi-shard fan-out |
+| `--appendonly no` | Cache-only deployments; durability falls back to `save` rules + RDB recovery |
+| `--unsafe-multishard-aof` | **Discouraged.** Acknowledges ~50 % loss on crash; suitable only for ephemeral caches |
+
+The first option also clears the `BGREWRITEAOF` defence-in-depth gate.
+
+## When will this be removed?
+
+v2.0 ships multi-shard AOF replay that walks every shard's segment
+manifest on recovery. Both gates (startup refusal + `BGREWRITEAOF`
+error) are removed at the same time. Track progress at
+[`tmp/SHIP-PLAN-v1.0-rc1-single-node.md`](../../tmp/SHIP-PLAN-v1.0-rc1-single-node.md)
+§ Track B.
+
+## Telemetry
+
+When `--unsafe-multishard-aof` is passed AND the suspect config is set,
+the BGREWRITEAOF-specific gate also logs at startup at `WARN`:
+
+```
+BGREWRITEAOF gated for this config (known data-loss path; see
+docs/runbooks/multi-shard-aof-rewrite.md). Use --shards 1 or
+--disk-offload disable to re-enable rewrite.
+```
+
+Each gated `BGREWRITEAOF` invocation also returns the documented `ERR`
+line at the wire, so any operator dashboard tailing `slowlog` or
+client-side error counters will surface the refusal immediately. A
+dedicated Prometheus gauge for both gates is on the v1.0-rc1 backlog.
diff --git a/src/command/persistence.rs b/src/command/persistence.rs
index 8ca5b48e..10980fe3 100644
--- a/src/command/persistence.rs
+++ b/src/command/persistence.rs
@@ -7,9 +7,7 @@ use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use bytes::Bytes;
use tracing::{error, info};
-use crate::runtime::channel;
-
-use crate::persistence::aof::AofMessage;
+use crate::persistence::aof::{AofMessage, AofPoolSendError, AofWriterPool};
use crate::persistence::rdb;
use crate::protocol::Frame;
use crate::storage::Database;
@@ -33,6 +31,21 @@ pub static BGSAVE_SHARDS_REMAINING: AtomicU64 = AtomicU64::new(0);
/// Whether the last BGSAVE completed successfully.
pub static BGSAVE_LAST_STATUS: AtomicBool = AtomicBool::new(true);
+/// Process-wide gate set at startup when the configuration combination
+/// `shards >= 2 + --disk-offload enable + --appendonly yes` is selected.
+///
+/// `BGREWRITEAOF` under this combination silently truncates the WAL of every
+/// shard except the rewriter's own shard while the consolidated multi-part AOF
+/// base RDB written by the rewrite is **not** consumed on restart (verified
+/// 2026-05-26 against HEAD `6e49050`: 38 % data loss reproducible). Until the
+/// v2.0 multi-part AOF replay walks every shard's segment manifest, the only
+/// safe behavior is to refuse the command in this config and point operators
+/// at the runbook.
+///
+/// Set once in `main.rs` after CLI parsing; never cleared. Checked by
+/// `bgrewriteaof_start_sharded` before dispatching the rewrite message.
+pub static MULTI_SHARD_AOF_REWRITE_UNSAFE: AtomicBool = AtomicBool::new(false);
+
/// Global flag indicating whether a BGREWRITEAOF rewrite is currently in progress.
///
/// Set to `true` when `bgrewriteaof_start` or `bgrewriteaof_start_sharded` dispatches
@@ -208,14 +221,31 @@ pub fn bgsave_shard_done(success: bool) {
}
}
+/// Translate an `AofWriterPool` send failure into a user-facing RESP error.
+/// Under PerShard layout, `pool.try_send_rewrite` returns
+/// `RewriteUnsupportedInPerShard` — the per-shard rewrite path lands in
+/// step 6 of the per-shard AOF RFC. Until then BGREWRITEAOF refuses with
+/// a stable error rather than silently no-op'ing.
+fn rewrite_pool_error_frame(err: AofPoolSendError) -> Frame {
+ match err {
+ AofPoolSendError::RewriteUnsupportedInPerShard => Frame::Error(Bytes::from_static(
+ b"ERR BGREWRITEAOF is not yet supported under per-shard AOF layout; per-shard rewrite ships in step 6 of the per-shard AOF migration",
+ )),
+ AofPoolSendError::SendFailed => Frame::Error(Bytes::from_static(
+ b"ERR Background AOF rewrite failed to start",
+ )),
+ }
+}
+
/// Start a background AOF rewrite (BGREWRITEAOF command).
///
-/// Sends a Rewrite message to the AOF writer task, which will generate
-/// synthetic commands from current database state and replace the AOF file.
+/// Submits a Rewrite message through the writer pool, which generates
+/// synthetic commands from current database state and replaces the AOF
+/// file.
///
/// Uses CAS to set `AOF_REWRITE_IN_PROGRESS`: if a rewrite is already running,
/// returns an error immediately without corrupting the in-flight rewrite state.
-pub fn bgrewriteaof_start(aof_tx: &channel::MpscSender, db: SharedDatabases) -> Frame {
+pub fn bgrewriteaof_start(pool: &AofWriterPool, db: SharedDatabases) -> Frame {
// CAS: only proceed if currently false; prevents a second caller from
// clearing the flag while the first rewrite is still in progress.
if AOF_REWRITE_IN_PROGRESS
@@ -226,16 +256,15 @@ pub fn bgrewriteaof_start(aof_tx: &channel::MpscSender, db: SharedDa
b"ERR Background AOF rewrite already in progress",
));
}
- match aof_tx.try_send(AofMessage::Rewrite(db)) {
+ match pool.try_send_rewrite(AofMessage::Rewrite(db)) {
Ok(()) => Frame::SimpleString(Bytes::from_static(
b"Background append only file rewriting started",
)),
- Err(_) => {
- // Channel send failed — rewrite never started; we set the flag so we clear it.
+ Err(e) => {
+ // Send failed (channel full) or PerShard rejection — rewrite never
+ // started, so clear the in-progress flag we just set.
AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst);
- Frame::Error(Bytes::from_static(
- b"ERR Background AOF rewrite failed to start",
- ))
+ rewrite_pool_error_frame(e)
}
}
}
@@ -245,9 +274,18 @@ pub fn bgrewriteaof_start(aof_tx: &channel::MpscSender, db: SharedDa
/// Uses CAS to set `AOF_REWRITE_IN_PROGRESS`: if a rewrite is already running,
/// returns an error immediately without corrupting the in-flight rewrite state.
pub fn bgrewriteaof_start_sharded(
- aof_tx: &channel::MpscSender,
+ pool: &AofWriterPool,
shard_databases: std::sync::Arc,
) -> Frame {
+ // Refuse the rewrite under the known-unsafe config combo (see the
+ // MULTI_SHARD_AOF_REWRITE_UNSAFE doc comment). This is the
+ // single-node v1.0-rc1 gate; the v2.0 multi-part AOF replay fix lifts
+ // it.
+ if MULTI_SHARD_AOF_REWRITE_UNSAFE.load(Ordering::Relaxed) {
+ return Frame::Error(Bytes::from_static(
+ b"ERR BGREWRITEAOF is unsafe with --shards >= 2 + --disk-offload enable + --appendonly yes (known data-loss bug; see docs/runbooks/multi-shard-aof-rewrite.md). Use --shards 1, set --disk-offload disable, or wait for v2.0 multi-part AOF replay.",
+ ));
+ }
// CAS: only proceed if currently false; prevents a second caller from
// clearing the flag while the first rewrite is still in progress.
if AOF_REWRITE_IN_PROGRESS
@@ -258,16 +296,15 @@ pub fn bgrewriteaof_start_sharded(
b"ERR Background AOF rewrite already in progress",
));
}
- match aof_tx.try_send(AofMessage::RewriteSharded(shard_databases)) {
+ match pool.try_send_rewrite(AofMessage::RewriteSharded(shard_databases)) {
Ok(()) => Frame::SimpleString(Bytes::from_static(
b"Background append only file rewriting started",
)),
- Err(_) => {
- // Channel send failed — rewrite never started; we set the flag so we clear it.
+ Err(e) => {
+ // Send failed (channel full) or PerShard rejection — rewrite never
+ // started, so clear the in-progress flag we just set.
AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst);
- Frame::Error(Bytes::from_static(
- b"ERR Background AOF rewrite failed to start",
- ))
+ rewrite_pool_error_frame(e)
}
}
}
@@ -358,4 +395,66 @@ mod tests {
// Note: verify the static exists and is accessible
let _ = BGSAVE_LAST_STATUS.load(Ordering::Relaxed);
}
+
+ // Serialize the multi-shard gate test against any other test mutating
+ // the gate or AOF_REWRITE_IN_PROGRESS (parallel test runner otherwise
+ // races on the process-wide AtomicBools).
+ static GATE_TEST_LOCK: parking_lot::Mutex<()> = parking_lot::Mutex::new(());
+
+ #[test]
+ fn test_bgrewriteaof_sharded_refuses_under_unsafe_config() {
+ let _guard = GATE_TEST_LOCK.lock();
+ // Use a small bounded channel so the test does not need an AOF
+ // writer task; the gate must fire BEFORE try_send is reached.
+ // Wrap as a TopLevel pool to match the post-2e-β helper signature.
+ let (tx, _rx) = crate::runtime::channel::mpsc_bounded::(1);
+ let pool = AofWriterPool::top_level(tx);
+ let shard_dbs = crate::shard::shared_databases::ShardDatabases::new(
+ vec![vec![crate::storage::Database::new()]],
+ );
+
+ // Snapshot prior state so the test is order-independent.
+ let prior = MULTI_SHARD_AOF_REWRITE_UNSAFE.load(Ordering::Relaxed);
+ let prior_in_progress = AOF_REWRITE_IN_PROGRESS.load(Ordering::SeqCst);
+ AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst);
+
+ // Gate ON → must refuse with the documented ERR (and must NOT flip
+ // AOF_REWRITE_IN_PROGRESS, otherwise a normal rewrite gets blocked).
+ MULTI_SHARD_AOF_REWRITE_UNSAFE.store(true, Ordering::Relaxed);
+ let frame = bgrewriteaof_start_sharded(&pool, shard_dbs.clone());
+ match frame {
+ Frame::Error(msg) => {
+ let s = std::str::from_utf8(&msg).unwrap();
+ assert!(
+ s.contains("BGREWRITEAOF is unsafe")
+ && s.contains("multi-shard-aof-rewrite.md"),
+ "unexpected error: {s}"
+ );
+ }
+ other => panic!("expected Frame::Error, got {other:?}"),
+ }
+ assert!(
+ !AOF_REWRITE_IN_PROGRESS.load(Ordering::SeqCst),
+ "gate must not set AOF_REWRITE_IN_PROGRESS"
+ );
+
+ // Gate OFF → the gate error must NOT fire. (Without an AOF writer
+ // task draining the channel, the second call may succeed or return
+ // "failed to start" depending on buffer state; the contract under
+ // test here is only that the gate error is gone.)
+ MULTI_SHARD_AOF_REWRITE_UNSAFE.store(false, Ordering::Relaxed);
+ AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst);
+ let frame2 = bgrewriteaof_start_sharded(&pool, shard_dbs);
+ if let Frame::Error(msg) = &frame2 {
+ let s = std::str::from_utf8(msg).unwrap();
+ assert!(
+ !s.contains("BGREWRITEAOF is unsafe"),
+ "gate error fired with gate off: {s}"
+ );
+ }
+
+ // Restore prior state.
+ AOF_REWRITE_IN_PROGRESS.store(prior_in_progress, Ordering::SeqCst);
+ MULTI_SHARD_AOF_REWRITE_UNSAFE.store(prior, Ordering::Relaxed);
+ }
}
diff --git a/src/config.rs b/src/config.rs
index b384a818..8ec55ad4 100644
--- a/src/config.rs
+++ b/src/config.rs
@@ -102,6 +102,17 @@ pub struct ServerConfig {
#[arg(long, default_value = "yes")]
pub appendonly: String,
+ /// Acknowledge the known multi-shard AOF durability bug and start
+ /// anyway. Verified 2026-05-26 on HEAD `6e49050`:
+ /// `--shards >= 2 + --appendonly yes` loses ~50 % of writes on SIGKILL
+ /// regardless of `--appendfsync` or `--disk-offload` settings. Until
+ /// the v2.0 multi-shard AOF replay lands, Moon refuses this config at
+ /// startup; pass this flag to override (e.g. cache-only deployments
+ /// where the loss is acceptable). See
+ /// `docs/runbooks/multi-shard-aof-rewrite.md`.
+ #[arg(long, default_value_t = false)]
+ pub unsafe_multishard_aof: bool,
+
/// AOF fsync policy (always/everysec/no)
#[arg(long, default_value = "everysec")]
pub appendfsync: String,
diff --git a/src/main.rs b/src/main.rs
index 8bff34ac..0e71efc5 100644
--- a/src/main.rs
+++ b/src/main.rs
@@ -44,7 +44,7 @@ use std::path::PathBuf;
use clap::Parser;
use moon::config::ServerConfig;
-use moon::persistence::aof::{self, AofMessage, FsyncPolicy};
+use moon::persistence::aof::{self, AofMessage, AofWriterPool, FsyncPolicy};
use moon::runtime::cancel::CancellationToken;
use moon::runtime::channel;
use moon::runtime::{RuntimeFactoryImpl, traits::RuntimeFactory};
@@ -270,6 +270,24 @@ fn main() -> anyhow::Result<()> {
info!("Starting with {} shards", num_shards);
+ // P0-FIX-01b: refuse to start under the known durability bug
+ // (`shards >= 2 + appendonly yes` loses ~50 % of writes on SIGKILL,
+ // verified 2026-05-26 on HEAD `6e49050`; reproducer in
+ // `tmp/p0-no-rewrite.sh` and `tmp/p0-always.sh`). The bug is
+ // independent of `--appendfsync` and `--disk-offload` settings. An
+ // operator can override via `--unsafe-multishard-aof` if the
+ // deployment is cache-only and the loss window is acceptable.
+ if num_shards >= 2 && config.appendonly == "yes" && !config.unsafe_multishard_aof {
+ eprintln!(
+ "REFUSING TO START: --shards {num_shards} + --appendonly yes has a known data-loss \
+ bug on SIGKILL (~50 % loss verified 2026-05-26). Fix: use --shards 1, or pass \
+ --appendonly no for cache-only deployments, or pass --unsafe-multishard-aof to \
+ acknowledge the risk and start anyway. See \
+ docs/runbooks/multi-shard-aof-rewrite.md."
+ );
+ std::process::exit(2);
+ }
+
// T1.1: warn when maxclients < 25 × shards (undersubscription footgun).
// Suppressed by MOON_NO_UNDERSUBSCRIPTION_WARN=1.
if let Some(msg) = should_warn_undersubscription(config.maxclients, num_shards)
@@ -287,30 +305,115 @@ fn main() -> anyhow::Result<()> {
// Collect connection senders for the listener before spawning shard threads
let conn_txs: Vec<_> = (0..num_shards).map(|i| mesh.conn_tx(i)).collect();
- // Set up AOF channel: single writer, all shards send to it via mpsc::Sender clones.
- // The AOF writer task will be spawned on the listener runtime.
- let aof_tx: Option> = if config.appendonly == "yes" {
- let (tx, rx) = channel::mpsc_bounded::(10_000);
- let aof_token = cancel_token.child_token();
- let fsync = FsyncPolicy::from_str(&config.appendfsync);
- let aof_file_path = PathBuf::from(&config.dir).join(&config.appendfilename);
- // AOF writer task will be spawned on the listener runtime (see below)
- // We store rx to spawn later since listener_rt hasn't been created yet.
- // Instead, spawn on a dedicated thread so it's available before listener starts.
- std::thread::Builder::new()
- .name("aof-writer".to_string())
- .spawn(move || {
- RuntimeFactoryImpl::block_on_local(
- "aof-writer".to_string(),
- aof::aof_writer_task(rx, aof_file_path, fsync, aof_token),
+ // Set up AOF writer channel(s) + `AofWriterPool` (step 2f-β: layout-aware).
+ //
+ // Logic:
+ // 1. If `appendonly == "yes"` and an existing on-disk manifest is found,
+ // verify its shard count matches `--shards` (RFC § 3 refusal — a
+ // mismatch silently maps shards to the wrong AOF files and is fatal).
+ // 2. If the manifest's layout is `PerShard` AND `num_shards >= 2`,
+ // spawn one writer per shard (`aof-writer-{N}` threads) and emit a
+ // `AofWriterPool::per_shard(senders)`.
+ // 3. Otherwise spawn the single legacy writer and emit
+ // `AofWriterPool::top_level(tx)`. This includes:
+ // - no manifest yet (fresh install — `initialize()` only writes
+ // TopLevel today; fresh-install PerShard creation lands later
+ // in the RFC sequence)
+ // - existing TopLevel manifest (legacy v1 or single-shard v2)
+ // - `num_shards == 1` (always TopLevel; per-shard fan-out has no
+ // meaning when there is one shard)
+ //
+ // A *corrupt* manifest is fatal — `AofManifest::load` returning `Err(_)`
+ // must NOT silently fall back to TopLevel, because the next write would
+ // create a fresh manifest overwriting the reference to the real base RDB
+ // and lose data. This mirrors the replay block at L514–526.
+ //
+ // Note: nothing today constructs a `layout == PerShard` manifest on disk
+ // (initialize() hardcodes TopLevel, migrate_top_level_to_per_shard is not
+ // yet wired into boot). The PerShard branch is reachable only by a
+ // hand-crafted manifest until step 9 lifts the multi-shard gate. Runtime
+ // behavior under default configurations stays byte-identical to step 2f-α.
+ use moon::persistence::aof_manifest::{AofLayout, AofManifest};
+ let existing_manifest: Option = if config.appendonly == "yes" {
+ let base_dir = PathBuf::from(&config.dir);
+ match AofManifest::load(&base_dir) {
+ Ok(opt) => opt,
+ Err(e) => {
+ eprintln!(
+ "REFUSING TO START: AOF manifest at {}/appendonlydir/ is corrupt: {}. \
+ Inspect manually before deleting; overwriting silently loses data.",
+ base_dir.display(),
+ e
);
- })
- .expect("failed to spawn AOF writer thread");
- info!(
- "AOF enabled with fsync policy: {:?}",
- FsyncPolicy::from_str(&config.appendfsync)
- );
- Some(tx)
+ std::process::exit(2);
+ }
+ }
+ } else {
+ None
+ };
+ if let Some(ref m) = existing_manifest
+ && let Err(e) = m.verify_shard_count(num_shards as u16)
+ {
+ eprintln!("REFUSING TO START: {e}");
+ std::process::exit(2);
+ }
+
+ let aof_pool: Option> = if config.appendonly == "yes" {
+ let fsync = FsyncPolicy::from_str(&config.appendfsync);
+ let use_per_shard = matches!(
+ existing_manifest.as_ref().map(|m| m.layout),
+ Some(AofLayout::PerShard)
+ ) && num_shards >= 2;
+
+ if use_per_shard {
+ let base_dir = PathBuf::from(&config.dir);
+ let mut senders = Vec::with_capacity(num_shards);
+ for sid in 0..num_shards {
+ let (tx, rx) = channel::mpsc_bounded::(10_000);
+ let aof_token = cancel_token.child_token();
+ let base_dir = base_dir.clone();
+ let thread_name = format!("aof-writer-{sid}");
+ let thread_name_inner = thread_name.clone();
+ std::thread::Builder::new()
+ .name(thread_name)
+ .spawn(move || {
+ RuntimeFactoryImpl::block_on_local(
+ thread_name_inner,
+ aof::per_shard_aof_writer_task(
+ rx,
+ base_dir,
+ sid as u16,
+ fsync,
+ aof_token,
+ ),
+ );
+ })
+ .expect("failed to spawn per-shard AOF writer thread");
+ senders.push(tx);
+ }
+ info!(
+ "AOF enabled (PerShard, {} writers, fsync: {:?})",
+ num_shards, fsync
+ );
+ Some(AofWriterPool::per_shard(senders))
+ } else {
+ let (tx, rx) = channel::mpsc_bounded::(10_000);
+ let aof_token = cancel_token.child_token();
+ let aof_file_path = PathBuf::from(&config.dir).join(&config.appendfilename);
+ // Legacy single-writer thread. Each shard clones the outer
+ // `aof_pool` Arc; sender lifetime is governed by the pool's Drop.
+ std::thread::Builder::new()
+ .name("aof-writer".to_string())
+ .spawn(move || {
+ RuntimeFactoryImpl::block_on_local(
+ "aof-writer".to_string(),
+ aof::aof_writer_task(rx, aof_file_path, fsync, aof_token),
+ );
+ })
+ .expect("failed to spawn AOF writer thread");
+ info!("AOF enabled (TopLevel, fsync: {:?})", fsync);
+ Some(AofWriterPool::top_level(tx))
+ }
} else {
None
};
@@ -318,6 +421,23 @@ fn main() -> anyhow::Result<()> {
// Compute bind address for SO_REUSEPORT per-shard listeners (Linux io_uring path).
let bind_addr = format!("{}:{}", config.bind, config.port);
+ // P0-FIX-01: gate BGREWRITEAOF under the known data-loss config combo
+ // (multi-shard + disk-offload enabled + appendonly). Verified 2026-05-26:
+ // the rewrite truncates non-rewriter shards' WALs and the consolidated
+ // multi-part AOF base RDB is not consumed on restart, losing ~38 % of
+ // keys. v2.0 multi-part AOF replay lifts this; until then we refuse the
+ // command at dispatch time. See docs/runbooks/multi-shard-aof-rewrite.md.
+ if num_shards >= 2 && config.disk_offload_enabled() && config.appendonly == "yes" {
+ moon::command::persistence::MULTI_SHARD_AOF_REWRITE_UNSAFE
+ .store(true, std::sync::atomic::Ordering::Relaxed);
+ tracing::warn!(
+ shards = num_shards,
+ disk_offload = %config.disk_offload,
+ appendonly = %config.appendonly,
+ "BGREWRITEAOF gated for this config (known data-loss path; see docs/runbooks/multi-shard-aof-rewrite.md). Use --shards 1 or --disk-offload disable to re-enable rewrite."
+ );
+ }
+
// Create watch channel for snapshot triggers (auto-save and BGSAVE)
let (snapshot_trigger_tx, snapshot_trigger_rx) = moon::runtime::channel::watch(0u64);
@@ -644,7 +764,7 @@ fn main() -> anyhow::Result<()> {
}
let conn_rx = mesh.take_conn_rx(id);
let shard_cancel = cancel_token.clone();
- let shard_aof_tx = aof_tx.clone();
+ let shard_aof_pool = aof_pool.clone();
let shard_bind_addr = bind_addr.clone();
let shard_persistence_dir = persistence_dir.clone();
let shard_snap_rx = snapshot_trigger_rx.clone();
@@ -676,7 +796,7 @@ fn main() -> anyhow::Result<()> {
consumers,
producers,
shard_cancel,
- shard_aof_tx,
+ shard_aof_pool,
// Only pass bind_addr for per-shard SO_REUSEPORT when tokio
// with io_uring is active. monoio uses central listener MPSC.
#[cfg(feature = "runtime-tokio")]
@@ -862,9 +982,11 @@ fn main() -> anyhow::Result<()> {
});
}
- // After listener exits, send AOF shutdown and cancel all shards
- if let Some(ref tx) = aof_tx {
- let _ = tx.send(AofMessage::Shutdown);
+ // After listener exits, send AOF shutdown to every writer and cancel all shards.
+ // Under TopLevel this is one send; under PerShard (step 2f-β) this fans out to
+ // every per-shard writer thread via `broadcast_shutdown`.
+ if let Some(ref pool) = aof_pool {
+ pool.broadcast_shutdown();
}
cancel_token.cancel();
for handle in shard_handles {
diff --git a/src/persistence/aof.rs b/src/persistence/aof.rs
index a11bb7c5..e97c2910 100644
--- a/src/persistence/aof.rs
+++ b/src/persistence/aof.rs
@@ -57,8 +57,23 @@ impl FsyncPolicy {
/// Messages sent to the AOF writer task via mpsc channel.
pub enum AofMessage {
- /// Append serialized RESP command bytes to the AOF file.
- Append(Bytes),
+ /// Append serialized RESP command bytes to the AOF file, tagged with the
+ /// LSN that was issued for this write (`ReplicationState::issue_lsn`).
+ ///
+ /// `lsn` semantics by writer task:
+ /// - **TopLevel** (`aof_writer_task`): `lsn` is **ignored**; the legacy
+ /// v1 disk format is plain RESP bytes with no per-entry framing.
+ /// - **PerShard** (`per_shard_aof_writer_task`): `lsn` is **written** as
+ /// a u64 header per RFC § 2 Rule 1. Disk format per entry:
+ /// `[u64 lsn LE][u32 len LE][RESP bytes of length len]`.
+ /// Recovery reads `(lsn, cmd)` pairs and merges cross-shard
+ /// `OrderedAcrossShards` writes by LSN (RFC § 2 Rule 2).
+ ///
+ /// Construction sites that issue a real LSN call
+ /// `ReplicationState::issue_lsn(shard_id, bytes.len() as u64)` and pass
+ /// the returned value. Sites with no replication state available pass 0
+ /// (TopLevel ignores it; PerShard treats 0 as "no ordering hint").
+ Append { lsn: u64, bytes: Bytes },
/// Trigger a full AOF rewrite (compaction) using current database state.
Rewrite(SharedDatabases),
/// Trigger AOF rewrite in sharded mode (all shards' databases).
@@ -67,6 +82,298 @@ pub enum AofMessage {
Shutdown,
}
+/// Reasons a pool send may be refused without queueing.
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub enum AofPoolSendError {
+ /// `Rewrite`/`RewriteSharded` sent to a `PerShard` pool. BGREWRITEAOF must
+ /// be issued per shard in the per-shard layout; the legacy single-writer
+ /// rewrite path is not applicable.
+ RewriteUnsupportedInPerShard,
+ /// Underlying channel send failed (writer task dead or channel full).
+ SendFailed,
+}
+
+/// Bundle of per-shard AOF writer senders.
+///
+/// The pool keeps the call-site API uniform regardless of layout:
+/// - **TopLevel** (legacy v1, single-shard, also used for `--shards 1` v2):
+/// exactly one writer thread; every `sender(shard_id)` returns the same
+/// sender so all shards multiplex onto one file.
+/// - **PerShard** (v2 multi-shard): one writer per shard; `sender(shard_id)`
+/// returns the writer that owns `appendonlydir/shard-{shard_id}/`.
+///
+/// Step 2a is additive — this type is defined here but no call site is wired
+/// to it yet. Step 2c performs the type plumbing in `conn_state` and
+/// `conn/core`; steps 2d/2e/2f update the call sites and spawn paths.
+#[derive(Clone)]
+pub struct AofWriterPool {
+ senders: Vec>,
+ layout: crate::persistence::aof_manifest::AofLayout,
+}
+
+impl AofWriterPool {
+ /// Build a TopLevel pool from a single existing writer sender. Used for
+ /// legacy v1 deployments and `--shards 1` v2 deployments where one writer
+ /// thread services every shard.
+ pub fn top_level(sender: channel::MpscSender) -> Arc {
+ Arc::new(Self {
+ senders: vec![sender],
+ layout: crate::persistence::aof_manifest::AofLayout::TopLevel,
+ })
+ }
+
+ /// Build a PerShard pool from N senders. `senders[i]` MUST be the writer
+ /// task that owns `appendonlydir/shard-{i}/`. The vector's length is the
+ /// shard count; passing a length-1 vector here is a bug — use
+ /// [`AofWriterPool::top_level`] instead.
+ pub fn per_shard(senders: Vec>) -> Arc {
+ debug_assert!(
+ senders.len() >= 2,
+ "per_shard pool needs >=2 writers; use top_level for single-writer"
+ );
+ Arc::new(Self {
+ senders,
+ layout: crate::persistence::aof_manifest::AofLayout::PerShard,
+ })
+ }
+
+ /// Return the writer sender that owns the given shard's AOF file.
+ ///
+ /// For TopLevel pools, `shard_id` is ignored — all shards multiplex onto
+ /// the single sender. For PerShard pools, `shard_id` MUST be in range
+ /// `[0, num_writers())`; an out-of-range id is a programmer error and
+ /// panics in debug builds.
+ #[inline]
+ pub fn sender(&self, shard_id: usize) -> &channel::MpscSender {
+ use crate::persistence::aof_manifest::AofLayout;
+ match self.layout {
+ AofLayout::TopLevel => &self.senders[0],
+ AofLayout::PerShard => {
+ debug_assert!(
+ shard_id < self.senders.len(),
+ "shard_id {} out of range for per-shard pool of size {}",
+ shard_id,
+ self.senders.len()
+ );
+ &self.senders[shard_id]
+ }
+ }
+ }
+
+ /// Fire-and-forget append for the given shard, tagged with the LSN that
+ /// was issued for this write (see [`AofMessage::Append`] docs for LSN
+ /// semantics per layout). Call sites must source `lsn` from
+ /// `ReplicationState::issue_lsn(shard_id, bytes.len() as u64)` for writes
+ /// that participate in replication ordering; sites without a
+ /// replication-state handle pass 0.
+ #[inline]
+ pub fn try_send_append(&self, shard_id: usize, lsn: u64, bytes: Bytes) {
+ let _ = self
+ .sender(shard_id)
+ .try_send(AofMessage::Append { lsn, bytes });
+ }
+
+ /// Issue an LSN for an AOF append at every call site that has the
+ /// `Option>>` shape. Wraps
+ /// `ReplicationState::issue_lsn` so handler call sites collapse to a
+ /// single line.
+ ///
+ /// Returns 0 when:
+ /// - `repl_state` is None (test fixtures or shutdown paths)
+ /// - the `RwLock` is poisoned (shouldn't happen in production —
+ /// ReplicationState is only `write()`-locked under known-safe paths)
+ ///
+ /// 0 is a sentinel meaning "no replication ordering for this write".
+ /// TopLevel writers ignore the LSN entirely so 0 is harmless there;
+ /// PerShard writers treat 0 the same as any other LSN (per-shard order
+ /// is preserved by write order, not by LSN value). The LSN only matters
+ /// for the cross-shard `OrderedAcrossShards` merge in RFC step 5.
+ #[inline]
+ pub fn issue_append_lsn(
+ repl_state: &Option>>,
+ shard_id: usize,
+ delta: usize,
+ ) -> u64 {
+ repl_state
+ .as_ref()
+ .and_then(|rs| {
+ rs.read()
+ .ok()
+ .map(|g| g.issue_lsn(shard_id, delta as u64))
+ })
+ .unwrap_or(0)
+ }
+
+ /// Submit a Rewrite/RewriteSharded message. Only legal for TopLevel pools;
+ /// PerShard rewrites are per-shard operations and must be initiated by
+ /// the BGREWRITEAOF code path in step 6, not via this enum variant.
+ pub fn try_send_rewrite(&self, msg: AofMessage) -> Result<(), AofPoolSendError> {
+ use crate::persistence::aof_manifest::AofLayout;
+ debug_assert!(
+ matches!(msg, AofMessage::Rewrite(_) | AofMessage::RewriteSharded(_)),
+ "try_send_rewrite called with a non-Rewrite variant",
+ );
+ if self.layout == AofLayout::PerShard {
+ return Err(AofPoolSendError::RewriteUnsupportedInPerShard);
+ }
+ self.senders[0]
+ .try_send(msg)
+ .map_err(|_| AofPoolSendError::SendFailed)
+ }
+
+ /// Broadcast `Shutdown` to every writer. Used by orchestrated shutdown
+ /// paths in `main.rs`/`embedded.rs`. Each writer drains its channel and
+ /// fsyncs before exiting.
+ pub fn broadcast_shutdown(&self) {
+ for s in &self.senders {
+ let _ = s.try_send(AofMessage::Shutdown);
+ }
+ }
+
+ /// Number of underlying writer senders. 1 for TopLevel, num_shards for
+ /// PerShard.
+ #[inline]
+ pub fn num_writers(&self) -> usize {
+ self.senders.len()
+ }
+
+ /// Reports the pool's layout. Useful for places that need to refuse
+ /// PerShard-incompatible legacy code paths with a clear error.
+ #[inline]
+ pub fn layout(&self) -> crate::persistence::aof_manifest::AofLayout {
+ self.layout
+ }
+}
+
+#[cfg(test)]
+mod pool_tests {
+ use super::*;
+ use crate::persistence::aof_manifest::AofLayout;
+ use crate::runtime::channel;
+
+ #[test]
+ fn top_level_pool_routes_all_shards_to_writer_zero() {
+ let (tx, rx) = channel::mpsc_bounded::(8);
+ let pool = AofWriterPool::top_level(tx);
+ assert_eq!(pool.num_writers(), 1);
+ assert_eq!(pool.layout(), AofLayout::TopLevel);
+
+ pool.try_send_append(0, 0, Bytes::from_static(b"a"));
+ pool.try_send_append(7, 0, Bytes::from_static(b"b"));
+ pool.try_send_append(42, 0, Bytes::from_static(b"c"));
+
+ let mut seen = 0;
+ while rx.try_recv().is_ok() {
+ seen += 1;
+ }
+ assert_eq!(seen, 3, "all 3 appends should land on writer 0");
+ }
+
+ #[test]
+ fn per_shard_pool_routes_each_shard_to_its_own_writer() {
+ let (tx0, rx0) = channel::mpsc_bounded::(8);
+ let (tx1, rx1) = channel::mpsc_bounded::(8);
+ let (tx2, rx2) = channel::mpsc_bounded::(8);
+ let pool = AofWriterPool::per_shard(vec![tx0, tx1, tx2]);
+ assert_eq!(pool.num_writers(), 3);
+ assert_eq!(pool.layout(), AofLayout::PerShard);
+
+ pool.try_send_append(0, 100, Bytes::from_static(b"shard0"));
+ pool.try_send_append(1, 200, Bytes::from_static(b"shard1a"));
+ pool.try_send_append(1, 300, Bytes::from_static(b"shard1b"));
+ pool.try_send_append(2, 400, Bytes::from_static(b"shard2"));
+
+ let count = |rx: &channel::MpscReceiver| -> usize {
+ let mut n = 0;
+ while rx.try_recv().is_ok() {
+ n += 1;
+ }
+ n
+ };
+ assert_eq!(count(&rx0), 1, "shard 0 writer should receive exactly 1");
+ assert_eq!(count(&rx1), 2, "shard 1 writer should receive exactly 2");
+ assert_eq!(count(&rx2), 1, "shard 2 writer should receive exactly 1");
+ }
+
+ #[test]
+ fn per_shard_pool_rejects_rewrite_with_explicit_error() {
+ let (tx0, _rx0) = channel::mpsc_bounded::(8);
+ let (tx1, _rx1) = channel::mpsc_bounded::(8);
+ let pool = AofWriterPool::per_shard(vec![tx0, tx1]);
+
+ let dummies: SharedDatabases = Arc::new(vec![]);
+ let err = pool.try_send_rewrite(AofMessage::Rewrite(dummies)).unwrap_err();
+ assert_eq!(err, AofPoolSendError::RewriteUnsupportedInPerShard);
+ }
+
+ #[test]
+ fn top_level_pool_accepts_rewrite() {
+ let (tx, rx) = channel::mpsc_bounded::(8);
+ let pool = AofWriterPool::top_level(tx);
+
+ let dummies: SharedDatabases = Arc::new(vec![]);
+ pool.try_send_rewrite(AofMessage::Rewrite(dummies)).unwrap();
+ assert!(matches!(rx.try_recv(), Ok(AofMessage::Rewrite(_))));
+ }
+
+ #[test]
+ fn per_shard_pool_threads_lsn_field_to_each_writer() {
+ // Step 3 wire-format contract: try_send_append carries the issued LSN
+ // through to the writer task, which writes it as the per-entry header
+ // under PerShard layout. This unit test pins the channel-side contract
+ // (the disk-side framing is covered by writer-task integration).
+ let (tx0, rx0) = channel::mpsc_bounded::(4);
+ let (tx1, rx1) = channel::mpsc_bounded::(4);
+ let pool = AofWriterPool::per_shard(vec![tx0, tx1]);
+
+ pool.try_send_append(0, 42, Bytes::from_static(b"set foo 1"));
+ pool.try_send_append(1, 43, Bytes::from_static(b"set bar 2"));
+ pool.try_send_append(0, 44, Bytes::from_static(b"del foo"));
+
+ // Shard 0 should see (42, "set foo 1") then (44, "del foo").
+ match rx0.try_recv() {
+ Ok(AofMessage::Append { lsn, bytes }) => {
+ assert_eq!(lsn, 42, "shard 0 first entry lsn");
+ assert_eq!(bytes.as_ref(), b"set foo 1");
+ }
+ other => panic!("shard 0 first recv expected Append, got {:?}", other.is_ok()),
+ }
+ match rx0.try_recv() {
+ Ok(AofMessage::Append { lsn, bytes }) => {
+ assert_eq!(lsn, 44, "shard 0 second entry lsn");
+ assert_eq!(bytes.as_ref(), b"del foo");
+ }
+ other => panic!("shard 0 second recv expected Append, got {:?}", other.is_ok()),
+ }
+ // Shard 1 should see (43, "set bar 2") only.
+ match rx1.try_recv() {
+ Ok(AofMessage::Append { lsn, bytes }) => {
+ assert_eq!(lsn, 43, "shard 1 entry lsn");
+ assert_eq!(bytes.as_ref(), b"set bar 2");
+ }
+ other => panic!("shard 1 recv expected Append, got {:?}", other.is_ok()),
+ }
+ }
+
+ #[test]
+ fn broadcast_shutdown_reaches_every_writer() {
+ let (tx0, rx0) = channel::mpsc_bounded::(2);
+ let (tx1, rx1) = channel::mpsc_bounded::(2);
+ let (tx2, rx2) = channel::mpsc_bounded::(2);
+ let pool = AofWriterPool::per_shard(vec![tx0, tx1, tx2]);
+
+ pool.broadcast_shutdown();
+
+ for (i, rx) in [&rx0, &rx1, &rx2].iter().enumerate() {
+ assert!(
+ matches!(rx.try_recv(), Ok(AofMessage::Shutdown)),
+ "writer {} did not receive Shutdown",
+ i
+ );
+ }
+ }
+}
+
/// Serialize a Frame into RESP wire format bytes.
pub fn serialize_command(frame: &Frame) -> Bytes {
let mut buf = BytesMut::with_capacity(64);
@@ -194,7 +501,10 @@ pub async fn aof_writer_task(
loop {
match rx.recv() {
- Ok(AofMessage::Append(data)) => {
+ // TopLevel writer: legacy v1 disk format is plain RESP. The
+ // LSN is ignored — TopLevel is single-shard so per-shard merge
+ // by LSN is moot.
+ Ok(AofMessage::Append { bytes: data, lsn: _ }) => {
if write_error {
continue; // Drop appends after persistent I/O failure
}
@@ -287,7 +597,8 @@ pub async fn aof_writer_task(
tokio::select! {
msg = rx.recv_async() => {
match msg {
- Ok(AofMessage::Append(data)) => {
+ // TopLevel writer (tokio): legacy v1 plain RESP, lsn ignored.
+ Ok(AofMessage::Append { bytes: data, lsn: _ }) => {
if let Err(e) = writer.write_all(&data).await {
error!("AOF write error: {}", e);
continue;
@@ -369,6 +680,376 @@ pub async fn aof_writer_task(
}
}
+/// Background per-shard AOF writer task (Option B step 2b).
+///
+/// One instance is spawned per shard in `PerShard` layout. Each instance owns
+/// `appendonlydir/shard-{shard_id}/moon.aof.{seq}.incr.aof` exclusively — no
+/// other writer touches that file, so there is no per-file locking.
+///
+/// Differences from [`aof_writer_task`] (TopLevel):
+/// - Opens `manifest.shard_incr_path(shard_id)` instead of `manifest.incr_path()`.
+/// - `Rewrite`/`RewriteSharded` variants are rejected (logged + dropped).
+/// The legacy single-writer rewrite enum has no meaning when each shard
+/// owns its own files; per-shard BGREWRITEAOF lands in RFC step 6.
+/// - Refuses to start if the loaded manifest's layout is `TopLevel` — the
+/// spawn site (step 2f) must only invoke this task body for `PerShard`
+/// layouts. Mismatch is a programmer error.
+///
+/// Wait/timeout/corruption semantics for manifest loading match the existing
+/// `aof_writer_task` (60s bounded wait, hard fail on corrupt manifest).
+pub async fn per_shard_aof_writer_task(
+ rx: channel::MpscReceiver,
+ base_dir: PathBuf,
+ shard_id: u16,
+ fsync: FsyncPolicy,
+ cancel: CancellationToken,
+) {
+ #[cfg(feature = "runtime-tokio")]
+ {
+ use crate::persistence::aof_manifest::{AofLayout, AofManifest};
+ use tokio::io::AsyncWriteExt;
+
+ // Wait for main.rs recovery to create/load the manifest.
+ let manifest_wait_start = Instant::now();
+ const MANIFEST_TIMEOUT: std::time::Duration = std::time::Duration::from_secs(60);
+ let manifest = loop {
+ if cancel.is_cancelled() {
+ info!(
+ "AOF writer shard {}: cancelled while waiting for manifest",
+ shard_id
+ );
+ return;
+ }
+ if manifest_wait_start.elapsed() > MANIFEST_TIMEOUT {
+ error!(
+ "AOF writer shard {}: manifest not found at {} after {:?}. Writer exiting.",
+ shard_id,
+ base_dir.display(),
+ MANIFEST_TIMEOUT,
+ );
+ return;
+ }
+ match AofManifest::load(&base_dir) {
+ Ok(Some(m)) => break m,
+ Ok(None) => {
+ tokio::time::sleep(std::time::Duration::from_millis(50)).await;
+ }
+ Err(e) => {
+ error!(
+ "AOF writer shard {}: manifest corrupt at {}: {}. Persistence disabled.",
+ shard_id,
+ base_dir.display(),
+ e
+ );
+ return;
+ }
+ }
+ };
+
+ if manifest.layout != AofLayout::PerShard {
+ error!(
+ "AOF writer shard {}: layout is {:?}, expected PerShard. \
+ per_shard_aof_writer_task should only be spawned for PerShard layouts. \
+ Writer exiting.",
+ shard_id, manifest.layout
+ );
+ return;
+ }
+ if (shard_id as usize) >= manifest.shards.len() {
+ error!(
+ "AOF writer shard {}: out of range for manifest with {} shards. Writer exiting.",
+ shard_id,
+ manifest.shards.len()
+ );
+ return;
+ }
+
+ let incr_path = manifest.shard_incr_path(shard_id);
+ // Ensure shard-{N}/ exists. The manifest constructor for PerShard
+ // already creates these, but be defensive — a manual deletion or
+ // a manifest written by an older binary could leave them missing.
+ if let Some(parent) = incr_path.parent() {
+ if let Err(e) = tokio::fs::create_dir_all(parent).await {
+ error!(
+ "AOF writer shard {}: failed to create dir {}: {}",
+ shard_id,
+ parent.display(),
+ e
+ );
+ return;
+ }
+ }
+ let file: tokio::fs::File = match tokio::fs::OpenOptions::new()
+ .create(true)
+ .append(true)
+ .open(&incr_path)
+ .await
+ {
+ Ok(f) => f,
+ Err(e) => {
+ error!(
+ "AOF writer shard {}: failed to open incr {}: {}",
+ shard_id,
+ incr_path.display(),
+ e
+ );
+ return;
+ }
+ };
+ info!(
+ "AOF writer shard {}: seq {}, incr={}",
+ shard_id,
+ manifest.seq,
+ incr_path.display()
+ );
+
+ let mut writer = tokio::io::BufWriter::new(file);
+ let mut last_fsync = Instant::now();
+ let mut interval = tokio::time::interval(std::time::Duration::from_secs(1));
+ interval.tick().await;
+
+ loop {
+ tokio::select! {
+ msg = rx.recv_async() => {
+ match msg {
+ // PerShard writer (tokio): per RFC § 2 Rule 1 the on-disk
+ // format is `[u64 lsn LE][u32 len LE][RESP bytes]`. Header
+ // is written sequentially with the body — both calls land
+ // in the same BufWriter so this is one syscall under load.
+ Ok(AofMessage::Append { lsn, bytes: data }) => {
+ let mut header = [0u8; 12];
+ header[..8].copy_from_slice(&lsn.to_le_bytes());
+ header[8..].copy_from_slice(&(data.len() as u32).to_le_bytes());
+ if let Err(e) = writer.write_all(&header).await {
+ error!("AOF header write error shard {}: {}", shard_id, e);
+ continue;
+ }
+ if let Err(e) = writer.write_all(&data).await {
+ error!("AOF write error shard {}: {}", shard_id, e);
+ continue;
+ }
+ if matches!(fsync, FsyncPolicy::Always) {
+ let _ = writer.flush().await;
+ let _ = writer.get_ref().sync_data().await;
+ }
+ }
+ Ok(AofMessage::Rewrite(_)) | Ok(AofMessage::RewriteSharded(_)) => {
+ warn!(
+ "AOF writer shard {}: received Rewrite/RewriteSharded — \
+ not supported in PerShard layout, dropped. \
+ Per-shard BGREWRITEAOF lands in RFC step 6.",
+ shard_id
+ );
+ }
+ Ok(AofMessage::Shutdown) | Err(_) => {
+ let _ = writer.flush().await;
+ let _ = writer.get_ref().sync_data().await;
+ info!("AOF writer shard {} shutting down", shard_id);
+ break;
+ }
+ }
+ }
+ _ = interval.tick(), if fsync == FsyncPolicy::EverySec => {
+ if last_fsync.elapsed() >= std::time::Duration::from_secs(1) {
+ let _ = writer.flush().await;
+ let _ = writer.get_ref().sync_data().await;
+ last_fsync = Instant::now();
+ }
+ }
+ _ = cancel.cancelled() => {
+ let _ = writer.flush().await;
+ let _ = writer.get_ref().sync_data().await;
+ info!("AOF writer shard {} cancelled", shard_id);
+ break;
+ }
+ }
+ }
+ }
+
+ #[cfg(feature = "runtime-monoio")]
+ {
+ use crate::persistence::aof_manifest::{AofLayout, AofManifest};
+ use std::io::Write;
+
+ let manifest_wait_start = Instant::now();
+ const MANIFEST_TIMEOUT: std::time::Duration = std::time::Duration::from_secs(60);
+ let manifest = loop {
+ if cancel.is_cancelled() {
+ info!(
+ "AOF writer shard {}: cancelled while waiting for manifest",
+ shard_id
+ );
+ return;
+ }
+ if manifest_wait_start.elapsed() > MANIFEST_TIMEOUT {
+ error!(
+ "AOF writer shard {}: manifest not found at {} after {:?}. Writer exiting.",
+ shard_id,
+ base_dir.display(),
+ MANIFEST_TIMEOUT,
+ );
+ return;
+ }
+ match AofManifest::load(&base_dir) {
+ Ok(Some(m)) => break m,
+ Ok(None) => {
+ std::thread::sleep(std::time::Duration::from_millis(50));
+ }
+ Err(e) => {
+ error!(
+ "AOF writer shard {}: manifest corrupt at {}: {}. Persistence disabled.",
+ shard_id,
+ base_dir.display(),
+ e
+ );
+ return;
+ }
+ }
+ };
+
+ if manifest.layout != AofLayout::PerShard {
+ error!(
+ "AOF writer shard {}: layout is {:?}, expected PerShard. Writer exiting.",
+ shard_id, manifest.layout
+ );
+ return;
+ }
+ if (shard_id as usize) >= manifest.shards.len() {
+ error!(
+ "AOF writer shard {}: out of range for manifest with {} shards. Writer exiting.",
+ shard_id,
+ manifest.shards.len()
+ );
+ return;
+ }
+
+ let incr_path = manifest.shard_incr_path(shard_id);
+ if let Some(parent) = incr_path.parent() {
+ if let Err(e) = std::fs::create_dir_all(parent) {
+ error!(
+ "AOF writer shard {}: failed to create dir {}: {}",
+ shard_id,
+ parent.display(),
+ e
+ );
+ return;
+ }
+ }
+ let mut file = match std::fs::OpenOptions::new()
+ .create(true)
+ .append(true)
+ .open(&incr_path)
+ {
+ Ok(f) => f,
+ Err(e) => {
+ error!(
+ "AOF writer shard {}: failed to open incr {}: {}",
+ shard_id,
+ incr_path.display(),
+ e
+ );
+ return;
+ }
+ };
+ info!(
+ "AOF writer shard {}: seq {}, incr={}",
+ shard_id,
+ manifest.seq,
+ incr_path.display()
+ );
+
+ let mut last_fsync = Instant::now();
+ let mut write_error = false;
+
+ loop {
+ match rx.recv() {
+ // PerShard writer (monoio): framed `[u64 lsn LE][u32 len LE][RESP]`.
+ // See the tokio twin above for format rationale.
+ Ok(AofMessage::Append { lsn, bytes: data }) => {
+ if write_error {
+ continue;
+ }
+ let mut header = [0u8; 12];
+ header[..8].copy_from_slice(&lsn.to_le_bytes());
+ header[8..].copy_from_slice(&(data.len() as u32).to_le_bytes());
+ if let Err(e) = file.write_all(&header) {
+ error!(
+ "AOF header write failed shard {} (seq {}): {}. Persistence degraded.",
+ shard_id, manifest.seq, e
+ );
+ write_error = true;
+ continue;
+ }
+ if let Err(e) = file.write_all(&data) {
+ error!(
+ "AOF write failed shard {} (seq {}): {}. Persistence degraded.",
+ shard_id, manifest.seq, e
+ );
+ write_error = true;
+ continue;
+ }
+ match fsync {
+ FsyncPolicy::Always => {
+ let t = Instant::now();
+ if let Err(e) = file.flush().and_then(|_| file.sync_data()) {
+ error!(
+ "AOF sync failed shard {} (seq {}, always): {}",
+ shard_id, manifest.seq, e
+ );
+ write_error = true;
+ } else {
+ crate::admin::metrics_setup::record_aof_fsync(
+ t.elapsed().as_micros() as u64,
+ );
+ }
+ }
+ FsyncPolicy::EverySec => {
+ if last_fsync.elapsed() >= std::time::Duration::from_secs(1) {
+ let t = Instant::now();
+ if let Err(e) = file.flush().and_then(|_| file.sync_data()) {
+ error!(
+ "AOF sync failed shard {} (seq {}, everysec): {}",
+ shard_id, manifest.seq, e
+ );
+ } else {
+ crate::admin::metrics_setup::record_aof_fsync(
+ t.elapsed().as_micros() as u64,
+ );
+ last_fsync = Instant::now();
+ }
+ }
+ }
+ FsyncPolicy::No => {}
+ }
+ }
+ Ok(AofMessage::Rewrite(_)) | Ok(AofMessage::RewriteSharded(_)) => {
+ warn!(
+ "AOF writer shard {}: received Rewrite/RewriteSharded — \
+ not supported in PerShard layout, dropped. \
+ Per-shard BGREWRITEAOF lands in RFC step 6.",
+ shard_id
+ );
+ }
+ Ok(AofMessage::Shutdown) | Err(_) => {
+ if !write_error {
+ if let Err(e) = file.flush().and_then(|_| file.sync_data()) {
+ error!(
+ "AOF final sync failed shard {} (seq {}): {}",
+ shard_id, manifest.seq, e
+ );
+ }
+ }
+ info!(
+ "AOF writer shard {} shutting down (monoio, seq {})",
+ shard_id, manifest.seq
+ );
+ break;
+ }
+ }
+ }
+ }
+}
+
/// Replay an AOF file by parsing RESP commands and dispatching them.
///
/// Returns the number of commands successfully replayed.
@@ -789,7 +1470,9 @@ fn drain_pending_appends(
let mut outcome = DrainOutcome::default();
while let Ok(msg) = rx.try_recv() {
match msg {
- AofMessage::Append(data) => {
+ // BGREWRITEAOF drain runs on the TopLevel writer (monoio) only;
+ // PerShard rewrite is RFC step 6. Legacy v1 disk format → ignore lsn.
+ AofMessage::Append { bytes: data, lsn: _ } => {
file.write_all(&data).map_err(|e| AofError::Io {
path: PathBuf::from(""),
source: e,
diff --git a/src/persistence/aof_manifest.rs b/src/persistence/aof_manifest.rs
index ce1efb3b..c6109d87 100644
--- a/src/persistence/aof_manifest.rs
+++ b/src/persistence/aof_manifest.rs
@@ -5,17 +5,34 @@
//! manifest framing is the canonical on-disk marker; the human-readable
//! "v1" umbrella also covers WAL v3 and RDB v2 sub-formats.
//!
-//! Implements the same directory-based AOF format as Redis 7+:
+//! Two on-disk layouts coexist (selected at manifest creation time, never mixed
+//! within one directory):
+//!
+//! **TopLevel (manifest v1, single-shard / legacy):**
//! ```text
//! appendonlydir/
//! moon.aof.1.base.rdb # RDB snapshot base
//! moon.aof.1.incr.aof # Incremental RESP since base
-//! moon.aof.manifest # This file
+//! moon.aof.manifest # v1 text format
+//! ```
+//!
+//! **PerShard (manifest v2, multi-shard durability):**
+//! ```text
+//! appendonlydir/
+//! moon.aof.manifest # v2 text format (carries shard count + max_lsn)
+//! shard-0/
+//! moon.aof.1.base.rdb
+//! moon.aof.1.incr.aof
+//! shard-1/
+//! moon.aof.1.base.rdb
+//! moon.aof.1.incr.aof
+//! …
//! ```
//!
-//! The manifest is a simple text file listing the active base and incremental
-//! files with their sequence numbers. On BGREWRITEAOF, the sequence increments,
-//! a new base + incr pair is created, and old files are deleted.
+//! The manifest text format is line-prefix based. v1 manifests have no
+//! `version` line; v2 manifests begin with `version 2`. On BGREWRITEAOF the
+//! sequence increments, a new base + incr pair is created per shard (PerShard)
+//! or at top level (TopLevel), and old files are deleted.
use std::io::Write;
use std::path::{Path, PathBuf};
@@ -25,13 +42,44 @@ use tracing::{error, info, warn};
const MANIFEST_NAME: &str = "moon.aof.manifest";
const AOF_DIR_NAME: &str = "appendonlydir";
+/// On-disk layout discriminator.
+///
+/// `TopLevel` is the legacy single-shard layout from manifest v1. `PerShard`
+/// is the multi-shard layout introduced with manifest v2 — used whenever
+/// `num_shards >= 2`. A `--shards 1` deployment with an existing v1 manifest
+/// stays TopLevel until explicitly migrated.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub enum AofLayout {
+ /// Legacy single-shard layout: `appendonlydir/moon.aof.{seq}.{base|incr}.*`.
+ TopLevel,
+ /// Per-shard layout: `appendonlydir/shard-{N}/moon.aof.{seq}.{base|incr}.*`.
+ PerShard,
+}
+
+/// Per-shard manifest entry. One per shard in `PerShard` layout.
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct ShardManifest {
+ /// Shard ID (0..num_shards).
+ pub shard_id: u16,
+ /// Max LSN persisted to this shard's incr file so far. Semantics defined
+ /// by step 3 (LSN tagging) of the per-shard AOF RFC — until then this is
+ /// 0 and recovery does not use it. Once step 3 ships, recovery seeds
+ /// `master_repl_offset = max(shards[*].max_lsn)` before accepting writes.
+ pub max_lsn: u64,
+}
+
/// Active AOF file set tracked by the manifest.
#[derive(Debug, Clone)]
pub struct AofManifest {
/// Base directory (parent of `appendonlydir/`)
pub dir: PathBuf,
- /// Current sequence number (incremented on each rewrite)
+ /// Current sequence number (incremented on each rewrite).
pub seq: u64,
+ /// On-disk layout. Determines path computation for base/incr files.
+ pub layout: AofLayout,
+ /// Per-shard metadata. Length is 1 for `TopLevel`, `num_shards` for
+ /// `PerShard`. Indexed by `shard_id`.
+ pub shards: Vec,
}
impl AofManifest {
@@ -46,25 +94,76 @@ impl AofManifest {
}
/// Path to the base RDB file for the current sequence.
+ ///
+ /// Layout-aware: TopLevel returns `appendonlydir/moon.aof.{seq}.base.rdb`;
+ /// PerShard routes to `appendonlydir/shard-0/moon.aof.{seq}.base.rdb`.
+ /// This single-file helper is meaningful only when there is one shard
+ /// (post-migration `--shards 1`); a multi-shard PerShard manifest has N
+ /// base files and the caller must use [`Self::shard_base_path`] instead.
+ /// In debug builds, calling this on a multi-shard PerShard manifest
+ /// asserts; in release it returns the shard-0 path so production stays
+ /// recoverable rather than panicking on a stale call site.
pub fn base_path(&self) -> PathBuf {
- self.aof_dir()
- .join(format!("moon.aof.{}.base.rdb", self.seq))
+ match self.layout {
+ AofLayout::TopLevel => self
+ .aof_dir()
+ .join(format!("moon.aof.{}.base.rdb", self.seq)),
+ AofLayout::PerShard => {
+ debug_assert!(
+ self.shards.len() == 1,
+ "base_path() called on multi-shard PerShard manifest; use shard_base_path(shard_id)",
+ );
+ self.shard_base_path_seq(0, self.seq)
+ }
+ }
}
/// Path to the incremental RESP file for the current sequence.
+ ///
+ /// Layout-aware — see [`Self::base_path`] for the same routing rules.
pub fn incr_path(&self) -> PathBuf {
- self.aof_dir()
- .join(format!("moon.aof.{}.incr.aof", self.seq))
+ match self.layout {
+ AofLayout::TopLevel => self
+ .aof_dir()
+ .join(format!("moon.aof.{}.incr.aof", self.seq)),
+ AofLayout::PerShard => {
+ debug_assert!(
+ self.shards.len() == 1,
+ "incr_path() called on multi-shard PerShard manifest; use shard_incr_path(shard_id)",
+ );
+ self.shard_incr_path_seq(0, self.seq)
+ }
+ }
}
- /// Path to the base RDB file for a given sequence.
+ /// Path to the base RDB file for a given sequence. Layout-aware — see
+ /// [`Self::base_path`].
pub fn base_path_seq(&self, seq: u64) -> PathBuf {
- self.aof_dir().join(format!("moon.aof.{}.base.rdb", seq))
+ match self.layout {
+ AofLayout::TopLevel => self.aof_dir().join(format!("moon.aof.{}.base.rdb", seq)),
+ AofLayout::PerShard => {
+ debug_assert!(
+ self.shards.len() == 1,
+ "base_path_seq() called on multi-shard PerShard manifest; use shard_base_path_seq(shard_id, seq)",
+ );
+ self.shard_base_path_seq(0, seq)
+ }
+ }
}
- /// Path to the incremental RESP file for a given sequence.
+ /// Path to the incremental RESP file for a given sequence. Layout-aware —
+ /// see [`Self::base_path`].
pub fn incr_path_seq(&self, seq: u64) -> PathBuf {
- self.aof_dir().join(format!("moon.aof.{}.incr.aof", seq))
+ match self.layout {
+ AofLayout::TopLevel => self.aof_dir().join(format!("moon.aof.{}.incr.aof", seq)),
+ AofLayout::PerShard => {
+ debug_assert!(
+ self.shards.len() == 1,
+ "incr_path_seq() called on multi-shard PerShard manifest; use shard_incr_path_seq(shard_id, seq)",
+ );
+ self.shard_incr_path_seq(0, seq)
+ }
+ }
}
/// Create the `appendonlydir/` and write the initial manifest.
@@ -82,6 +181,11 @@ impl AofManifest {
let manifest = Self {
dir: dir.to_path_buf(),
seq: 1,
+ layout: AofLayout::TopLevel,
+ shards: vec![ShardManifest {
+ shard_id: 0,
+ max_lsn: 0,
+ }],
};
std::fs::create_dir_all(manifest.aof_dir())?;
@@ -119,6 +223,11 @@ impl AofManifest {
let manifest = Self {
dir: dir.to_path_buf(),
seq: 1,
+ layout: AofLayout::TopLevel,
+ shards: vec![ShardManifest {
+ shard_id: 0,
+ max_lsn: 0,
+ }],
};
std::fs::create_dir_all(manifest.aof_dir())?;
@@ -157,6 +266,53 @@ impl AofManifest {
let content = std::fs::read_to_string(&manifest_path)?;
+ // Detect format version. v1 manifests have no `version` line and use
+ // line prefixes `seq`/`base`/`incr`. v2 manifests start with `version 2`
+ // and carry per-shard records.
+ let mut format_version: u8 = 1;
+ for line in content.lines() {
+ let line = line.trim();
+ if let Some(val) = line.strip_prefix("version ") {
+ if let Ok(v) = val.parse::() {
+ format_version = v;
+ }
+ break;
+ }
+ if !line.is_empty() {
+ // First non-blank line is not a version header → v1.
+ break;
+ }
+ }
+
+ let manifest = match format_version {
+ 1 => Self::parse_v1(&content, dir, &manifest_path)?,
+ 2 => Self::parse_v2(&content, dir, &manifest_path)?,
+ other => {
+ return Err(std::io::Error::new(
+ std::io::ErrorKind::InvalidData,
+ format!(
+ "AOF manifest at {} has unsupported format version {} (max supported: 2)",
+ manifest_path.display(),
+ other,
+ ),
+ ));
+ }
+ };
+
+ // Best-effort orphan cleanup: delete stray base/incr files from aborted
+ // rewrites. A crash between advance() steps 1-3 leaves a new base RDB on
+ // disk that the active manifest never references. Without this sweep,
+ // repeated crashes during rewrite can fill the disk with zombie files.
+ //
+ // Safe to call here: parse_* verified the manifest has all required
+ // records, so cleanup_orphans won't delete the active files.
+ manifest.cleanup_orphans();
+
+ Ok(Some(manifest))
+ }
+
+ /// Parse a v1 (TopLevel, single-shard) manifest.
+ fn parse_v1(content: &str, dir: &Path, manifest_path: &Path) -> std::io::Result {
let mut seq = 0u64;
let mut has_base_record = false;
let mut has_incr_record = false;
@@ -183,10 +339,6 @@ impl AofManifest {
));
}
- // A valid manifest must have all three records (seq, base, incr).
- // A truncated manifest with only "seq N" but no base/incr lines could
- // trigger orphan cleanup that deletes the real base RDB referenced by
- // the previous valid manifest. Require all records before proceeding.
if !has_base_record || !has_incr_record {
return Err(std::io::Error::new(
std::io::ErrorKind::InvalidData,
@@ -200,21 +352,182 @@ impl AofManifest {
));
}
- let manifest = Self {
+ Ok(Self {
dir: dir.to_path_buf(),
seq,
- };
+ layout: AofLayout::TopLevel,
+ shards: vec![ShardManifest {
+ shard_id: 0,
+ max_lsn: 0,
+ }],
+ })
+ }
- // Best-effort orphan cleanup: delete stray base/incr files from aborted
- // rewrites. A crash between advance() steps 1-3 leaves a new base RDB on
- // disk that the active manifest never references. Without this sweep,
- // repeated crashes during rewrite can fill the disk with zombie files.
- //
- // Safe to call here: we verified the manifest has all three records
- // (seq, base, incr), so cleanup_orphans won't delete the active files.
- manifest.cleanup_orphans();
+ /// Parse a v2 (PerShard, multi-shard) manifest.
+ ///
+ /// Expected line format:
+ /// ```text
+ /// version 2
+ /// seq N
+ /// shards K
+ /// shard 0 max_lsn LSN0
+ /// shard 1 max_lsn LSN1
+ /// ...
+ /// ```
+ ///
+ /// Per-shard `base`/`incr` paths are derived from `shard-{N}/moon.aof.{seq}.*`
+ /// rather than stored explicitly — the layout is canonical, so storing
+ /// paths invites drift between the stored value and the computed one.
+ fn parse_v2(content: &str, dir: &Path, manifest_path: &Path) -> std::io::Result {
+ let mut seq = 0u64;
+ let mut num_shards: Option = None;
+ let mut shards: Vec = Vec::new();
- Ok(Some(manifest))
+ for line in content.lines() {
+ let line = line.trim();
+ if line.is_empty() || line.starts_with('#') {
+ continue;
+ }
+ if line == "version 2" {
+ continue;
+ } else if let Some(val) = line.strip_prefix("seq ") {
+ seq = val.parse::().map_err(|e| {
+ std::io::Error::new(
+ std::io::ErrorKind::InvalidData,
+ format!(
+ "AOF manifest at {} has invalid seq line `{}`: {}",
+ manifest_path.display(),
+ line,
+ e,
+ ),
+ )
+ })?;
+ } else if let Some(val) = line.strip_prefix("shards ") {
+ num_shards = Some(val.parse::().map_err(|e| {
+ std::io::Error::new(
+ std::io::ErrorKind::InvalidData,
+ format!(
+ "AOF manifest at {} has invalid shards line `{}`: {}",
+ manifest_path.display(),
+ line,
+ e,
+ ),
+ )
+ })?);
+ } else if let Some(rest) = line.strip_prefix("shard ") {
+ // Format: `shard max_lsn `
+ let mut it = rest.split_whitespace();
+ let id_str = it.next().ok_or_else(|| {
+ std::io::Error::new(
+ std::io::ErrorKind::InvalidData,
+ format!(
+ "AOF manifest at {} has shard line missing id: `{}`",
+ manifest_path.display(),
+ line,
+ ),
+ )
+ })?;
+ let id: u16 = id_str.parse().map_err(|e| {
+ std::io::Error::new(
+ std::io::ErrorKind::InvalidData,
+ format!(
+ "AOF manifest at {} has shard line invalid id `{}`: {}",
+ manifest_path.display(),
+ id_str,
+ e,
+ ),
+ )
+ })?;
+ // Expect `max_lsn `.
+ let label = it.next().unwrap_or("");
+ let val_str = it.next().unwrap_or("0");
+ if label != "max_lsn" {
+ return Err(std::io::Error::new(
+ std::io::ErrorKind::InvalidData,
+ format!(
+ "AOF manifest at {} shard {} expected `max_lsn`, got `{}`",
+ manifest_path.display(),
+ id,
+ label,
+ ),
+ ));
+ }
+ let max_lsn: u64 = val_str.parse().map_err(|e| {
+ std::io::Error::new(
+ std::io::ErrorKind::InvalidData,
+ format!(
+ "AOF manifest at {} shard {} invalid max_lsn `{}`: {}",
+ manifest_path.display(),
+ id,
+ val_str,
+ e,
+ ),
+ )
+ })?;
+ shards.push(ShardManifest {
+ shard_id: id,
+ max_lsn,
+ });
+ }
+ // Unknown lines are tolerated (forward-compat). Strict parsers can
+ // be added at v3 if needed.
+ }
+
+ if seq == 0 {
+ return Err(std::io::Error::new(
+ std::io::ErrorKind::InvalidData,
+ format!(
+ "AOF manifest at {} has no valid sequence number",
+ manifest_path.display()
+ ),
+ ));
+ }
+
+ let expected = num_shards.ok_or_else(|| {
+ std::io::Error::new(
+ std::io::ErrorKind::InvalidData,
+ format!(
+ "AOF manifest at {} is missing required `shards N` line",
+ manifest_path.display()
+ ),
+ )
+ })?;
+
+ if shards.len() != expected as usize {
+ return Err(std::io::Error::new(
+ std::io::ErrorKind::InvalidData,
+ format!(
+ "AOF manifest at {} declares shards={} but has {} shard records",
+ manifest_path.display(),
+ expected,
+ shards.len(),
+ ),
+ ));
+ }
+
+ // Sort by shard_id and verify contiguous range [0, expected).
+ shards.sort_by_key(|s| s.shard_id);
+ for (i, s) in shards.iter().enumerate() {
+ if s.shard_id as usize != i {
+ return Err(std::io::Error::new(
+ std::io::ErrorKind::InvalidData,
+ format!(
+ "AOF manifest at {} has non-contiguous shard ids (expected {} at position {}, got {})",
+ manifest_path.display(),
+ i,
+ i,
+ s.shard_id,
+ ),
+ ));
+ }
+ }
+
+ Ok(Self {
+ dir: dir.to_path_buf(),
+ seq,
+ layout: AofLayout::PerShard,
+ shards,
+ })
}
/// Delete any base/incr files in `appendonlydir/` that do not match the
@@ -258,14 +571,33 @@ impl AofManifest {
}
/// Write the manifest file atomically (write tmp + rename).
+ ///
+ /// Emits v1 format for `TopLevel` and v2 for `PerShard`. The format is
+ /// selected by `self.layout`, never by callers — preserving the invariant
+ /// that one directory holds one layout.
pub fn write_manifest(&self) -> std::io::Result<()> {
let manifest_path = self.manifest_path();
let tmp_path = manifest_path.with_extension("tmp");
- let content = format!(
- "seq {}\nbase moon.aof.{}.base.rdb\nincr moon.aof.{}.incr.aof\n",
- self.seq, self.seq, self.seq
- );
+ let content = match self.layout {
+ AofLayout::TopLevel => format!(
+ "seq {}\nbase moon.aof.{}.base.rdb\nincr moon.aof.{}.incr.aof\n",
+ self.seq, self.seq, self.seq
+ ),
+ AofLayout::PerShard => {
+ let mut s = String::with_capacity(64 + self.shards.len() * 40);
+ s.push_str("version 2\n");
+ s.push_str(&format!("seq {}\n", self.seq));
+ s.push_str(&format!("shards {}\n", self.shards.len()));
+ for shard in &self.shards {
+ s.push_str(&format!(
+ "shard {} max_lsn {}\n",
+ shard.shard_id, shard.max_lsn
+ ));
+ }
+ s
+ }
+ };
let mut f = std::fs::File::create(&tmp_path)?;
f.write_all(content.as_bytes())?;
@@ -274,6 +606,291 @@ impl AofManifest {
Ok(())
}
+ // ------------------------------------------------------------------
+ // Per-shard layout helpers
+ // ------------------------------------------------------------------
+
+ /// Directory holding a shard's AOF files.
+ ///
+ /// - `TopLevel`: `appendonlydir/` (the shard_id argument is asserted to be 0).
+ /// - `PerShard`: `appendonlydir/shard-{shard_id}/`.
+ pub fn shard_dir(&self, shard_id: u16) -> PathBuf {
+ match self.layout {
+ AofLayout::TopLevel => {
+ debug_assert_eq!(shard_id, 0, "TopLevel layout only has shard 0");
+ self.aof_dir()
+ }
+ AofLayout::PerShard => self.aof_dir().join(format!("shard-{}", shard_id)),
+ }
+ }
+
+ /// Path to a shard's base RDB file for the current sequence.
+ pub fn shard_base_path(&self, shard_id: u16) -> PathBuf {
+ self.shard_dir(shard_id)
+ .join(format!("moon.aof.{}.base.rdb", self.seq))
+ }
+
+ /// Path to a shard's incremental RESP file for the current sequence.
+ pub fn shard_incr_path(&self, shard_id: u16) -> PathBuf {
+ self.shard_dir(shard_id)
+ .join(format!("moon.aof.{}.incr.aof", self.seq))
+ }
+
+ /// Path to a shard's base RDB file for a given sequence.
+ pub fn shard_base_path_seq(&self, shard_id: u16, seq: u64) -> PathBuf {
+ self.shard_dir(shard_id)
+ .join(format!("moon.aof.{}.base.rdb", seq))
+ }
+
+ /// Path to a shard's incremental RESP file for a given sequence.
+ pub fn shard_incr_path_seq(&self, shard_id: u16, seq: u64) -> PathBuf {
+ self.shard_dir(shard_id)
+ .join(format!("moon.aof.{}.incr.aof", seq))
+ }
+
+ /// Maximum LSN persisted across all shards.
+ ///
+ /// Computed (not stored) so the stored value can never drift from
+ /// the per-shard records. Returns 0 if `shards` is empty (defensive;
+ /// constructors guarantee at least one shard).
+ pub fn global_max_lsn(&self) -> u64 {
+ self.shards.iter().map(|s| s.max_lsn).max().unwrap_or(0)
+ }
+
+ /// Verify that the manifest matches the runtime shard count.
+ ///
+ /// Returns the verbatim error from RFC § 3 if the shard count differs,
+ /// for operator-facing consistency. Callers (typically `main.rs` boot)
+ /// should treat this as fatal: continuing with a mismatched shard count
+ /// silently drops data from shards that no longer exist or replays a
+ /// shard's data into the wrong DashTable.
+ pub fn verify_shard_count(&self, expected: u16) -> Result<(), String> {
+ let actual = self.shards.len() as u16;
+ if actual != expected {
+ return Err(format!(
+ "ERR shard count changed (manifest={}, config={}); refusing to start to avoid data loss. See docs/runbooks/shard-count-change.md",
+ actual, expected
+ ));
+ }
+ Ok(())
+ }
+
+ /// Returns true if the on-disk layout under `appendonlydir/` matches the
+ /// legacy TopLevel format (files at top level, no `shard-N/` subdirs).
+ ///
+ /// Used by callers to detect when a v1 single-shard deployment is being
+ /// upgraded to v2 multi-shard and needs explicit migration. Does NOT
+ /// migrate — separate from `migrate_top_level_to_per_shard` so the side
+ /// effect is opt-in, not hidden in a load path.
+ pub fn is_legacy_top_level_layout(dir: &Path) -> bool {
+ let aof_dir = dir.join(AOF_DIR_NAME);
+ if !aof_dir.exists() {
+ return false;
+ }
+ let entries = match std::fs::read_dir(&aof_dir) {
+ Ok(e) => e,
+ Err(_) => return false,
+ };
+ for entry in entries.flatten() {
+ let name = entry.file_name();
+ let Some(name_str) = name.to_str() else {
+ continue;
+ };
+ if name_str.starts_with("moon.aof.")
+ && (name_str.ends_with(".base.rdb") || name_str.ends_with(".incr.aof"))
+ {
+ return true;
+ }
+ }
+ false
+ }
+
+ /// Migrate a single-shard TopLevel layout in place to a single-shard
+ /// PerShard layout.
+ ///
+ /// Moves `appendonlydir/moon.aof.{seq}.{base.rdb,incr.aof}` into
+ /// `appendonlydir/shard-0/`, then rewrites the manifest as v2 with
+ /// `shards 1`. Idempotent: a second call on an already-PerShard manifest
+ /// returns Ok with no filesystem changes.
+ ///
+ /// This is the RFC § 5 case 1 migration — zero data movement (rename only),
+ /// safe to run on first boot after upgrading from v0.1.x. Multi-shard
+ /// migrations from legacy AOF (case 2) use the `moon migrate-aof`
+ /// subcommand and are NOT handled here.
+ pub fn migrate_top_level_to_per_shard(&mut self) -> std::io::Result<()> {
+ if self.layout == AofLayout::PerShard {
+ return Ok(());
+ }
+ if self.shards.len() != 1 {
+ return Err(std::io::Error::new(
+ std::io::ErrorKind::InvalidInput,
+ format!(
+ "migrate_top_level_to_per_shard called with {} shards; \
+ only single-shard TopLevel can be migrated in place",
+ self.shards.len()
+ ),
+ ));
+ }
+
+ // Compute paths up front. shard_dir/shard_*_path_seq for a single-
+ // shard target are pure path computations and do NOT depend on
+ // self.layout, so it is safe to derive them while layout is still
+ // TopLevel.
+ let old_base = self.aof_dir().join(format!("moon.aof.{}.base.rdb", self.seq));
+ let old_incr = self.aof_dir().join(format!("moon.aof.{}.incr.aof", self.seq));
+ let new_dir = self.aof_dir().join("shard-0");
+ let new_base = new_dir.join(format!("moon.aof.{}.base.rdb", self.seq));
+ let new_incr = new_dir.join(format!("moon.aof.{}.incr.aof", self.seq));
+
+ if !old_base.exists() {
+ // Pre-flight check: nothing moved yet, no rollback needed.
+ return Err(std::io::Error::new(
+ std::io::ErrorKind::NotFound,
+ format!(
+ "TopLevel→PerShard migration: source base {} not found",
+ old_base.display()
+ ),
+ ));
+ }
+ std::fs::create_dir_all(&new_dir)?;
+
+ // Move base. If this fails, no on-disk mutation happened yet — bail
+ // without rollback. Layout stays TopLevel until commit at the bottom.
+ std::fs::rename(&old_base, &new_base)?;
+
+ // Base is now in shard-0/. Any subsequent error must restore it.
+ let moved_incr: bool;
+ let created_incr: bool;
+ if old_incr.exists() {
+ if let Err(e) = std::fs::rename(&old_incr, &new_incr) {
+ if let Err(re) = std::fs::rename(&new_base, &old_base) {
+ error!(
+ "Migration rollback: failed to restore base {} → {}: {}",
+ new_base.display(),
+ old_base.display(),
+ re
+ );
+ }
+ return Err(e);
+ }
+ moved_incr = true;
+ created_incr = false;
+ } else {
+ match std::fs::File::create(&new_incr) {
+ Ok(_) => {
+ moved_incr = false;
+ created_incr = true;
+ }
+ Err(e) => {
+ if let Err(re) = std::fs::rename(&new_base, &old_base) {
+ error!(
+ "Migration rollback: failed to restore base {} → {}: {}",
+ new_base.display(),
+ old_base.display(),
+ re
+ );
+ }
+ return Err(e);
+ }
+ }
+ }
+
+ // Commit: flip layout, persist as v2. If write_manifest fails, undo
+ // every filesystem mutation and restore layout so the next boot still
+ // sees a valid v1 TopLevel deployment.
+ self.layout = AofLayout::PerShard;
+ if let Err(e) = self.write_manifest() {
+ self.layout = AofLayout::TopLevel;
+ if moved_incr {
+ if let Err(re) = std::fs::rename(&new_incr, &old_incr) {
+ error!(
+ "Migration rollback: failed to restore incr {} → {}: {}",
+ new_incr.display(),
+ old_incr.display(),
+ re
+ );
+ }
+ } else if created_incr {
+ if let Err(re) = std::fs::remove_file(&new_incr) {
+ warn!(
+ "Migration rollback: failed to remove freshly created incr {}: {}",
+ new_incr.display(),
+ re
+ );
+ }
+ }
+ if let Err(re) = std::fs::rename(&new_base, &old_base) {
+ error!(
+ "Migration rollback: failed to restore base {} → {}: {}. \
+ Manifest dir {} may be in an inconsistent state.",
+ new_base.display(),
+ old_base.display(),
+ re,
+ self.dir.display()
+ );
+ }
+ return Err(e);
+ }
+
+ info!(
+ "AOF migrated: TopLevel → PerShard (single shard) at {}",
+ self.aof_dir().display()
+ );
+ Ok(())
+ }
+
+ /// Create the `appendonlydir/` and write an initial v2 manifest for the
+ /// given shard count.
+ ///
+ /// Each shard gets its own `shard-{N}/` subdirectory with an empty base
+ /// RDB and an empty incr file. Mirrors `initialize()` semantics: the
+ /// `(base + incr)` invariant holds from the first boot, so recovery can
+ /// replay incr-only state without complaint.
+ pub fn initialize_multi(dir: &Path, num_shards: u16) -> std::io::Result {
+ if num_shards == 0 {
+ return Err(std::io::Error::new(
+ std::io::ErrorKind::InvalidInput,
+ "initialize_multi requires num_shards >= 1",
+ ));
+ }
+ let manifest = Self {
+ dir: dir.to_path_buf(),
+ seq: 1,
+ layout: AofLayout::PerShard,
+ shards: (0..num_shards)
+ .map(|id| ShardManifest {
+ shard_id: id,
+ max_lsn: 0,
+ })
+ .collect(),
+ };
+ std::fs::create_dir_all(manifest.aof_dir())?;
+
+ // Per-shard empty RDB. Single Database::default() inside a 1-element
+ // slice matches `initialize()`'s empty-RDB shape for each shard.
+ let empty_dbs: [crate::storage::Database; 0] = [];
+ let empty_rdb = crate::persistence::rdb::save_to_bytes(&empty_dbs)
+ .map_err(|e| std::io::Error::other(format!("empty RDB serialize: {e}")))?;
+
+ for shard_id in 0..num_shards {
+ let shard_dir = manifest.shard_dir(shard_id);
+ std::fs::create_dir_all(&shard_dir)?;
+
+ let base_path = manifest.shard_base_path(shard_id);
+ let tmp_path = base_path.with_extension("rdb.tmp");
+ {
+ let mut f = std::fs::File::create(&tmp_path)?;
+ f.write_all(&empty_rdb)?;
+ f.sync_data()?;
+ }
+ std::fs::rename(&tmp_path, &base_path)?;
+ std::fs::File::create(manifest.shard_incr_path(shard_id))?;
+ }
+
+ manifest.write_manifest()?;
+ Ok(manifest)
+ }
+
/// Advance to the next sequence: write new base RDB, create new incr file,
/// update manifest, delete old files.
///
@@ -517,3 +1134,301 @@ fn replay_incr_resp(
Ok(count)
}
+
+#[cfg(test)]
+mod tests_v2 {
+ //! Unit tests for the v2 (PerShard) manifest format.
+ //!
+ //! Covers the Step 1 deliverable of the per-shard AOF RFC:
+ //! - v1 manifests continue to load as TopLevel (single-shard, shard_id=0)
+ //! - v2 round-trip: write → load → equivalent struct shape
+ //! - shard count mismatch produces the verbatim RFC § 3 error
+ //! - migrate_top_level_to_per_shard performs in-place rename and rewrites
+ //! the manifest as v2
+ //! - global_max_lsn computes max across shards
+ //! - is_legacy_top_level_layout detects top-level files
+
+ use super::*;
+ use std::fs;
+
+ fn temp_dir() -> PathBuf {
+ let d = std::env::temp_dir().join(format!(
+ "moon-aof-manifest-test-{}-{}",
+ std::process::id(),
+ std::time::SystemTime::now()
+ .duration_since(std::time::UNIX_EPOCH)
+ .map(|d| d.as_nanos())
+ .unwrap_or(0)
+ ));
+ fs::create_dir_all(&d).expect("temp dir create");
+ d
+ }
+
+ #[test]
+ fn v1_manifest_loads_as_top_level_single_shard() {
+ let dir = temp_dir();
+ let m = AofManifest::initialize(&dir).expect("initialize v1");
+
+ assert_eq!(m.layout, AofLayout::TopLevel);
+ assert_eq!(m.shards.len(), 1);
+ assert_eq!(m.shards[0].shard_id, 0);
+ assert_eq!(m.shards[0].max_lsn, 0);
+
+ // Reload from disk
+ let reloaded = AofManifest::load(&dir).expect("load").expect("present");
+ assert_eq!(reloaded.layout, AofLayout::TopLevel);
+ assert_eq!(reloaded.shards.len(), 1);
+ assert_eq!(reloaded.seq, m.seq);
+
+ fs::remove_dir_all(&dir).ok();
+ }
+
+ #[test]
+ fn v2_manifest_round_trips() {
+ let dir = temp_dir();
+ let m = AofManifest::initialize_multi(&dir, 4).expect("initialize_multi");
+
+ assert_eq!(m.layout, AofLayout::PerShard);
+ assert_eq!(m.shards.len(), 4);
+ for (i, s) in m.shards.iter().enumerate() {
+ assert_eq!(s.shard_id, i as u16);
+ assert_eq!(s.max_lsn, 0);
+ }
+
+ // Per-shard subdirs were created with empty base + incr.
+ for i in 0..4u16 {
+ assert!(m.shard_dir(i).exists(), "shard-{} dir exists", i);
+ assert!(m.shard_base_path(i).exists(), "shard-{} base exists", i);
+ assert!(m.shard_incr_path(i).exists(), "shard-{} incr exists", i);
+ }
+
+ let reloaded = AofManifest::load(&dir).expect("load").expect("present");
+ assert_eq!(reloaded.layout, AofLayout::PerShard);
+ assert_eq!(reloaded.shards.len(), 4);
+ assert_eq!(reloaded.seq, m.seq);
+ for (i, s) in reloaded.shards.iter().enumerate() {
+ assert_eq!(s.shard_id, i as u16);
+ }
+
+ fs::remove_dir_all(&dir).ok();
+ }
+
+ #[test]
+ fn verify_shard_count_emits_rfc_error_verbatim() {
+ let m = AofManifest {
+ dir: PathBuf::from("/tmp/nowhere"),
+ seq: 1,
+ layout: AofLayout::PerShard,
+ shards: vec![
+ ShardManifest { shard_id: 0, max_lsn: 0 },
+ ShardManifest { shard_id: 1, max_lsn: 0 },
+ ],
+ };
+ let err = m.verify_shard_count(4).expect_err("should mismatch");
+ assert_eq!(
+ err,
+ "ERR shard count changed (manifest=2, config=4); refusing to start to avoid data loss. See docs/runbooks/shard-count-change.md"
+ );
+
+ // Matching count succeeds.
+ m.verify_shard_count(2).expect("match");
+ }
+
+ #[test]
+ fn migrate_top_level_to_per_shard_moves_files_and_rewrites_manifest() {
+ let dir = temp_dir();
+ let mut m = AofManifest::initialize(&dir).expect("initialize v1");
+
+ // Write a marker into the incr file so we can prove the contents
+ // survive the rename.
+ let original_incr = m.aof_dir().join(format!("moon.aof.{}.incr.aof", m.seq));
+ fs::write(&original_incr, b"MARKER").expect("write incr marker");
+
+ m.migrate_top_level_to_per_shard().expect("migrate");
+
+ assert_eq!(m.layout, AofLayout::PerShard);
+ assert!(!original_incr.exists(), "old incr removed by rename");
+ let new_incr = m.shard_incr_path(0);
+ assert!(new_incr.exists(), "new shard-0 incr exists");
+ let contents = fs::read(&new_incr).expect("read new incr");
+ assert_eq!(contents, b"MARKER", "incr contents preserved");
+
+ // Reloaded manifest is v2.
+ let reloaded = AofManifest::load(&dir).expect("load").expect("present");
+ assert_eq!(reloaded.layout, AofLayout::PerShard);
+ assert_eq!(reloaded.shards.len(), 1);
+
+ // Idempotency: second call is a no-op.
+ m.migrate_top_level_to_per_shard().expect("idempotent");
+ assert_eq!(m.layout, AofLayout::PerShard);
+
+ fs::remove_dir_all(&dir).ok();
+ }
+
+ #[test]
+ fn global_max_lsn_returns_max_across_shards() {
+ let m = AofManifest {
+ dir: PathBuf::from("/tmp/nowhere"),
+ seq: 1,
+ layout: AofLayout::PerShard,
+ shards: vec![
+ ShardManifest { shard_id: 0, max_lsn: 100 },
+ ShardManifest { shard_id: 1, max_lsn: 500 },
+ ShardManifest { shard_id: 2, max_lsn: 250 },
+ ],
+ };
+ assert_eq!(m.global_max_lsn(), 500);
+ }
+
+ #[test]
+ fn is_legacy_top_level_layout_detects_v1_files() {
+ let dir = temp_dir();
+ // No appendonlydir yet → false.
+ assert!(!AofManifest::is_legacy_top_level_layout(&dir));
+
+ // After v1 initialize, top-level files present → true.
+ let _m = AofManifest::initialize(&dir).expect("init v1");
+ assert!(AofManifest::is_legacy_top_level_layout(&dir));
+
+ fs::remove_dir_all(&dir).ok();
+ }
+
+ #[test]
+ fn is_legacy_top_level_layout_returns_false_for_v2() {
+ let dir = temp_dir();
+ let _m = AofManifest::initialize_multi(&dir, 2).expect("init v2");
+ assert!(
+ !AofManifest::is_legacy_top_level_layout(&dir),
+ "v2 layout has no top-level moon.aof.* files"
+ );
+
+ fs::remove_dir_all(&dir).ok();
+ }
+
+ #[test]
+ fn parse_v2_rejects_shard_count_mismatch_in_file() {
+ let dir = temp_dir();
+ let aof = dir.join(AOF_DIR_NAME);
+ fs::create_dir_all(&aof).unwrap();
+ // Manifest claims shards 3 but only declares two shard records.
+ fs::write(
+ aof.join(MANIFEST_NAME),
+ "version 2\nseq 1\nshards 3\nshard 0 max_lsn 0\nshard 1 max_lsn 0\n",
+ )
+ .unwrap();
+
+ let err = AofManifest::load(&dir).expect_err("should reject");
+ let msg = err.to_string();
+ assert!(
+ msg.contains("declares shards=3 but has 2 shard records"),
+ "got: {}",
+ msg
+ );
+
+ fs::remove_dir_all(&dir).ok();
+ }
+
+ #[test]
+ fn parse_v2_rejects_non_contiguous_shard_ids() {
+ let dir = temp_dir();
+ let aof = dir.join(AOF_DIR_NAME);
+ fs::create_dir_all(&aof).unwrap();
+ // shards=2 but ids are {0, 2} not {0, 1}.
+ fs::write(
+ aof.join(MANIFEST_NAME),
+ "version 2\nseq 1\nshards 2\nshard 0 max_lsn 0\nshard 2 max_lsn 0\n",
+ )
+ .unwrap();
+
+ let err = AofManifest::load(&dir).expect_err("should reject");
+ let msg = err.to_string();
+ assert!(msg.contains("non-contiguous shard ids"), "got: {}", msg);
+
+ fs::remove_dir_all(&dir).ok();
+ }
+
+ // ------------------------------------------------------------------
+ // Reviewer-flagged fixes: layout-aware path helpers + migration
+ // rollback. See the "Verify findings against current code" review
+ // comment on aof_manifest.rs:669-775 and :688-717.
+ // ------------------------------------------------------------------
+
+ #[test]
+ fn base_incr_paths_route_to_shard_zero_after_migration() {
+ let dir = temp_dir();
+ let mut m = AofManifest::initialize(&dir).expect("init v1");
+ // Pre-migration: TopLevel paths under appendonlydir/ directly.
+ assert_eq!(m.base_path(), m.aof_dir().join("moon.aof.1.base.rdb"));
+ assert_eq!(m.incr_path(), m.aof_dir().join("moon.aof.1.incr.aof"));
+
+ m.migrate_top_level_to_per_shard().expect("migrate");
+
+ // Post-migration: single-file helpers must route to shard-0/ so
+ // replay_multi_part and advance() find the actual files. This is
+ // the bug the reviewer flagged for aof_manifest.rs:669-775.
+ let shard0 = m.aof_dir().join("shard-0");
+ assert_eq!(m.base_path(), shard0.join("moon.aof.1.base.rdb"));
+ assert_eq!(m.incr_path(), shard0.join("moon.aof.1.incr.aof"));
+ assert_eq!(m.base_path_seq(7), shard0.join("moon.aof.7.base.rdb"));
+ assert_eq!(m.incr_path_seq(7), shard0.join("moon.aof.7.incr.aof"));
+ // The path the helper returns must be where the file actually lives.
+ assert!(m.base_path().exists(), "base file at returned path");
+ assert!(m.incr_path().exists(), "incr file at returned path");
+
+ fs::remove_dir_all(&dir).ok();
+ }
+
+ #[test]
+ fn migrate_rolls_back_filesystem_when_incr_rename_fails() {
+ // Simulate the rename(old_incr → new_incr) failure path by making
+ // the destination already exist as a directory (rename onto a
+ // non-empty directory is an error on every supported OS).
+ let dir = temp_dir();
+ let mut m = AofManifest::initialize(&dir).expect("init v1");
+ let original_base = m.aof_dir().join("moon.aof.1.base.rdb");
+ let original_incr = m.aof_dir().join("moon.aof.1.incr.aof");
+ fs::write(&original_incr, b"INCR_MARKER").expect("seed incr");
+ let base_bytes_before = fs::read(&original_base).expect("read base");
+
+ // Pre-create shard-0/moon.aof.1.incr.aof as a DIRECTORY so the
+ // rename fails after the base rename has already succeeded.
+ let shard0 = m.aof_dir().join("shard-0");
+ fs::create_dir_all(shard0.join("moon.aof.1.incr.aof")).expect("seed blocker");
+
+ let err = m
+ .migrate_top_level_to_per_shard()
+ .expect_err("incr rename should fail");
+ let _ = err; // exact error kind depends on OS
+
+ // Rollback invariants:
+ // 1. Layout stays TopLevel in memory.
+ // 2. base file restored to its original TopLevel path.
+ // 3. base file contents unchanged.
+ // 4. on-disk manifest is still v1 (load returns layout TopLevel).
+ assert_eq!(m.layout, AofLayout::TopLevel, "in-memory layout reverted");
+ assert!(original_base.exists(), "base restored to top-level");
+ let base_bytes_after = fs::read(&original_base).expect("read base");
+ assert_eq!(base_bytes_after, base_bytes_before, "base contents intact");
+ let reloaded = AofManifest::load(&dir).expect("load").expect("present");
+ assert_eq!(reloaded.layout, AofLayout::TopLevel, "on-disk manifest v1");
+
+ fs::remove_dir_all(&dir).ok();
+ }
+
+ #[test]
+ fn migrate_does_not_mutate_on_missing_base() {
+ let dir = temp_dir();
+ let mut m = AofManifest::initialize(&dir).expect("init v1");
+ let base = m.aof_dir().join("moon.aof.1.base.rdb");
+ fs::remove_file(&base).expect("remove base");
+
+ let err = m
+ .migrate_top_level_to_per_shard()
+ .expect_err("missing base should fail");
+ assert_eq!(err.kind(), std::io::ErrorKind::NotFound);
+ // Layout never flipped, no rollback needed.
+ assert_eq!(m.layout, AofLayout::TopLevel);
+
+ fs::remove_dir_all(&dir).ok();
+ }
+}
diff --git a/src/replication/state.rs b/src/replication/state.rs
index e5e357c0..1a596faa 100644
--- a/src/replication/state.rs
+++ b/src/replication/state.rs
@@ -113,10 +113,34 @@ impl ReplicationState {
/// Increment the offset for the given shard by delta bytes.
/// Also adds delta to master_repl_offset.
pub fn increment_shard_offset(&self, shard_id: usize, delta: u64) {
- if shard_id < self.shard_offsets.len() {
- self.shard_offsets[shard_id].fetch_add(delta, Ordering::Relaxed);
- self.master_repl_offset.fetch_add(delta, Ordering::Relaxed);
+ let _ = self.issue_lsn(shard_id, delta);
+ }
+
+ /// Atomically issue an LSN for a write and advance per-shard +
+ /// master replication offsets by `delta`.
+ ///
+ /// Returns the LSN that uniquely identifies this write — equal to the
+ /// value of `master_repl_offset` BEFORE the increment, mirroring Redis's
+ /// `+ delta - delta` semantics. The same LSN MUST tag the corresponding
+ /// `AofMessage::Append` entry and the replication backlog entry for that
+ /// write so per-shard AOF replay can rebuild a globally consistent log
+ /// (per-shard AOF RFC § 2 Rule 3).
+ ///
+ /// Atomicity caveat: the per-shard offset advance and the master offset
+ /// advance are TWO separate `fetch_add`s, not one composite op. Concurrent
+ /// callers across shards observe a brief window where the master sum
+ /// disagrees with the sum of shard offsets. Acceptable today because the
+ /// only `total_offset()` consumer is INFO replication, which tolerates
+ /// transient skew. Do not promote to a hard invariant without redesign.
+ ///
+ /// Returns 0 if `shard_id` is out of range (defensive; production callers
+ /// must pass a valid id).
+ pub fn issue_lsn(&self, shard_id: usize, delta: u64) -> u64 {
+ if shard_id >= self.shard_offsets.len() {
+ return 0;
}
+ self.shard_offsets[shard_id].fetch_add(delta, Ordering::Relaxed);
+ self.master_repl_offset.fetch_add(delta, Ordering::Relaxed)
}
/// Returns sum of all per-shard offsets.
diff --git a/src/server/conn/blocking.rs b/src/server/conn/blocking.rs
index 8846d0c8..7cf92727 100644
--- a/src/server/conn/blocking.rs
+++ b/src/server/conn/blocking.rs
@@ -1135,7 +1135,10 @@ pub(crate) fn try_inline_dispatch(
shard_databases: &std::sync::Arc,
shard_id: usize,
selected_db: usize,
- aof_tx: &Option>,
+ aof_pool: &Option>,
+ repl_state: &Option<
+ std::sync::Arc>,
+ >,
now_ms: u64,
num_shards: usize,
can_inline_writes: bool,
@@ -1346,8 +1349,20 @@ pub(crate) fn try_inline_dispatch(
}
// AOF: reuse the frozen RESP bytes directly (Arc clone, zero-copy).
- if let Some(tx) = aof_tx {
- let _ = tx.try_send(crate::persistence::aof::AofMessage::Append(frozen));
+ // This path is monoio inline GET/SET — the writer for the local shard
+ // (shard_id) owns the AOF record; under PerShard layout that routes
+ // to shard_id's writer. LSN must be sourced from `repl_state` so the
+ // inline path's writes share an LSN namespace with the non-inline path
+ // — otherwise per-shard replay merge in RFC step 5 would see two
+ // disjoint LSN streams per shard. Cost: one extra read-lock acquire
+ // (uncontended) + one atomic fetch_add per inline SET.
+ if let Some(pool) = aof_pool {
+ let lsn = crate::persistence::aof::AofWriterPool::issue_append_lsn(
+ repl_state,
+ shard_id,
+ frozen.len(),
+ );
+ pool.try_send_append(shard_id, lsn, frozen);
}
write_buf.extend_from_slice(b"+OK\r\n");
@@ -1363,7 +1378,10 @@ pub(crate) fn try_inline_dispatch_loop(
shard_databases: &std::sync::Arc,
shard_id: usize,
selected_db: usize,
- aof_tx: &Option>,
+ aof_pool: &Option>,
+ repl_state: &Option<
+ std::sync::Arc>,
+ >,
now_ms: u64,
num_shards: usize,
can_inline_writes: bool,
@@ -1377,7 +1395,8 @@ pub(crate) fn try_inline_dispatch_loop(
shard_databases,
shard_id,
selected_db,
- aof_tx,
+ aof_pool,
+ repl_state,
now_ms,
num_shards,
can_inline_writes,
diff --git a/src/server/conn/core.rs b/src/server/conn/core.rs
index 43cef452..dc2e2365 100644
--- a/src/server/conn/core.rs
+++ b/src/server/conn/core.rs
@@ -20,7 +20,7 @@ use std::sync::Arc;
use crate::acl::{AclLog, AclTable};
use crate::blocking::BlockingRegistry;
use crate::config::{RuntimeConfig, ServerConfig};
-use crate::persistence::aof::AofMessage;
+use crate::persistence::aof::AofWriterPool;
use crate::protocol::Frame;
use crate::pubsub::PubSubRegistry;
use crate::runtime::channel;
@@ -47,7 +47,13 @@ pub(crate) struct ConnectionContext {
pub pubsub_registry: Arc>,
pub blocking_registry: Rc>,
pub requirepass: Option,
- pub aof_tx: Option>,
+ /// AOF writer pool — the **sole AOF interface** after the 2d/2e migration
+ /// sequence. Built by spawn sites in `shard/conn_accept.rs` from the
+ /// on-disk manifest layout: TopLevel wraps a single shared writer,
+ /// PerShard owns one sender per shard. `try_send_append(shard_id, bytes)`
+ /// routes to the owning shard; `try_send_rewrite(msg)` rejects under
+ /// PerShard until per-shard rewrite ships (step 6 of the RFC).
+ pub aof_pool: Option>,
pub tracking_table: Rc>,
pub repl_state: Option>>,
/// Lock-free mirror of `repl_state.role == Replica { .. }`.
@@ -96,7 +102,7 @@ impl ConnectionContext {
pubsub_registry: Arc>,
blocking_registry: Rc>,
requirepass: Option,
- aof_tx: Option>,
+ aof_pool: Option>,
tracking_table: Rc>,
repl_state: Option>>,
cluster_state: Option>>,
@@ -136,7 +142,7 @@ impl ConnectionContext {
pubsub_registry,
blocking_registry,
requirepass,
- aof_tx,
+ aof_pool,
tracking_table,
repl_state,
is_replica_mirror,
diff --git a/src/server/conn/handler_monoio/dispatch.rs b/src/server/conn/handler_monoio/dispatch.rs
index 7182cd68..b4e5485f 100644
--- a/src/server/conn/handler_monoio/dispatch.rs
+++ b/src/server/conn/handler_monoio/dispatch.rs
@@ -978,9 +978,9 @@ pub(super) fn try_handle_persistence(
return true;
}
if cmd.eq_ignore_ascii_case(b"BGREWRITEAOF") {
- if let Some(ref tx) = ctx.aof_tx {
+ if let Some(ref pool) = ctx.aof_pool {
responses.push(crate::command::persistence::bgrewriteaof_start_sharded(
- tx,
+ pool,
ctx.shard_databases.clone(),
));
} else {
diff --git a/src/server/conn/handler_monoio/mod.rs b/src/server/conn/handler_monoio/mod.rs
index 72eb1413..f2d8debb 100644
--- a/src/server/conn/handler_monoio/mod.rs
+++ b/src/server/conn/handler_monoio/mod.rs
@@ -20,7 +20,7 @@ use std::rc::Rc;
use crate::command::metadata;
use crate::command::{DispatchResult, dispatch, dispatch_read, is_dispatch_read_supported};
-use crate::persistence::aof::{self, AofMessage};
+use crate::persistence::aof;
use crate::protocol::Frame;
use crate::shard::dispatch::key_to_shard;
use crate::shard::mesh::ChannelMesh;
@@ -483,7 +483,8 @@ pub(crate) async fn handle_connection_sharded_monoio<
&ctx.shard_databases,
ctx.shard_id,
conn.selected_db,
- &ctx.aof_tx,
+ &ctx.aof_pool,
+ &ctx.repl_state,
ctx.cached_clock.ms(),
ctx.num_shards,
can_inline_writes,
@@ -1066,7 +1067,7 @@ pub(crate) async fn handle_connection_sharded_monoio<
}
// Pre-classify write commands for AOF + tracking
- let is_write = if ctx.aof_tx.is_some() || conn.tracking_state.enabled {
+ let is_write = if ctx.aof_pool.is_some() || conn.tracking_state.enabled {
metadata::is_write(cmd)
} else {
false
@@ -1121,9 +1122,14 @@ pub(crate) async fn handle_connection_sharded_monoio<
};
// AOF only on actual success (:1). Matches handler_single.
if matches!(response, Frame::Integer(1)) {
- if let Some(ref tx) = ctx.aof_tx {
+ if let Some(ref pool) = ctx.aof_pool {
let serialized = aof::serialize_command(&frame);
- let _ = tx.try_send(AofMessage::Append(serialized));
+ let lsn = aof::AofWriterPool::issue_append_lsn(
+ &ctx.repl_state,
+ ctx.shard_id,
+ serialized.len(),
+ );
+ pool.try_send_append(ctx.shard_id, lsn, serialized);
}
}
responses.push(response);
@@ -1186,9 +1192,14 @@ pub(crate) async fn handle_connection_sharded_monoio<
// AOF only on actual success (:1). Matches handler_single
// — `:0` (key absent / dst exists w/o REPLACE) is a no-op.
if matches!(response, Frame::Integer(1)) {
- if let Some(ref tx) = ctx.aof_tx {
+ if let Some(ref pool) = ctx.aof_pool {
let serialized = aof::serialize_command(&frame);
- let _ = tx.try_send(AofMessage::Append(serialized));
+ let lsn = aof::AofWriterPool::issue_append_lsn(
+ &ctx.repl_state,
+ ctx.shard_id,
+ serialized.len(),
+ );
+ pool.try_send_append(ctx.shard_id, lsn, serialized);
}
}
responses.push(response);
@@ -1535,9 +1546,14 @@ pub(crate) async fn handle_connection_sharded_monoio<
// AOF logging for successful local writes
if !matches!(response, Frame::Error(_)) && is_write {
- if let Some(ref tx) = ctx.aof_tx {
+ if let Some(ref pool) = ctx.aof_pool {
let serialized = aof::serialize_command(&frame);
- let _ = tx.try_send(AofMessage::Append(serialized));
+ let lsn = aof::AofWriterPool::issue_append_lsn(
+ &ctx.repl_state,
+ ctx.shard_id,
+ serialized.len(),
+ );
+ pool.try_send_append(ctx.shard_id, lsn, serialized);
}
}
@@ -1768,7 +1784,7 @@ pub(crate) async fn handle_connection_sharded_monoio<
let resp_idx = responses.len();
responses.push(Frame::Null); // placeholder, filled after batch dispatch
// Pre-compute AOF bytes before moving frame into Arc
- let aof_bytes = if ctx.aof_tx.is_some() && metadata::is_write(cmd) {
+ let aof_bytes = if ctx.aof_pool.is_some() && metadata::is_write(cmd) {
Some(aof::serialize_command(&dispatch_frame))
} else {
None
@@ -1837,7 +1853,11 @@ pub(crate) async fn handle_connection_sharded_monoio<
if !remote_groups.is_empty() {
reply_futures.clear();
+ // Capture `target` per batch so the cross-shard AOF write at the bottom
+ // of the loop can route to the owning shard's pool (not ctx.shard_id —
+ // mirrors the load-bearing fix at handler_sharded/mod.rs:1651).
let mut oneshot_futures: Vec<(
+ usize, // target shard — owner for AOF append
Vec<(usize, Option, Bytes)>,
channel::OneshotReceiver>,
)> = Vec::new();
@@ -1881,7 +1901,7 @@ pub(crate) async fn handle_connection_sharded_monoio<
}
}
}
- oneshot_futures.push((meta, reply_rx));
+ oneshot_futures.push((target, meta, reply_rx));
}
// Poll all shard responses via pending_wakers relay (monoio cross-thread waker fix).
@@ -1889,7 +1909,7 @@ pub(crate) async fn handle_connection_sharded_monoio<
// Instead, the connection task registers its waker in pending_wakers; the event
// loop drains and wakes them after every SPSC cycle (~1ms). On wake, try_recv()
// checks if the response arrived; if not, re-register and yield again.
- for (meta, reply_rx) in oneshot_futures.drain(..) {
+ for (target, meta, reply_rx) in oneshot_futures.drain(..) {
tracing::trace!(
"Shard {}: awaiting cross-shard response via pending_wakers",
ctx.shard_id
@@ -1931,11 +1951,24 @@ pub(crate) async fn handle_connection_sharded_monoio<
};
for ((resp_idx, aof_bytes, cmd_name), resp) in meta.into_iter().zip(shard_responses)
{
- // AOF logging for successful remote writes
+ // AOF logging for successful remote writes.
+ // Owner shard is `target` (NOT ctx.shard_id) — under PerShard
+ // layout the write must land in the target shard's AOF file
+ // since that shard owns the mutated data. Mirrors the
+ // load-bearing fix at handler_sharded/mod.rs:1651.
if let Some(bytes) = aof_bytes {
if !matches!(resp, Frame::Error(_)) {
- if let Some(ref tx) = ctx.aof_tx {
- let _ = tx.try_send(AofMessage::Append(bytes));
+ if let Some(ref pool) = ctx.aof_pool {
+ // Cross-shard write: LSN must be sourced
+ // using `target`'s shard_id so the
+ // per-shard offset increment lands on the
+ // shard that owns the mutated data.
+ let lsn = aof::AofWriterPool::issue_append_lsn(
+ &ctx.repl_state,
+ target,
+ bytes.len(),
+ );
+ pool.try_send_append(target, lsn, bytes);
}
}
}
diff --git a/src/server/conn/handler_sharded/dispatch.rs b/src/server/conn/handler_sharded/dispatch.rs
index e73b77f2..bcd4e98a 100644
--- a/src/server/conn/handler_sharded/dispatch.rs
+++ b/src/server/conn/handler_sharded/dispatch.rs
@@ -353,9 +353,9 @@ pub(super) fn try_handle_persistence(
return true;
}
if cmd.eq_ignore_ascii_case(b"BGREWRITEAOF") {
- if let Some(ref tx) = ctx.aof_tx {
+ if let Some(ref pool) = ctx.aof_pool {
responses.push(crate::command::persistence::bgrewriteaof_start_sharded(
- tx,
+ pool,
ctx.shard_databases.clone(),
));
} else {
diff --git a/src/server/conn/handler_sharded/mod.rs b/src/server/conn/handler_sharded/mod.rs
index 83256ca8..7f41d655 100644
--- a/src/server/conn/handler_sharded/mod.rs
+++ b/src/server/conn/handler_sharded/mod.rs
@@ -15,7 +15,7 @@ use std::collections::HashMap;
use crate::command::connection as conn_cmd;
use crate::command::metadata;
use crate::command::{DispatchResult, dispatch, dispatch_read};
-use crate::persistence::aof::{self, AofMessage};
+use crate::persistence::aof;
use crate::protocol::Frame;
use crate::shard::dispatch::{ShardMessage, key_to_shard};
use crate::shard::mesh::ChannelMesh;
@@ -1119,8 +1119,8 @@ pub(crate) async fn handle_connection_sharded_inner<
}
}
- let is_write = if ctx.aof_tx.is_some() || conn.tracking_state.enabled { metadata::is_write(cmd) } else { false };
- let aof_bytes = if is_write && ctx.aof_tx.is_some() { Some(aof::serialize_command(&frame)) } else { None };
+ let is_write = if ctx.aof_pool.is_some() || conn.tracking_state.enabled { metadata::is_write(cmd) } else { false };
+ let aof_bytes = if is_write && ctx.aof_pool.is_some() { Some(aof::serialize_command(&frame)) } else { None };
if is_local {
// LOCAL PATH: split into read/write to avoid exclusive lock on reads.
@@ -1172,7 +1172,10 @@ pub(crate) async fn handle_connection_sharded_inner<
// — `:0` (key absent) is a no-op and must not log.
if matches!(response, Frame::Integer(1)) {
if let Some(ref bytes) = aof_bytes {
- if let Some(ref tx) = ctx.aof_tx { let _ = tx.try_send(AofMessage::Append(bytes.clone())); }
+ if let Some(ref pool) = ctx.aof_pool {
+ let lsn = aof::AofWriterPool::issue_append_lsn(&ctx.repl_state, ctx.shard_id, bytes.len());
+ pool.try_send_append(ctx.shard_id, lsn, bytes.clone());
+ }
}
}
responses.push(response);
@@ -1216,7 +1219,10 @@ pub(crate) async fn handle_connection_sharded_inner<
// — `:0` (key absent / dst exists w/o REPLACE) is a no-op.
if matches!(response, Frame::Integer(1)) {
if let Some(ref bytes) = aof_bytes {
- if let Some(ref tx) = ctx.aof_tx { let _ = tx.try_send(AofMessage::Append(bytes.clone())); }
+ if let Some(ref pool) = ctx.aof_pool {
+ let lsn = aof::AofWriterPool::issue_append_lsn(&ctx.repl_state, ctx.shard_id, bytes.len());
+ pool.try_send_append(ctx.shard_id, lsn, bytes.clone());
+ }
}
}
responses.push(response);
@@ -1427,7 +1433,10 @@ pub(crate) async fn handle_connection_sharded_inner<
}
if let Some(bytes) = aof_bytes {
if !matches!(response, Frame::Error(_)) {
- if let Some(ref tx) = ctx.aof_tx { let _ = tx.try_send(AofMessage::Append(bytes)); }
+ if let Some(ref pool) = ctx.aof_pool {
+ let lsn = aof::AofWriterPool::issue_append_lsn(&ctx.repl_state, ctx.shard_id, bytes.len());
+ pool.try_send_append(ctx.shard_id, lsn, bytes);
+ }
}
}
if conn.tracking_state.enabled && !matches!(response, Frame::Error(_)) {
@@ -1646,9 +1655,20 @@ pub(crate) async fn handle_connection_sharded_inner<
for (meta, target) in reply_futures {
let shard_responses = response_pool.future_for(target).await;
for ((resp_idx, aof_bytes, cmd_name), resp) in meta.into_iter().zip(shard_responses) {
+ // AOF logging for successful remote writes.
+ // Owner shard is `target` (NOT ctx.shard_id) — under PerShard
+ // layout the write must land in the target shard's AOF file
+ // since that shard owns the mutated data. This was the
+ // pre-existing routing bug that motivated the per-shard AOF
+ // RFC (Option B): under TopLevel a single writer absorbed
+ // every cross-shard append, masking the wrong-owner write.
if let Some(bytes) = aof_bytes {
if !matches!(resp, Frame::Error(_)) {
- if let Some(ref tx) = ctx.aof_tx { let _ = tx.try_send(AofMessage::Append(bytes)); }
+ if let Some(ref pool) = ctx.aof_pool {
+ // Cross-shard: LSN sourced for `target`.
+ let lsn = aof::AofWriterPool::issue_append_lsn(&ctx.repl_state, target, bytes.len());
+ pool.try_send_append(target, lsn, bytes);
+ }
}
}
responses[resp_idx] = apply_resp3_conversion(&cmd_name, resp, proto_ver);
diff --git a/src/server/conn/handler_single.rs b/src/server/conn/handler_single.rs
index 4a626cc7..6aa4f3ad 100644
--- a/src/server/conn/handler_single.rs
+++ b/src/server/conn/handler_single.rs
@@ -42,7 +42,10 @@ use crate::server::codec::RespCodec;
/// When `requirepass` is set, clients must authenticate via AUTH before any other
/// commands are accepted (except QUIT).
///
-/// When `aof_tx` is provided, write commands are logged to the AOF file.
+/// When `aof_pool` is provided, write commands are logged via the per-shard
+/// AOF writer pool. handler_single is single-shard mode by definition
+/// (num_shards = 1, shard_id = 0), so the pool is always a TopLevel layout
+/// wrapping a single writer sender — see `AofWriterPool::top_level`.
/// When `change_counter` is provided, write commands increment the counter for auto-save.
///
/// Supports Pub/Sub subscriber mode: when a client subscribes to channels/patterns,
@@ -61,7 +64,7 @@ pub async fn handle_connection(
shutdown: CancellationToken,
requirepass: Option,
config: Arc,
- aof_tx: Option>,
+ aof_pool: Option>,
change_counter: Option>,
pubsub_registry: Arc>,
runtime_config: Arc>,
@@ -608,8 +611,8 @@ pub async fn handle_connection(
}
// BGREWRITEAOF
if cmd.eq_ignore_ascii_case(b"BGREWRITEAOF") {
- let response = if let Some(ref tx) = aof_tx {
- crate::command::persistence::bgrewriteaof_start(tx, db.clone())
+ let response = if let Some(ref pool) = aof_pool {
+ crate::command::persistence::bgrewriteaof_start(pool, db.clone())
} else {
Frame::Error(Bytes::from_static(b"ERR AOF is not enabled"))
};
@@ -655,7 +658,11 @@ pub async fn handle_connection(
// WAL must be durable BEFORE the swap (no rollback
// path for SWAPDB). Try-send first; on failure return
// an error and leave both DBs untouched.
- let wal_ok = if let Some(ref tx) = aof_tx {
+ // Drop down to the pool sender so we can still observe
+ // try_send's Result (the fire-and-forget
+ // pool.try_send_append loses the SendFailed signal we
+ // need to abort the swap cleanly).
+ let wal_ok = if let Some(ref pool) = aof_pool {
let mut a_buf = itoa::Buffer::new();
let mut b_buf = itoa::Buffer::new();
let wal_frame = Frame::Array(crate::framevec![
@@ -671,12 +678,16 @@ pub async fn handle_connection(
crate::persistence::aof::serialize_command(
&wal_frame,
);
- tx.try_send(
- crate::persistence::aof::AofMessage::Append(
- serialized,
- ),
- )
- .is_ok()
+ // Single-shard mode — shard_id = 0.
+ let lsn = crate::persistence::aof::AofWriterPool::issue_append_lsn(&repl_state, 0, serialized.len());
+ pool.sender(0)
+ .try_send(
+ crate::persistence::aof::AofMessage::Append {
+ lsn,
+ bytes: serialized,
+ },
+ )
+ .is_ok()
} else {
true // persistence disabled — no durability requirement
};
@@ -878,8 +889,16 @@ pub async fn handle_connection(
}
// Send AOF entries accumulated so far
for bytes in aof_entries.drain(..) {
- if let Some(ref tx) = aof_tx {
- let _ = tx.send_async(AofMessage::Append(bytes)).await;
+ if let Some(ref pool) = aof_pool {
+ // Single-shard mode (shard_id = 0). send_async
+ // preserves back-pressure semantics from the
+ // pre-pool code; the pool's TopLevel layout
+ // routes to the same single writer.
+ let lsn = crate::persistence::aof::AofWriterPool::issue_append_lsn(&repl_state, 0, bytes.len());
+ let _ = pool
+ .sender(0)
+ .send_async(AofMessage::Append { lsn, bytes })
+ .await;
}
if let Some(ref counter) = change_counter {
counter.fetch_add(1, Ordering::Relaxed);
@@ -1510,8 +1529,14 @@ pub async fn handle_connection(
(resp, records)
};
for record in wal_records {
- if let Some(ref tx) = aof_tx {
- let _ = tx.send_async(AofMessage::Append(bytes::Bytes::from(record))).await;
+ if let Some(ref pool) = aof_pool {
+ // Single-shard mode (shard_id = 0).
+ let bytes = bytes::Bytes::from(record);
+ let lsn = crate::persistence::aof::AofWriterPool::issue_append_lsn(&repl_state, 0, bytes.len());
+ let _ = pool
+ .sender(0)
+ .send_async(AofMessage::Append { lsn, bytes })
+ .await;
}
if let Some(ref counter) = change_counter {
counter.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
@@ -1528,7 +1553,7 @@ pub async fn handle_connection(
let is_write = metadata::is_write(cmd);
// Serialize for AOF before dispatch
- let aof_bytes = if is_write && aof_tx.is_some() {
+ let aof_bytes = if is_write && aof_pool.is_some() {
let mut buf = BytesMut::new();
crate::protocol::serialize::serialize(&frame, &mut buf);
Some(buf.freeze())
@@ -2232,8 +2257,14 @@ pub async fn handle_connection(
// --- Send AOF entries OUTSIDE the lock ---
for bytes in aof_entries {
- if let Some(ref tx) = aof_tx {
- let _ = tx.send_async(AofMessage::Append(bytes)).await;
+ if let Some(ref pool) = aof_pool {
+ // Single-shard mode (shard_id = 0). send_async preserves
+ // back-pressure semantics from the pre-pool code.
+ let lsn = crate::persistence::aof::AofWriterPool::issue_append_lsn(&repl_state, 0, bytes.len());
+ let _ = pool
+ .sender(0)
+ .send_async(AofMessage::Append { lsn, bytes })
+ .await;
}
if let Some(ref counter) = change_counter {
counter.fetch_add(1, Ordering::Relaxed);
diff --git a/src/server/conn/tests.rs b/src/server/conn/tests.rs
index 2f6c74e1..8a175eaf 100644
--- a/src/server/conn/tests.rs
+++ b/src/server/conn/tests.rs
@@ -31,7 +31,7 @@ fn test_inline_get_hit() {
}
let mut read_buf = BytesMut::from(&b"*2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n"[..]);
let mut write_buf = BytesMut::new();
- let aof_tx: Option> = None;
+ let aof_pool: Option> = None;
let rt_config = make_rt_config();
let result = try_inline_dispatch(
@@ -40,7 +40,8 @@ fn test_inline_get_hit() {
&dbs,
0,
0,
- &aof_tx,
+ &aof_pool,
+ &None,
0,
1,
false,
@@ -56,7 +57,7 @@ fn test_inline_get_miss() {
let dbs = make_dbs();
let mut read_buf = BytesMut::from(&b"*2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n"[..]);
let mut write_buf = BytesMut::new();
- let aof_tx: Option> = None;
+ let aof_pool: Option> = None;
let rt_config = make_rt_config();
let result = try_inline_dispatch(
@@ -65,7 +66,8 @@ fn test_inline_get_miss() {
&dbs,
0,
0,
- &aof_tx,
+ &aof_pool,
+ &None,
0,
1,
false,
@@ -84,7 +86,7 @@ fn test_inline_set_falls_through_when_writes_disabled() {
let mut read_buf = BytesMut::from(&cmd[..]);
let original_len = read_buf.len();
let mut write_buf = BytesMut::new();
- let aof_tx: Option> = None;
+ let aof_pool: Option> = None;
let rt_config = make_rt_config();
let result = try_inline_dispatch(
@@ -93,7 +95,8 @@ fn test_inline_set_falls_through_when_writes_disabled() {
&dbs,
0,
0,
- &aof_tx,
+ &aof_pool,
+ &None,
0,
1,
false,
@@ -111,7 +114,7 @@ fn test_inline_set_executes_when_writes_enabled() {
let cmd = b"*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n";
let mut read_buf = BytesMut::from(&cmd[..]);
let mut write_buf = BytesMut::new();
- let aof_tx: Option> = None;
+ let aof_pool: Option> = None;
let rt_config = make_rt_config();
let result = try_inline_dispatch(
@@ -120,7 +123,8 @@ fn test_inline_set_executes_when_writes_enabled() {
&dbs,
0,
0,
- &aof_tx,
+ &aof_pool,
+ &None,
0,
1,
true,
@@ -150,7 +154,7 @@ fn test_inline_set_with_options_falls_through() {
let mut read_buf = BytesMut::from(&cmd[..]);
let original_len = read_buf.len();
let mut write_buf = BytesMut::new();
- let aof_tx: Option> = None;
+ let aof_pool: Option> = None;
let rt_config = make_rt_config();
let result = try_inline_dispatch(
@@ -159,7 +163,8 @@ fn test_inline_set_with_options_falls_through() {
&dbs,
0,
0,
- &aof_tx,
+ &aof_pool,
+ &None,
0,
1,
true,
@@ -176,7 +181,7 @@ fn test_inline_fallthrough() {
let mut read_buf = BytesMut::from(&ping_cmd[..]);
let original_len = read_buf.len();
let mut write_buf = BytesMut::new();
- let aof_tx: Option> = None;
+ let aof_pool: Option> = None;
let rt_config = make_rt_config();
let result = try_inline_dispatch(
@@ -185,7 +190,8 @@ fn test_inline_fallthrough() {
&dbs,
0,
0,
- &aof_tx,
+ &aof_pool,
+ &None,
0,
1,
false,
@@ -211,7 +217,7 @@ fn test_inline_mixed_batch() {
read_buf.extend_from_slice(b"*2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n");
read_buf.extend_from_slice(b"*1\r\n$4\r\nPING\r\n");
let mut write_buf = BytesMut::new();
- let aof_tx: Option> = None;
+ let aof_pool: Option> = None;
let rt_config = make_rt_config();
// Inline loop should process GET but leave PING
@@ -221,7 +227,8 @@ fn test_inline_mixed_batch() {
&dbs,
0,
0,
- &aof_tx,
+ &aof_pool,
+ &None,
0,
1,
false,
@@ -244,7 +251,7 @@ fn test_inline_case_insensitive() {
}
let mut read_buf = BytesMut::from(&b"*2\r\n$3\r\nget\r\n$3\r\nfoo\r\n"[..]);
let mut write_buf = BytesMut::new();
- let aof_tx: Option> = None;
+ let aof_pool: Option> = None;
let rt_config = make_rt_config();
let result = try_inline_dispatch(
@@ -253,7 +260,8 @@ fn test_inline_case_insensitive() {
&dbs,
0,
0,
- &aof_tx,
+ &aof_pool,
+ &None,
0,
1,
false,
@@ -271,7 +279,7 @@ fn test_inline_partial() {
let mut read_buf = BytesMut::from(&b"*2\r\n$3\r\nGET\r\n$3\r\n"[..]);
let original_len = read_buf.len();
let mut write_buf = BytesMut::new();
- let aof_tx: Option> = None;
+ let aof_pool: Option> = None;
let rt_config = make_rt_config();
let result = try_inline_dispatch(
@@ -280,7 +288,8 @@ fn test_inline_partial() {
&dbs,
0,
0,
- &aof_tx,
+ &aof_pool,
+ &None,
0,
1,
false,
@@ -296,7 +305,8 @@ fn test_inline_set_with_aof_falls_through_when_writes_disabled() {
// SET falls through when can_inline_writes=false even with AOF.
let dbs = make_dbs();
let (aof_sender, _aof_receiver) = channel::mpsc_bounded::(16);
- let aof_tx: Option> = Some(aof_sender);
+ let aof_pool: Option> =
+ Some(crate::persistence::aof::AofWriterPool::top_level(aof_sender));
let cmd = b"*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n";
let mut read_buf = BytesMut::from(&cmd[..]);
let original_len = read_buf.len();
@@ -310,7 +320,8 @@ fn test_inline_set_with_aof_falls_through_when_writes_disabled() {
&dbs,
0,
0,
- &aof_tx,
+ &aof_pool,
+ &None,
0,
1,
false,
@@ -342,7 +353,7 @@ fn test_inline_multiple_gets() {
read_buf.extend_from_slice(b"*2\r\n$3\r\nGET\r\n$1\r\na\r\n");
read_buf.extend_from_slice(b"*2\r\n$3\r\nGET\r\n$1\r\nb\r\n");
let mut write_buf = BytesMut::new();
- let aof_tx: Option> = None;
+ let aof_pool: Option> = None;
let rt_config = make_rt_config();
let total = try_inline_dispatch_loop(
@@ -351,7 +362,8 @@ fn test_inline_multiple_gets() {
&dbs,
0,
0,
- &aof_tx,
+ &aof_pool,
+ &None,
0,
1,
false,
diff --git a/src/server/conn_state.rs b/src/server/conn_state.rs
index 674857db..d2946f36 100644
--- a/src/server/conn_state.rs
+++ b/src/server/conn_state.rs
@@ -8,7 +8,7 @@ use ringbuf::HeapProd;
use crate::blocking::BlockingRegistry;
use crate::cluster::ClusterState;
use crate::config::{RuntimeConfig, ServerConfig};
-use crate::persistence::aof::AofMessage;
+use crate::persistence::aof::AofWriterPool;
use crate::protocol::Frame;
use crate::pubsub::PubSubRegistry;
use crate::replication::state::ReplicationState;
@@ -38,7 +38,12 @@ pub struct ConnectionContext {
pub blocking_registry: Rc>,
pub shutdown: CancellationToken,
pub requirepass: Option,
- pub aof_tx: Option>,
+ /// Per-shard AOF writer pool. **Sole AOF interface** after the
+ /// 2d/2e migration sequence. Built by spawn sites in `shard/conn_accept.rs`
+ /// from the manifest layout — TopLevel wraps a single sender,
+ /// PerShard owns one sender per shard. Step 2f flips spawn sites
+ /// to construct PerShard pools when the on-disk manifest demands it.
+ pub aof_pool: Option>,
pub tracking_table: Rc>,
pub repl_state: Option>>,
pub cluster_state: Option>>,
diff --git a/src/server/embedded.rs b/src/server/embedded.rs
index 55a2b3fd..e1620fa3 100644
--- a/src/server/embedded.rs
+++ b/src/server/embedded.rs
@@ -48,7 +48,7 @@ use parking_lot::RwLock;
use tracing::info;
use crate::config::ServerConfig;
-use crate::persistence::aof::{self, AofMessage, FsyncPolicy};
+use crate::persistence::aof::{self, AofMessage, AofWriterPool, FsyncPolicy};
use crate::runtime::cancel::CancellationToken;
use crate::runtime::channel;
use crate::runtime::{RuntimeFactoryImpl, traits::RuntimeFactory};
@@ -113,9 +113,12 @@ pub async fn run_embedded(
// AOF writer: dedicated std::thread (matches main.rs lifetime model).
// We retain the JoinHandle so shutdown can wait for the writer to finish
// flushing — dropping it would race the process exit and risk losing the
- // final fsync (CodeRabbit #1).
- let (aof_tx, aof_join): (
- Option>,
+ // final fsync (CodeRabbit #1). Step 2f-α: wrap the sender in a TopLevel
+ // `AofWriterPool` so every shard receives an Arc clone with a uniform
+ // API. Channel close still drives writer termination — dropping the last
+ // Arc drops the pool, which drops the underlying senders.
+ let (aof_pool, aof_join): (
+ Option>,
Option>,
) = if config.appendonly == "yes" {
let (tx, rx) = channel::mpsc_bounded::(10_000);
@@ -132,7 +135,7 @@ pub async fn run_embedded(
})
.context("embedded moon: failed to spawn AOF writer thread")?;
info!("embedded moon: AOF enabled (fsync: {:?})", fsync);
- (Some(tx), Some(handle))
+ (Some(AofWriterPool::top_level(tx)), Some(handle))
} else {
(None, None)
};
@@ -263,7 +266,7 @@ pub async fn run_embedded(
let consumers = mesh.take_consumers(id);
let conn_rx = mesh.take_conn_rx(id);
let shard_cancel = cancel.clone();
- let shard_aof_tx = aof_tx.clone();
+ let shard_aof_pool = aof_pool.clone();
let shard_bind_addr = bind_addr.clone();
let shard_persistence_dir = persistence_dir.clone();
let shard_snap_rx = snap_rx.clone();
@@ -293,7 +296,7 @@ pub async fn run_embedded(
consumers,
producers,
shard_cancel,
- shard_aof_tx,
+ shard_aof_pool,
Some(shard_bind_addr),
shard_persistence_dir,
shard_snap_rx,
@@ -354,10 +357,12 @@ pub async fn run_embedded(
// Listener exited (cancel fired or fatal error). Shutdown ordering:
// 1. cancel.cancel() — stops shard accept loops + producers.
- // 2. Join shard threads — drops every shard-held `aof_tx` clone.
- // 3. Drop our outer `aof_tx` — last sender goes away, the AOF writer's
- // `recv_async()` returns `Err(_)` and the task flushes + fsyncs
- // before exiting (see `aof::aof_writer_task` Err arm).
+ // 2. Join shard threads — drops every shard-held `Arc`
+ // clone (each `ConnectionContext` and each `shard_aof_pool` capture).
+ // 3. Drop our outer `aof_pool` Arc — the last reference goes away, the
+ // pool's `Drop` runs, dropping the underlying `Vec`. The
+ // AOF writer's `recv_async()` returns `Err(_)` and the task flushes
+ // + fsyncs before exiting (see `aof::aof_writer_task` Err arm).
// 4. Join the AOF thread.
//
// This sequencing fixes Qodo bug #5: sending `AofMessage::Shutdown` before
@@ -381,8 +386,9 @@ pub async fn run_embedded(
}
}
- // Drop the last AOF sender so the writer's recv loop sees channel close.
- drop(aof_tx);
+ // Drop the last `Arc` so the underlying senders close,
+ // triggering the writer's recv loop to drain + fsync + exit.
+ drop(aof_pool);
let aof_panic = if let Some(handle) = aof_join {
match handle.join() {
diff --git a/src/server/listener.rs b/src/server/listener.rs
index 607aef54..37300cf2 100644
--- a/src/server/listener.rs
+++ b/src/server/listener.rs
@@ -13,7 +13,7 @@ use tracing::{debug, error, info};
use crate::command::connection as conn_cmd;
use crate::config::ServerConfig;
#[cfg(feature = "runtime-tokio")]
-use crate::persistence::aof::{self, AofMessage, FsyncPolicy};
+use crate::persistence::aof::{self, AofMessage, AofWriterPool, FsyncPolicy};
#[cfg(feature = "runtime-tokio")]
use crate::persistence::rdb;
#[cfg(feature = "runtime-tokio")]
@@ -115,15 +115,17 @@ pub async fn run_with_shutdown(
let config = Arc::new(config);
- // Set up AOF writer task if appendonly is enabled
- let aof_tx: Option> = if config.appendonly == "yes" {
+ // Set up AOF writer task + wrap the sender as a TopLevel `AofWriterPool`
+ // (step 2f-α). Single-shard tokio path: `handler_single` receives the pool
+ // directly; per-shard layout is unreachable from this listener.
+ let aof_pool: Option> = if config.appendonly == "yes" {
let (tx, rx) = channel::mpsc_bounded::(10_000);
let aof_token = token.child_token();
let fsync = FsyncPolicy::from_str(&config.appendfsync);
let aof_file_path = PathBuf::from(&config.dir).join(&config.appendfilename);
tokio::spawn(aof::aof_writer_task(rx, aof_file_path, fsync, aof_token));
info!("AOF enabled with fsync policy: {:?}", fsync);
- Some(tx)
+ Some(AofWriterPool::top_level(tx))
} else {
None
};
@@ -235,7 +237,7 @@ pub async fn run_with_shutdown(
let conn_token = token.child_token();
let requirepass = config.requirepass.clone();
let config = config.clone();
- let aof_tx = aof_tx.clone();
+ let aof_pool_conn = aof_pool.clone();
let change_counter = Some(change_counter.clone());
let pubsub = pubsub_registry.clone();
let rt_config = runtime_config.clone();
@@ -249,7 +251,7 @@ pub async fn run_with_shutdown(
let gs = graph_store.clone();
tokio::spawn(connection::handle_connection(
stream, db, conn_token, requirepass, config,
- aof_tx, change_counter, pubsub, rt_config,
+ aof_pool_conn, change_counter, pubsub, rt_config,
tracking, cid, Some(rs), acl, Some(vs),
Some(ts),
#[cfg(feature = "graph")]
@@ -263,9 +265,12 @@ pub async fn run_with_shutdown(
}
_ = token.cancelled() => {
info!("Server shutting down");
- // Send shutdown to AOF writer
- if let Some(ref tx) = aof_tx {
- let _ = tx.send_async(AofMessage::Shutdown).await;
+ // Fan out shutdown to every AOF writer (single writer under TopLevel,
+ // one-per-shard under PerShard — step 2f-β). `broadcast_shutdown`
+ // is `try_send`-based so the writer must be draining; under tokio
+ // listener it always is.
+ if let Some(ref pool) = aof_pool {
+ pool.broadcast_shutdown();
}
break;
}
diff --git a/src/shard/conn_accept.rs b/src/shard/conn_accept.rs
index 432d65aa..006fc885 100644
--- a/src/shard/conn_accept.rs
+++ b/src/shard/conn_accept.rs
@@ -104,7 +104,7 @@ pub(crate) fn spawn_tokio_connection(
pubsub_arc: &Arc>,
blocking_rc: &Rc>,
shutdown: &CancellationToken,
- aof_tx: &Option>,
+ aof_pool: &Option>,
tracking_rc: &Rc>,
lua_rc: &Rc>>>,
script_cache_rc: &Rc>,
@@ -134,7 +134,6 @@ pub(crate) fn spawn_tokio_connection(
let psr = pubsub_arc.clone();
let blk = blocking_rc.clone();
let sd = shutdown.clone();
- let aof = aof_tx.clone();
let trk = tracking_rc.clone();
let cid = conn_cmd::next_client_id();
let rs = repl_state.clone();
@@ -170,7 +169,10 @@ pub(crate) fn spawn_tokio_connection(
set_tcp_keepalive(tcp_stream.as_raw_fd(), tcp_keepalive_secs);
}
- // Construct ConnectionContext from cloned shared state
+ // Construct ConnectionContext from cloned shared state. The pool is
+ // already built by the spawn site (main.rs / listener.rs / embedded.rs)
+ // and threaded through `Shard::run` — we just clone the Arc once here.
+ let pool_for_ctx = aof_pool.as_ref().map(Arc::clone);
let conn_ctx = crate::server::conn::ConnectionContext::new(
sdbs,
shard_id,
@@ -178,7 +180,7 @@ pub(crate) fn spawn_tokio_connection(
psr,
blk,
reqpass,
- aof,
+ pool_for_ctx,
trk,
rs,
cs,
@@ -272,7 +274,7 @@ pub(crate) fn spawn_migrated_tokio_connection(
pubsub_arc: &Arc>,
blocking_rc: &Rc>,
shutdown: &CancellationToken,
- aof_tx: &Option>,
+ aof_pool: &Option>,
tracking_rc: &Rc>,
lua_rc: &Rc>>>,
script_cache_rc: &Rc>,
@@ -326,7 +328,6 @@ pub(crate) fn spawn_migrated_tokio_connection(
let psr = pubsub_arc.clone();
let blk = blocking_rc.clone();
let sd = shutdown.clone();
- let aof = aof_tx.clone();
let trk = tracking_rc.clone();
let cid = state.client_id;
let rs = repl_state.clone();
@@ -354,6 +355,8 @@ pub(crate) fn spawn_migrated_tokio_connection(
let migration_buf = take_migration_read_buf(&mut state);
+ // Pool is built by the spawn site and threaded through here.
+ let pool_for_ctx = aof_pool.as_ref().map(Arc::clone);
let conn_ctx = crate::server::conn::ConnectionContext::new(
sdbs,
shard_id,
@@ -361,7 +364,7 @@ pub(crate) fn spawn_migrated_tokio_connection(
psr,
blk,
None, // requirepass: None = pre-authenticated
- aof,
+ pool_for_ctx,
trk,
rs,
cs,
@@ -421,7 +424,7 @@ pub(crate) fn spawn_monoio_connection(
pubsub_arc: &Arc>,
blocking_rc: &Rc>,
shutdown: &CancellationToken,
- aof_tx: &Option>,
+ aof_pool: &Option>,
tracking_rc: &Rc>,
lua_rc: &Rc>>>,
script_cache_rc: &Rc>,
@@ -464,7 +467,6 @@ pub(crate) fn spawn_monoio_connection(
let psr = pubsub_arc.clone();
let blk = blocking_rc.clone();
let sd = shutdown.clone();
- let aof = aof_tx.clone();
let trk = tracking_rc.clone();
let cid = conn_cmd::next_client_id();
let rs = repl_state.clone();
@@ -500,12 +502,14 @@ pub(crate) fn spawn_monoio_connection(
.map(|a| a.to_string())
.unwrap_or_else(|_| "unknown".to_string());
- // Construct ConnectionContext from cloned shared state
+ // Construct ConnectionContext from cloned shared state. Pool is
+ // built by the spawn site and threaded through here.
let reqpass = rtcfg.read().requirepass.clone();
+ let pool_for_ctx = aof_pool.as_ref().map(Arc::clone);
let conn_ctx = crate::server::conn::ConnectionContext::new(
- sdbs, shard_id, num_shards, psr, blk, reqpass, aof, trk, rs, cs, lua, sc, cp, acl,
- rtcfg, scfg, dtx, notifiers, snap_tx, clk, rsm, all_regs, all_rsm, aff, spill_tx,
- spill_fid, do_dir,
+ sdbs, shard_id, num_shards, psr, blk, reqpass, pool_for_ctx, trk, rs, cs, lua,
+ sc, cp, acl, rtcfg, scfg, dtx, notifiers, snap_tx, clk, rsm, all_regs, all_rsm,
+ aff, spill_tx, spill_fid, do_dir,
);
let maxclients = conn_ctx.runtime_config.read().maxclients;
@@ -713,7 +717,7 @@ pub(crate) fn spawn_migrated_monoio_connection(
pubsub_arc: &Arc>,
blocking_rc: &Rc>,
shutdown: &CancellationToken,
- aof_tx: &Option>,
+ aof_pool: &Option>,
tracking_rc: &Rc>,
lua_rc: &Rc>>>,
script_cache_rc: &Rc>,
@@ -766,7 +770,6 @@ pub(crate) fn spawn_migrated_monoio_connection(
let psr = pubsub_arc.clone();
let blk = blocking_rc.clone();
let sd = shutdown.clone();
- let aof = aof_tx.clone();
let trk = tracking_rc.clone();
let cid = state.client_id;
let rs = repl_state.clone();
@@ -802,11 +805,13 @@ pub(crate) fn spawn_migrated_monoio_connection(
let migration_buf = take_migration_read_buf(&mut state);
+ // Pool is built by the spawn site and threaded through here.
+ let pool_for_ctx = aof_pool.as_ref().map(Arc::clone);
let conn_ctx = crate::server::conn::ConnectionContext::new(
sdbs, shard_id, num_shards, psr, blk,
None, // requirepass: None = pre-authenticated
- aof, trk, rs, cs, lua, sc, cp, acl, rtcfg, scfg, dtx, notifiers, snap_tx, clk, rsm,
- all_regs, all_rsm, aff, spill_tx, spill_fid, do_dir,
+ pool_for_ctx, trk, rs, cs, lua, sc, cp, acl, rtcfg, scfg, dtx, notifiers,
+ snap_tx, clk, rsm, all_regs, all_rsm, aff, spill_tx, spill_fid, do_dir,
);
monoio::spawn(async move {
diff --git a/src/shard/event_loop.rs b/src/shard/event_loop.rs
index 2b9de27d..5198d930 100644
--- a/src/shard/event_loop.rs
+++ b/src/shard/event_loop.rs
@@ -56,7 +56,7 @@ impl super::Shard {
mut consumers: Vec>,
producers: Vec>,
shutdown: CancellationToken,
- aof_tx: Option>,
+ aof_pool: Option>,
bind_addr: Option,
persistence_dir: Option,
snapshot_trigger_rx: channel::WatchReceiver,
@@ -1047,7 +1047,7 @@ impl super::Shard {
conn_accept::spawn_tokio_connection(
tcp_stream, false, &tls_config,
&shard_databases, &dispatch_tx, &pubsub_arc, &blocking_rc,
- &shutdown, &aof_tx, &tracking_rc, &lua_rc, &script_cache_rc,
+ &shutdown, &aof_pool, &tracking_rc, &lua_rc, &script_cache_rc,
&acl_table, &runtime_config, &server_config, &all_notifiers,
&snapshot_trigger_tx, &repl_state, &cluster_state,
&cached_clock, &remote_sub_map_arc, &all_pubsub_registries,
@@ -1097,7 +1097,7 @@ impl super::Shard {
conn_accept::spawn_tokio_connection(
tcp_stream, is_tls, &tls_config,
&shard_databases, &dispatch_tx, &pubsub_arc, &blocking_rc,
- &shutdown, &aof_tx, &tracking_rc, &lua_rc, &script_cache_rc,
+ &shutdown, &aof_pool, &tracking_rc, &lua_rc, &script_cache_rc,
&acl_table, &runtime_config, &server_config, &all_notifiers,
&snapshot_trigger_tx, &repl_state, &cluster_state,
&cached_clock, &remote_sub_map_arc, &all_pubsub_registries,
@@ -1180,7 +1180,7 @@ impl super::Shard {
conn_accept::spawn_migrated_tokio_connection(
fd, state,
&shard_databases, &dispatch_tx, &pubsub_arc, &blocking_rc,
- &shutdown, &aof_tx, &tracking_rc, &lua_rc, &script_cache_rc,
+ &shutdown, &aof_pool, &tracking_rc, &lua_rc, &script_cache_rc,
&acl_table, &runtime_config, &server_config, &all_notifiers,
&snapshot_trigger_tx, &repl_state, &cluster_state,
&cached_clock, &remote_sub_map_arc, &all_pubsub_registries,
@@ -1193,7 +1193,7 @@ impl super::Shard {
conn_accept::spawn_migrated_monoio_connection(
fd, state,
&shard_databases, &dispatch_tx, &pubsub_arc, &blocking_rc,
- &shutdown, &aof_tx, &tracking_rc, &lua_rc, &script_cache_rc,
+ &shutdown, &aof_pool, &tracking_rc, &lua_rc, &script_cache_rc,
&acl_table, &runtime_config, &server_config, &all_notifiers,
&snapshot_trigger_tx, &repl_state, &cluster_state,
&cached_clock, &remote_sub_map_arc, &all_pubsub_registries,
@@ -1277,7 +1277,7 @@ impl super::Shard {
conn_accept::spawn_migrated_tokio_connection(
fd, state,
&shard_databases, &dispatch_tx, &pubsub_arc, &blocking_rc,
- &shutdown, &aof_tx, &tracking_rc, &lua_rc, &script_cache_rc,
+ &shutdown, &aof_pool, &tracking_rc, &lua_rc, &script_cache_rc,
&acl_table, &runtime_config, &server_config, &all_notifiers,
&snapshot_trigger_tx, &repl_state, &cluster_state,
&cached_clock, &remote_sub_map_arc, &all_pubsub_registries,
@@ -1290,7 +1290,7 @@ impl super::Shard {
conn_accept::spawn_migrated_monoio_connection(
fd, state,
&shard_databases, &dispatch_tx, &pubsub_arc, &blocking_rc,
- &shutdown, &aof_tx, &tracking_rc, &lua_rc, &script_cache_rc,
+ &shutdown, &aof_pool, &tracking_rc, &lua_rc, &script_cache_rc,
&acl_table, &runtime_config, &server_config, &all_notifiers,
&snapshot_trigger_tx, &repl_state, &cluster_state,
&cached_clock, &remote_sub_map_arc, &all_pubsub_registries,
@@ -1697,7 +1697,7 @@ impl super::Shard {
&pubsub_arc,
&blocking_rc,
&shutdown,
- &aof_tx,
+ &aof_pool,
&tracking_rc,
&lua_rc,
&script_cache_rc,
@@ -1739,7 +1739,7 @@ impl super::Shard {
&pubsub_arc,
&blocking_rc,
&shutdown,
- &aof_tx,
+ &aof_pool,
&tracking_rc,
&lua_rc,
&script_cache_rc,
@@ -1921,7 +1921,7 @@ impl super::Shard {
&pubsub_arc,
&blocking_rc,
&shutdown,
- &aof_tx,
+ &aof_pool,
&tracking_rc,
&lua_rc,
&script_cache_rc,
diff --git a/tests/ft_search_multi_shard_as_of.rs b/tests/ft_search_multi_shard_as_of.rs
index c8f8fef1..c2f93f11 100644
--- a/tests/ft_search_multi_shard_as_of.rs
+++ b/tests/ft_search_multi_shard_as_of.rs
@@ -46,6 +46,7 @@ fn build_config(port: u16, num_shards: usize) -> ServerConfig {
databases: 16,
requirepass: None,
appendonly: "no".to_string(),
+ unsafe_multishard_aof: false,
appendfsync: "everysec".to_string(),
save: None,
dir: ".".to_string(),
diff --git a/tests/ft_search_temporal_parity.rs b/tests/ft_search_temporal_parity.rs
index 065de5f9..1165f81a 100644
--- a/tests/ft_search_temporal_parity.rs
+++ b/tests/ft_search_temporal_parity.rs
@@ -58,6 +58,7 @@ fn build_config(port: u16, num_shards: usize) -> ServerConfig {
databases: 16,
requirepass: None,
appendonly: "no".to_string(),
+ unsafe_multishard_aof: false,
appendfsync: "everysec".to_string(),
save: None,
dir: ".".to_string(),
diff --git a/tests/integration.rs b/tests/integration.rs
index 1e6b51c8..d3c920ee 100644
--- a/tests/integration.rs
+++ b/tests/integration.rs
@@ -32,6 +32,7 @@ async fn start_server() -> (u16, CancellationToken) {
databases: 16,
requirepass: None,
appendonly: "no".to_string(),
+ unsafe_multishard_aof: false,
appendfsync: "everysec".to_string(),
save: None,
dir: ".".to_string(),
@@ -131,6 +132,7 @@ async fn start_server_with_pass(password: &str) -> (u16, CancellationToken) {
databases: 16,
requirepass: Some(password.to_string()),
appendonly: "no".to_string(),
+ unsafe_multishard_aof: false,
appendfsync: "everysec".to_string(),
save: None,
dir: ".".to_string(),
@@ -1302,6 +1304,7 @@ async fn start_server_with_persistence(
databases: 16,
requirepass: None,
appendonly: appendonly.to_string(),
+ unsafe_multishard_aof: false,
appendfsync: appendfsync.to_string(),
save: None,
dir: dir.to_string_lossy().to_string(),
@@ -2185,6 +2188,7 @@ async fn start_server_with_maxmemory(maxmemory: usize, policy: &str) -> (u16, Ca
databases: 16,
requirepass: None,
appendonly: "no".to_string(),
+ unsafe_multishard_aof: false,
appendfsync: "everysec".to_string(),
save: None,
dir: ".".to_string(),
@@ -2595,6 +2599,7 @@ async fn start_sharded_server(num_shards: usize) -> (u16, CancellationToken) {
databases: 16,
requirepass: None,
appendonly: "no".to_string(),
+ unsafe_multishard_aof: false,
appendfsync: "everysec".to_string(),
save: None,
dir: ".".to_string(),
@@ -3785,6 +3790,7 @@ async fn start_cluster_server() -> (u16, CancellationToken) {
databases: 16,
requirepass: None,
appendonly: "no".to_string(),
+ unsafe_multishard_aof: false,
appendfsync: "everysec".to_string(),
save: None,
dir: ".".to_string(),
@@ -4446,6 +4452,7 @@ async fn start_server_with_aclfile(acl_path: &str) -> (u16, CancellationToken) {
databases: 16,
requirepass: None,
appendonly: "no".to_string(),
+ unsafe_multishard_aof: false,
appendfsync: "everysec".to_string(),
save: None,
dir: ".".to_string(),
diff --git a/tests/kill_snapshot.rs b/tests/kill_snapshot.rs
index 0a96c81f..4141b13c 100644
--- a/tests/kill_snapshot.rs
+++ b/tests/kill_snapshot.rs
@@ -28,6 +28,7 @@ fn base_config(port: u16, num_shards: usize) -> ServerConfig {
databases: 1,
requirepass: None,
appendonly: "no".to_string(),
+ unsafe_multishard_aof: false,
appendfsync: "no".to_string(),
save: None,
dir: ".".to_string(),
diff --git a/tests/mq_integration.rs b/tests/mq_integration.rs
index 6423a8e8..0ed42b5d 100644
--- a/tests/mq_integration.rs
+++ b/tests/mq_integration.rs
@@ -44,6 +44,7 @@ async fn start_mq_server(num_shards: usize) -> (u16, CancellationToken) {
databases: 16,
requirepass: None,
appendonly: "no".to_string(),
+ unsafe_multishard_aof: false,
appendfsync: "everysec".to_string(),
save: None,
dir: ".".to_string(),
diff --git a/tests/replication_test.rs b/tests/replication_test.rs
index 3e36686c..634ee733 100644
--- a/tests/replication_test.rs
+++ b/tests/replication_test.rs
@@ -30,6 +30,7 @@ async fn start_server() -> (u16, CancellationToken) {
databases: 16,
requirepass: None,
appendonly: "no".to_string(),
+ unsafe_multishard_aof: false,
appendfsync: "everysec".to_string(),
save: None,
dir: dir_path,
diff --git a/tests/txn_ft_search_snapshot.rs b/tests/txn_ft_search_snapshot.rs
index f442afe1..edbdc710 100644
--- a/tests/txn_ft_search_snapshot.rs
+++ b/tests/txn_ft_search_snapshot.rs
@@ -46,6 +46,7 @@ fn build_config(port: u16, num_shards: usize) -> ServerConfig {
databases: 16,
requirepass: None,
appendonly: "no".to_string(),
+ unsafe_multishard_aof: false,
appendfsync: "everysec".to_string(),
save: None,
dir: ".".to_string(),
diff --git a/tests/txn_kv_wiring.rs b/tests/txn_kv_wiring.rs
index 4a7ec97c..00921f8e 100644
--- a/tests/txn_kv_wiring.rs
+++ b/tests/txn_kv_wiring.rs
@@ -49,6 +49,7 @@ async fn start_txn_server(num_shards: usize, persistence_dir: &str) -> (u16, Can
databases: 16,
requirepass: None,
appendonly,
+ unsafe_multishard_aof: false,
appendfsync: "everysec".to_string(),
save: None,
dir,
diff --git a/tests/vacuum_commands.rs b/tests/vacuum_commands.rs
index 4b37fc59..2d861220 100644
--- a/tests/vacuum_commands.rs
+++ b/tests/vacuum_commands.rs
@@ -35,6 +35,7 @@ fn base_config(port: u16) -> ServerConfig {
databases: 1,
requirepass: None,
appendonly: "no".to_string(),
+ unsafe_multishard_aof: false,
appendfsync: "no".to_string(),
save: None,
dir: ".".to_string(),
diff --git a/tests/workspace_integration.rs b/tests/workspace_integration.rs
index fb833edd..08e1e0c2 100644
--- a/tests/workspace_integration.rs
+++ b/tests/workspace_integration.rs
@@ -37,6 +37,7 @@ async fn start_workspace_server(num_shards: usize) -> (u16, CancellationToken) {
databases: 16,
requirepass: None,
appendonly: "no".to_string(),
+ unsafe_multishard_aof: false,
appendfsync: "everysec".to_string(),
save: None,
dir: ".".to_string(),
@@ -254,6 +255,7 @@ async fn start_workspace_server_with_auth(
databases: 16,
requirepass: Some(password),
appendonly: "no".to_string(),
+ unsafe_multishard_aof: false,
appendfsync: "everysec".to_string(),
save: None,
dir: ".".to_string(),