diff --git a/CHANGELOG.md b/CHANGELOG.md index 1be8c9d0..9a6e1798 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -6,10 +6,36 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [0.2.0-alpha] — Unreleased -First slice of the v0.2 "Option C" beachhead: **Point-in-Time Recovery (PITR)** -and **Change Data Capture (CDC)**, built additively on top of the existing -per-shard WAL v3 + dual-root manifest. No changes to the KV hot path, MVCC, -page format, or transaction layer. +The v0.2 enterprise beachhead. Built additively on per-shard WAL v3 + the +dual-root manifest; no changes to the KV hot path, MVCC, page format, or +transaction layer. + +**Headline capabilities landed in alpha:** + +- **Point-in-Time Recovery (PITR)** — `--recovery-target-lsn` / + `--recovery-target-time` restore to any LSN or wall-clock boundary + inside the WAL retention window. +- **Change Data Capture (CDC)** — `CDC.READ` polling command with + Debezium-compatible JSON envelopes, resumable cursors, segment-rotation + safety. +- **Hash-field TTL** — full Valkey 9.0 / 9.1 surface (`HEXPIRE` / + `HPEXPIRE` / `HEXPIREAT` / `HPEXPIREAT` / `HEXPIRETIME` / `HPEXPIRETIME` + / `HTTL` / `HPTTL` / `HPERSIST` / `HGETDEL` / `HGETEX`) with O(1) HGET + + HLEN fast path. Three-way benchmark vs Redis 8.0.2 / Valkey 9.1.0 + ships in `docs/perf/2026-05-27-hash-ttl-3way-bench.md`. +- **Tier 2 Lane A** — `SWAPDB`, `MOVE`, `COPY ... DB n`, + `CLUSTER REPLICAS` / `SLAVES`, `CLUSTER COUNT-FAILURE-REPORTS`. All + WAL-durable with cross-shard atomic semantics. +- **Storage format v1 commitment** — RDB v2 + WAL v3 + multi-part AOF + manifest grouped under a single `--storage-format v1` umbrella with + ≥18-month LTS forward-read guarantees. +- **Embedded sharded server** — `server::embedded::run_embedded(config, + cancel)` exposes the full sharded handler (with `TXN.*`) to in-process + embedders. + +**What is not yet in alpha:** PITR live-snapshot LSN wiring (P3c), +`CDC.SUBSCRIBE` push channel (C3b), and the multi-shard master PSYNC +deferred from v0.1.10. Tracked in `.planning/rfcs/v02-enterprise-architecture.md`. ### Docs — Hash-field TTL three-way benchmark suite (PR #127) @@ -427,6 +453,96 @@ See `docs/guides/cdc.md` for consumer integration. and the benchmark gates (PITR restart ±10%, CDC ≥100K events/s/shard, write p99 ±5%). +## [0.1.12] — 2026-05-12 + +Performance & memory observability release. 50 commits since v0.1.11, no +public API breaks, no on-disk format change. Validated on OrbStack `moon-dev` +(2026-05-12) and locally green for both `runtime-monoio` and +`runtime-tokio,jemalloc`. + +### Performance — DashTable hot-path (Phase 189, PERF-07 + PERF-09) + +- **Pre-sized DashTable.** `DashTable::with_capacity()` plus the new + `--initial-keyspace-hint ` flag size the segment array up front so + steady-state operation hits zero `split_segment` calls. Pre-size + invariant test confirms zero splits at 1 M keys. The 27 % CPU spent + in `split_segment` during SET p=16 (PERF-07) is fully eliminated. +- **`Database::set` rewrite.** New `DashTable::insert_or_update` / + `Segment::insert_or_update_at` single-probe helpers replace the + previous `find + remove + insert` triple-probe pattern. +- **`Segment::find` fallback elimination + force-inlined SIMD.** The + cold "key spilled to non-home group" fallback path is removed once + `has_non_home_keys` is invariant-tracked on insert (the + `insert_or_update_at` change above already maintains the flag); + the SIMD probe helpers are `#[inline(always)]`. PERF-09 + attributed 12.65 % of `Segment::find` self-time to the fallback; + remaining cost is the irreducible per-hit `memcmp` confirm + (threshold amended to <3 %). 1 M-key correctness gate validates + zero false positives/negatives. + +### Performance — Memory observability (Phase 190) + +- **`moon_memory_bytes{kind=…}` Prometheus gauge.** Seven subsystem + labels — `dashtable`, `hnsw`, `csr`, `wal`, `sealed_replication_backlog`, + `allocator_overhead`, and the rolled-up `total`. Updated every + scrape via a single hook so the sum reconciles to `RSS` within the + CI tolerance window. +- **`MEMORY DOCTOR` full schema.** Multi-line RESP response covering + every subsystem, the rolled-up total, and a derived `allocator_overhead` + pseudo-kind (RSS − Σ subsystems). Adds operator triage signal beyond + the legacy single-line summary. +- **`resident_bytes()` trait** implemented across `Database`, + `DashTable`, `VectorStore` (HNSW + IVF), `GraphStore` (CSR + SlotMap), + `WalWriter`, `ReplicationBacklog` (sealed-segment side), and + `AllocatorOverhead`. Zero-allocation, on-demand poll. +- **Memory steady-state CI job.** `scripts/bench-memory-steady-state.sh` + + baseline fixture; gate widened to `±10 %` on RSS / Σ ratio after a + Linux-CI tolerance pass. + +### Changed — Allocator UX (Phase 191) + +- **jemalloc `narenas:8` cap** with `--memory-arenas-cap ` CLI + override. Caps the per-CPU arena explosion that inflates VSZ on + high-core hosts; mostly a cosmetic fix on Linux containers but + produces a meaningfully tighter `top`/`ps` reading for operators. +- **Tri-state allocator selection.** New `mimalloc-alt` cargo + feature alongside the existing `jemalloc` / `mimalloc` (fallback) + paths; mutually exclusive at compile time. A/B benchmark script + `scripts/bench-allocator-ab.sh` ships with the release. +- **`docs/OPERATOR-GUIDE.md` — Memory Accounting section.** Documents + the VSZ-vs-RSS distinction, MEMORY DOCTOR field-by-field, and the + `--memory-arenas-cap` / `mimalloc-alt` tuning knobs. + +### Added — Dispatch Observability (Phase 177) + +- **`moon_dispatch_path_total{path=...}` Prometheus counter**: four-way classification of every command by shard-routing decision — `local_inline` (SIMD fast path), `local` (standard local branch), `cross_read_fast` (RwLock shared-read bypass of SPSC), `cross_spsc` (deferred cross-shard write via `PipelineBatchSlotted`). Ratio `cross_spsc / Σ` is the ground-truth signal for dispatch-layer optimization work. Zero-allocation hot-path overhead (`&'static str` labels, `#[inline]` with early-return on `!METRICS_INITIALIZED`). Verified on macOS + Linux: counter sums close exactly to driven traffic, no overcount. + +### Changed + +- **`text-index` is now a default feature.** BM25 full-text search (`FT.SEARCH` BM25 mode), `FT.AGGREGATE`, and three-way RRF hybrid fusion are included in all standard builds. No longer requires `--features text-index`. To exclude it (e.g. minimal embedded builds): `--no-default-features --features runtime-monoio,jemalloc,graph`. + +### Added — SDK Validation + +- **Python SDK `sdk/python/examples/validate.py`**: End-to-end live validator for all SDK sub-clients: ping, strings, counter, hash, list, set, zset, vector index lifecycle, graph engine, session search, semantic cache, text search (BM25 + aggregate + hybrid), and server info. Result against Moon with `text-index`: **114 PASS / 0 FAIL / 0 SKIP**. Gracefully skips text sections when server built without `text-index`. +- **Rust SDK `sdk/rust/examples/validate.rs`**: Re-validated against `text-index` build — **85 PASS / 0 FAIL**. + +### Fixed — Python SDK + +- **`moondb.graph._parse_neighbors`**: server returns alternating `[edge_map, node_map, ...]` as flat key-value arrays (`b'id'`, `int`, `b'src'`, `int`, `b'dst'`, `int`, …). Previous parser expected positional `[node_id, label, props]` — caused `int() on b'id'` crash. Now correctly identifies node entries by `labels` key and parses them from the flat kv format. + +### Fixed — CI Hygiene + +- **`tests/pipeline_auto_index.rs`**: tighten outer cfg from `runtime-tokio` to `all(runtime-tokio, text-index)` so the file compiles to zero tests when text-index is disabled. Previously the file compiled but the FT.SEARCH text fast path was `#[cfg]`-ed out, causing `@name:corpus` queries to fall through to the KNN-only parser and panic with "invalid KNN query syntax". +- **4 FT unwraps**: add inline `#[allow(clippy::unwrap_used)]` with invariant justifications in `vector_search/ft_text_search.rs` (3 sites inside `apply_post_processing` where `do_summarize` / `do_highlight` implies the Option is Some) and `handler_monoio/ft.rs:165` (`is_text` was derived from `query_bytes.as_ref().map_or(false, _)`). Restores the audit-unwrap baseline to 0. + +### Compatibility + +- **Wire protocol**: unchanged. Drop-in replacement for v0.1.11. +- **Persistence on-disk format**: unchanged. +- **Default feature set**: `text-index` is now on by default. Minimal + embedded builds need an explicit `--no-default-features --features + runtime-monoio,jemalloc,graph`. + ## [0.1.11] — 2026-04-27 Hot-path perf release — eliminates two atomic-CAS hot paths in the write @@ -537,29 +653,47 @@ All 2450 lib tests passing locally (`cargo test --no-default-features - **Persistence on-disk format**: unchanged. - **Replication wire format**: unchanged. -## [Unreleased] - -### Added — Dispatch Observability (Phase 177) - -- **`moon_dispatch_path_total{path=...}` Prometheus counter**: four-way classification of every command by shard-routing decision — `local_inline` (SIMD fast path), `local` (standard local branch), `cross_read_fast` (RwLock shared-read bypass of SPSC), `cross_spsc` (deferred cross-shard write via `PipelineBatchSlotted`). Ratio `cross_spsc / Σ` is the ground-truth signal for dispatch-layer optimization work. Zero-allocation hot-path overhead (`&'static str` labels, `#[inline]` with early-return on `!METRICS_INITIALIZED`). Verified on macOS + Linux: counter sums close exactly to driven traffic, no overcount. - -### Fixed — CI Hygiene - -- **`tests/pipeline_auto_index.rs`**: tighten outer cfg from `runtime-tokio` to `all(runtime-tokio, text-index)` so the file compiles to zero tests when text-index is disabled. Previously the file compiled but the FT.SEARCH text fast path was `#[cfg]`-ed out, causing `@name:corpus` queries to fall through to the KNN-only parser and panic with "invalid KNN query syntax". -- **4 FT unwraps**: add inline `#[allow(clippy::unwrap_used)]` with invariant justifications in `vector_search/ft_text_search.rs` (3 sites inside `apply_post_processing` where `do_summarize` / `do_highlight` implies the Option is Some) and `handler_monoio/ft.rs:165` (`is_text` was derived from `query_bytes.as_ref().map_or(false, _)`). Restores the audit-unwrap baseline to 0. - -### Changed - -- **`text-index` is now a default feature.** BM25 full-text search (`FT.SEARCH` BM25 mode), `FT.AGGREGATE`, and three-way RRF hybrid fusion are included in all standard builds. No longer requires `--features text-index`. To exclude it (e.g. minimal embedded builds): `--no-default-features --features runtime-monoio,jemalloc,graph`. - -### Added — SDK Validation - -- **Python SDK `sdk/python/examples/validate.py`**: End-to-end live validator for all SDK sub-clients: ping, strings, counter, hash, list, set, zset, vector index lifecycle, graph engine, session search, semantic cache, text search (BM25 + aggregate + hybrid), and server info. Result against Moon with `text-index`: **114 PASS / 0 FAIL / 0 SKIP**. Gracefully skips text sections when server built without `text-index`. -- **Rust SDK `sdk/rust/examples/validate.rs`**: Re-validated against `text-index` build — **85 PASS / 0 FAIL**. - -### Fixed — Python SDK - -- **`moondb.graph._parse_neighbors`**: Server returns alternating `[edge_map, node_map, ...]` as flat key-value arrays (`b'id'`, `int`, `b'src'`, `int`, `b'dst'`, `int`, …). Previous parser expected positional `[node_id, label, props]` — caused `int() on b'id'` crash. Now correctly identifies node entries by `labels` key and parses them from the flat kv format. +## [0.1.10] — 2026-04-23 + +Stable replication marker. **Single-shard PSYNC2 wired end-to-end and +production-ready** for `--shards 1` master with any `--shards N` replica +topology. Multi-shard master PSYNC is scheduled for v0.2 (see +`.planning/rfcs/multi-shard-replication-design.md`). + +- **Replication** (`081c43b`): single-shard master PSYNC2 end-to-end wired, + REPLCONF validated, `master_link_status` reports the actual handshake + state instead of the legacy `up` stub. +- **Performance**: batch-level eviction gate; `try_handle_*` paths + `#[inline]`-ed; DashTable carries through the v0.1.10 pre-size + groundwork (capacity hint + headroom). +- **Docs**: BENCHMARK.md §2.7 updated with the 2026-04-22 GCloud + re-measurement; v0.1.x replication scope documented under + `docs/guides/clustering.mdx#replication`. + +## [0.1.9] — 2026-04-19 + +**Lunaris Retriever Gap Closure.** Every v0.1.8 client-side fallback in +the Lunaris SDK is now closed so `HybridRRFRetriever` (dense path), +`GraphFirstRetriever`, and `PathReasoningRetriever` run Moon-native. + +- **Phase 167 CYP-01/02**: Cypher `CREATE` / `MERGE` writes participate + in `CrossStoreTxn` via `record_graph()`; `TXN.ABORT` rolls them back. +- **Phase 168 CYP-03/06**: `coalesce()` built-in + single-hop edge-var + binding in variable-length `EXPAND`. +- **Phase 169 CYP-04/05**: `shortestPath()` parser + Dijkstra executor + bridge with path-variable binding. +- **Phase 170 HYB-01/02/04**: `FT.SEARCH HYBRID` dense stream honours + `as_of_lsn`. +- **Phase 171 SCAT-01/02/03**: `ShardMessage::VectorSearch` + + `FtHybridPayload` carry `as_of_lsn` for multi-shard `AS_OF` correctness. +- **Phase 172 PIPE-01/02/03**: pipeline-aware HSET auto-indexing + regression guard (3-test suite). + +Audit status: **PASSED_WITH_DOCUMENTED_DEFERRALS**. 15 / 20 requirements +fully satisfied; HYB-03 BM25 MVCC deferred and closed in v0.1.10 +follow-up (G-1); Phase 173 hygiene HYG-02 handler split RFC'd. + +Stats: 6 phases shipped, 17 plans, 27 files changed, +2924 / −376 LOC. ## [0.1.8] — 2026-04-18 @@ -837,7 +971,14 @@ All 2450 lib tests passing locally (`cargo test --no-default-features - **Security hardening:** `deny.toml` (cargo-deny), `SECURITY.md`, `docs/THREAT-MODEL.md`, `docs/security/lua-sandbox.md`, TLS cipher suite freeze - **Release engineering:** `docs/versioning.md`, 6 operator runbooks, CHANGELOG CI gate, user docs (getting-started, configuration, monitoring), release pipeline SHA256 checksums + SBOM + cosign -## [Earlier Unreleased] - Dispatch Hot-Path Recovery (2026-04-08) +## [0.1.3] — 2026-04-10 + +Production-readiness foundation: dispatch hot-path recovery, vector-search +4× QPS + correctness fixes, and the tiered disk-offload landing with 100 % +crash recovery across 7 persistence configurations. Bundles three work +streams originally tracked as separate Unreleased blocks (Apr 7–8). + +### Dispatch Hot-Path Recovery (2026-04-08) **Pipelined SET +37%, pipelined GET +68% at p=16 after PR #43 regression recovery.** @@ -934,9 +1075,7 @@ captured as todo in `.planning/todos/pending/`. --- -## [Unreleased] - Vector Search 4x QPS + Correctness - -### Vector Search Performance & Correctness (2026-04-07) +### Vector Search 4× QPS + Correctness (2026-04-07) **4x search QPS, 4.1x lower latency, 2.56x faster than Qdrant on real MiniLM data.** @@ -991,9 +1130,9 @@ embeddings (clustered) achieve 0.92-0.97 recall with the same code. --- -## [Earlier Unreleased] - Disk Offload & x86_64 Performance +### Disk Offload & x86_64 Performance (2026-04-06) -Tiered storage, crash recovery, and 2x Redis on x86_64 (Intel Xeon, io_uring). +Tiered storage, crash recovery, and 2× Redis on x86_64 (Intel Xeon, io_uring). ### Added diff --git a/README.md b/README.md index 5fcfbbe7..350c8861 100644 --- a/README.md +++ b/README.md @@ -7,11 +7,12 @@

- Version + Version Rust SDK Python SDK License - Status + Status + Cluster status Rust Protocol

@@ -20,17 +21,38 @@ Quick startWhy MoonBenchmarks • + Moon vs Redis vs Valkey • + Production readinessDocsChangelog

--- -> **⚠ Experimental.** Moon is under active development and **not** recommended for production. Storage formats, APIs, and config flags may change between releases. Please [open an issue](https://github.com/pilotspace/moon/issues) if something breaks. +> **Production-grade architecture, pre-1.0 maturity.** Single-node Moon +> (`--shards N` master, `--shards 1` for replication-eligible workloads) +> is recommended for production caching, vector / graph / feature-store +> workloads, and Redis-compatible OLTP. Multi-node clustering and +> multi-shard master PSYNC are **alpha in v0.2** — see +> [Production readiness](#production-readiness) for the honest matrix of +> what is and isn't yet GA. Wire protocol and on-disk format are LTS as of +> v0.2 (`docs/STORAGE-FORMAT-V1.md`); CLI flags may still evolve until +> v1.0. [Open an issue](https://github.com/pilotspace/moon/issues) if +> something breaks. --- -Moon speaks the Redis wire protocol (RESP2/RESP3) and implements 230+ commands. It runs on **Linux** (io_uring via monoio) and **macOS** (kqueue via monoio) with a thread-per-core, shared-nothing architecture, per-shard WAL, tiered disk offload, an in-process vector search engine with BM25 full-text search, a property graph engine with Cypher subset, cross-store ACID transactions, workspace partitioning, durable message queues, bi-temporal MVCC, and an embedded web console. Any Redis client connects out of the box. +Moon is a clean-room Rust rewrite of a Redis-compatible in-memory data +store with first-class AI primitives. It speaks the Redis wire protocol +(RESP2/RESP3) and implements 230+ commands — every standard Redis data +type plus native `FT.*` vector + BM25 search, `GRAPH.*` Cypher, `TXN.*` +cross-store ACID, workspaces, durable message queues, bi-temporal MVCC, +and an embedded web console. Primary target is **Linux** with io_uring +(`monoio`); a `tokio` runtime is available for portability. **macOS** is a +supported development platform (kqueue via monoio); production +deployments should target Linux (see +[`docs/PRODUCTION-CONTRACT.md`](docs/PRODUCTION-CONTRACT.md) Tier 1/2). +Any Redis client connects out of the box. ## Why Moon @@ -61,18 +83,30 @@ Moon speaks the Redis wire protocol (RESP2/RESP3) and implements 230+ commands. ## Benchmarks -Measured vs Redis 8.6.1, co-located client and server, pipeline depth tuned per row. Full methodology and reproduction steps in [BENCHMARK.md](BENCHMARK.md) and [docs/benchmarks.mdx](docs/benchmarks.mdx). +Measured vs Redis 8.6.1 (peak throughput) and Redis 8.0.2 + Valkey 9.1.0 +(hash-TTL surface, the only workload where all three were head-to-head +benchmarked). Co-located client and server, pipeline depth tuned per +row. Full methodology and reproduction steps in +[BENCHMARK.md](BENCHMARK.md) and [docs/benchmarks.mdx](docs/benchmarks.mdx). +Valkey peak-throughput columns are intentionally blank — a head-to-head +peak-RPS bench on identical hardware has not yet been run; the +[Moon vs Redis vs Valkey](#moon-vs-redis-vs-valkey) section quotes +Valkey's vendor-published 2.1M RPS (9 I/O threads, p=10) for context. ### Peak throughput (GCloud c3-standard-8, x86_64, monoio io_uring) -| Workload | Moon | Redis | Ratio | -|----------------------------------|-------:|------:|:------:| -| Peak GET (c=50, p=64) | 5.11M | 2.98M | **1.72×** | -| Peak SET (c=50, p=64) | 3.50M | 1.82M | **1.92×** | -| GET, production defaults (AOF+disk-offload) | 4.76M | 2.46M | **1.93×** | -| GET, max durability (fsync always)| 4.85M | 2.45M | **1.98×** | -| Memory, values ≥ 1 KB | — | — | **27–35% less** | -| Crash recovery (SIGKILL, 5K keys)| 100% | 100% | parity | +| Workload | Moon | Redis 8.6.1 | Valkey 9.1.0 | +|------------------------------------------------|-------:|------------:|-------------:| +| Peak GET (c=50, p=64) | 5.11M | 2.98M (1.72×) | not yet benched | +| Peak SET (c=50, p=64) | 3.50M | 1.82M (1.92×) | not yet benched | +| GET, production defaults (AOF + disk-offload) | 4.76M | 2.46M (1.93×) | not yet benched | +| GET, max durability (`fsync=always`) | 4.85M | 2.45M (1.98×) | not yet benched | +| Memory, values ≥ 1 KB | — | **27–35 % less** | not yet benched* | +| Crash recovery (SIGKILL, 5K keys) | 100 % | 100 % (parity)| 100 % (parity, vendor-claimed) | + +*Valkey 9.1 raised the embstr threshold to 128 B; below ~64 B Valkey +9.1 may be tighter than Moon. A head-to-head re-bench across the value +size curve is on the v0.2 roadmap. ### ARM64 (GCloud t2a-standard-8, Neoverse-N1) @@ -98,9 +132,13 @@ Measured vs Redis 8.6.1, co-located client and server, pipeline depth tuned per | Native API QPS | **19×** | N/A | | Bulk insert | **23×** | 1× | -### Hash-field TTL — Valkey 9.0/9.1 parity (OrbStack moon-dev, n=200K c=50, median of 3) +### Hash-field TTL — Valkey 9.0/9.1 parity (v0.2.0-alpha preview) -Three-way comparison on the per-field TTL surface added in v0.2.0. Full methodology + 26 scenarios in [docs/perf/2026-05-27-hash-ttl-3way-bench.md](docs/perf/2026-05-27-hash-ttl-3way-bench.md); reproducible via [scripts/bench-hash-ttl-3way.sh](scripts/bench-hash-ttl-3way.sh). +> **Status:** ships in **v0.2.0-alpha** (currently on `main`, no tagged +> release yet). Not present in the latest tagged build `v0.1.12`. Build +> from `main` to reproduce these numbers. + +OrbStack moon-dev, n=200K c=50, median of 3. Three-way comparison on the per-field TTL surface added in v0.2.0. Full methodology + 26 scenarios in [docs/perf/2026-05-27-hash-ttl-3way-bench.md](docs/perf/2026-05-27-hash-ttl-3way-bench.md); reproducible via [scripts/bench-hash-ttl-3way.sh](scripts/bench-hash-ttl-3way.sh). | Command | Pipeline | Moon | Redis 8.0.2 | Valkey 9.1.0 | |------------------|----------|------:|------------:|-------------:| @@ -111,7 +149,64 @@ Three-way comparison on the per-field TTL surface added in v0.2.0. Full methodol | `HGETEX EX` | p=1 | 250K | N/A | 251K | | `HGETEX` no-mode | p=1 | 250K | N/A | 253K | -Plain Hash HGET p=16 ties Redis (1.01×) and Valkey (1.00×). HEXPIRE-family Moon vs Valkey: 0.90–0.99× across the surface (Valkey leads HEXPIRE p=16 by 7%, HTTL p=16 by 10%; HGETEX hits 0.99× parity). Redis 8.x has no HEXPIRE-family — Moon is the only Redis-compatible alternative aside from Valkey. The internal `HashWithTtl` HGET / HLEN paths use a cached `min_expiry_ms` for an O(1) fast path that brings them to 1.03× of plain `Hash` (was 80× slower pre-fix; see PR #126). +Plain Hash HGET p=16 ties Redis (1.01×) and Valkey (1.00×). HEXPIRE-family Moon vs Valkey: 0.90–0.99× across the surface (Valkey leads HEXPIRE p=16 by 7 %, HTTL p=16 by 10 %; HGETEX hits 0.99× parity). Redis 8.x has no HEXPIRE-family — Moon is the only Redis-compatible alternative aside from Valkey. The internal `HashWithTtl` HGET / HLEN paths use a cached `min_expiry_ms` for an O(1) fast path that brings them to 1.03× of plain `Hash` (was 80× slower pre-fix; see PR #126). + +## Moon vs Redis vs Valkey + +Three Redis-protocol-compatible servers, three different bets. Moon +competes on a **vertical moat** — thread-per-core architecture and an +AI-native in-core surface. Valkey competes on a **horizontal moat** — +Linux Foundation governance, every major cloud, drop-in compatibility. +Redis OSS continues as the upstream reference but ships under SSPL since +March 2024. The deep architectural review is in +[`docs/comparison-valkey.md`](docs/comparison-valkey.md) (~22 KB, +traced to source). + +| Dimension | **Moon v0.1.12 / 0.2.0-alpha** | **Valkey 9.1.0** | **Redis 8.6.1 (OSS)** | +|--------------------------------------|--------------------------------|-----------------------------|----------------------------| +| Language / license | Rust 2024 / Apache-2.0 | C99 / BSD-3-Clause (LF TSC) | C99 / **SSPL** since 2024 | +| Threading model | Thread-per-core, shared-nothing | Single main thread + ≤9 I/O threads | Single-threaded core (+ I/O threads) | +| I/O driver (Linux) | io_uring (`monoio`) | epoll only | epoll only | +| Snapshot | **Forkless** (segment-level COW) | `fork()` + COW | `fork()` + COW | +| AOF / WAL | Per-shard WAL v3 + multi-part AOF | Single global AOF | Single global AOF | +| Tiered NVMe disk offload | **Yes** (under `maxmemory`) | No (OSS) | No (OSS — Redis Flash is Enterprise) | +| Vector search | **In-core** HNSW + TurboQuant 1-8 bit | `valkey-search` module, FP32 only | RediSearch module | +| Full-text BM25 | **In-core** | `valkey-search` module | RediSearch module | +| Property graph (Cypher) | **In-core** 14 `GRAPH.*` cmds | None | None (FalkorDB separate) | +| Cross-store ACID | `TXN.BEGIN/COMMIT/ABORT` | None | None | +| Hash-field TTL (`HEXPIRE`-family) | **Yes** (Valkey-parity, v0.2-alpha) | **Yes** (9.0+) | No | +| PITR + CDC | `--recovery-target-lsn` + `CDC.READ` (v0.2-alpha) | None | None | +| Embedded web console | **Yes** (7-view React, in-binary) | Valkey Admin GUI 1.0 (separate) | Redis Insight (separate) | +| Managed cloud offerings | None (yet) | AWS, GCP, Oracle, Aiven, … | Redis Cloud (vendor) | +| Multi-node cluster, soak-tested | **v0.2 alpha** (single-node GA today) | **Production** | **Production** | +| Atomic slot migration | Planned (v0.2) | Yes (9.0) | No | +| Peak single-server throughput | **5.11M GET/s** (c3-8 x86_64) | 2.1M RPS (9 I/O threads, p=10, vendor) | 2.98M GET/s (c3-8, same harness as Moon) | + +### When to choose Moon + +- Single-node Redis-compatible workloads where peak throughput, + memory efficiency at ≥1 KB values, or **forkless snapshots** matter. +- AI-native applications: vector search, GraphRAG, semantic cache, + hybrid BM25 + dense + sparse retrieval — all in one binary, no module + loader, with cross-store ACID across KV / vector / graph. +- Workloads that benefit from **tiered NVMe offload** under `maxmemory` + instead of LRU-eviction-then-rebuild. + +### When to choose Valkey + +- Multi-node clusters with proven 1000+ node operational mileage. +- Managed-cloud-only deployments (every major cloud offers Valkey). +- Strict drop-in compatibility with the Redis 7.2 module ecosystem + (`valkey-json`, `valkey-bloom`, `valkey-search`, `valkey-ldap`). +- Risk-averse environments where Linux Foundation governance is a + procurement requirement. + +### When to stay on Redis OSS + +- Existing investments in RediSearch / RedisJSON / RedisBloom under + RLEC, or pre-SSPL-tolerance OSS Redis. +- Specific Redis Enterprise features (CRDT active-active, Redis Flash) + that Moon and Valkey OSS do not match. ## Quick start @@ -148,9 +243,9 @@ OK "world" 127.0.0.1:6379> HSET user:1 name Alice age 30 (integer) 2 -127.0.0.1:6379> HEXPIRE user:1 3600 FIELDS 1 age # Valkey 9.0 per-field TTL +127.0.0.1:6379> HEXPIRE user:1 3600 FIELDS 1 age # Valkey 9.0 per-field TTL (v0.2.0-alpha; build from `main`) 1) (integer) 1 -127.0.0.1:6379> HTTL user:1 FIELDS 1 age +127.0.0.1:6379> HTTL user:1 FIELDS 1 age # v0.2.0-alpha 1) (integer) 3600 127.0.0.1:6379> FT.CREATE idx ON HASH PREFIX 1 doc: SCHEMA emb VECTOR HNSW 6 DIM 384 TYPE FLOAT32 DISTANCE_METRIC COSINE OK @@ -326,34 +421,123 @@ cargo flamegraph --bin moon -- --port 6399 --shards 1 Contribution guide and coding rules (unsafe policy, hot-path allocation rules, lock discipline) are in [CLAUDE.md](CLAUDE.md) and [UNSAFE_POLICY.md](UNSAFE_POLICY.md). -## Roadmap - -Moon is pre-1.0 and **experimental**. Current focus: - -- Correctness parity with Redis 8.x across the full command surface -- AI-native primitives: session dedup, hybrid vector+sparse search, agentic caching -- Multi-node clustering with gossip, slot migration, and PSYNC2 replication -- GPU-accelerated vector search (CUDA, feature-gated) -- Production hardening and SLO validation (see [docs/PRODUCTION-CONTRACT.md](docs/PRODUCTION-CONTRACT.md)) - -Completed in v0.1.0–v0.1.8: -- Tiered disk offload (RAM → NVMe) with 100% crash recovery -- In-process vector search (HNSW + TurboQuant 4/8-bit) with `FT.*` API -- BM25 full-text search with three-way hybrid fusion (BM25 + dense + sparse) -- Property graph engine with Cypher subset (14 `GRAPH.*` commands) -- Cross-store ACID transactions (`TXN.BEGIN`/`COMMIT`/`ABORT`) across KV, vector, and graph -- Workspace partitioning for multi-tenant namespace isolation -- Durable message queues with dead-letter and debounced triggers -- Bi-temporal MVCC for point-in-time KV and graph queries -- Web console (7 views, embedded in binary) -- macOS support (aarch64 + x86_64, both runtimes) -- Thread-per-core dispatch optimization (5.11M GET/s on x86_64) - -Production readiness is **not** a v0.1 goal. Storage formats, APIs, and config flags may change between releases. - -## Production Readiness - -Moon's v1.0 promises — SLOs, durability modes, supported platforms, and a machine-checkable GA exit-criteria checklist — live in **[docs/PRODUCTION-CONTRACT.md](docs/PRODUCTION-CONTRACT.md)**. The contract is the single source of truth every v0.1.3 hardening phase tests against. +## Production readiness + +Honest matrix of where Moon is today (v0.1.12 GA + v0.2.0-alpha +unreleased). Read alongside +[`docs/PRODUCTION-CONTRACT.md`](docs/PRODUCTION-CONTRACT.md) (the +machine-checkable GA exit criteria) and +[`docs/OPERATOR-GUIDE.md`](docs/OPERATOR-GUIDE.md) (memory accounting, +sizing, runbooks). + +### Recommended for production today + +- **Single-node deployments** — Linux aarch64 (Tier 1) or Linux x86_64 + (Tier 2). `--shards N` master, one process, one node. +- **Read replication** — `--shards 1` master with any `--shards N` + replica topology. Single-shard PSYNC2 is wired end-to-end since + v0.1.10. +- **AI workloads** — vector search (HNSW + TurboQuant), BM25 full-text, + GraphRAG, semantic caching, hybrid retrieval. All in-core, all + RDB / WAL durable, all crash-recovery validated. +- **Cache + feature store** — durability modes are honest + (`always` / `everysec` / `no` with documented recovery bounds), forkless + snapshots remove the Redis fork-COW RSS spike, tiered NVMe offload + under `maxmemory` keeps working sets larger than RAM. +- **Crash recovery** — 100 % survived across 7 persistence + configurations and 5 K-key SIGKILL workloads. RDB v2 + WAL v3 + + multi-part AOF + tiered cold tier all participate. + +### Not yet GA — avoid for production + +- **Multi-node clustering** (16 K-slot gossip, MOVED/ASK, failover) — + protocol-compatible code exists but **PSYNC2 atomic slot migration + is not soak-tested**. Valkey 9.0 shipped this; Moon has not. + Scheduled for v0.2. +- **Multi-shard master PSYNC** — single-shard only today; multi-shard + master replication is RFC'd in + [`.planning/rfcs/multi-shard-replication-design.md`](.planning/rfcs/multi-shard-replication-design.md). +- **PITR live-snapshot LSN wiring** (P3c) and **`CDC.SUBSCRIBE` push + channel** (C3b) — `CDC.READ` polling is alpha-ready; push and + zero-snapshot PITR are deferred to v0.2 follow-ups. +- **GPU vector acceleration** (`gpu-cuda` feature) — kernel scaffold + exists; production kernels not yet shipping. +- **macOS native** — first-class development platform with full feature + set minus io_uring, but production deployments should target Linux + per + [`docs/PRODUCTION-CONTRACT.md`](docs/PRODUCTION-CONTRACT.md) tiers. +- **Performance SLO numbers in + [`docs/PRODUCTION-CONTRACT.md`](docs/PRODUCTION-CONTRACT.md)** — marked + `[provisional]` until the Phase 97 24-h HDR-histogram rig validates + them on reference hardware. Use the benchmarks above as point-in-time + measurements, not committed SLOs. + +### Operator gotchas worth knowing before you deploy + +- **Multi-shard scaling needs load.** Aim for `clients ≥ 25 × shards` + on pipeline ≥ 16 workloads; below that, multiple shards under-subscribe + the dispatch loop and a single shard wins. Random keyspaces with + `p=1` benefit less than `{tag}`-co-located keys (see + [`CLAUDE.md`](CLAUDE.md) "Gotchas" + the v0.1.12 multi-shard memo). +- **Fairness flags for benchmarking against Redis / Valkey** — + `--disk-offload disable --appendonly no` removes Moon's durability + overhead (~26 % on SET p=64) so comparisons are apples-to-apples. +- **Memory accounting** — bill on RSS, not VSZ. `MEMORY DOCTOR`, + `moon_memory_bytes{kind=…}` Prometheus gauge, and + [`docs/OPERATOR-GUIDE.md#memory-accounting`](docs/OPERATOR-GUIDE.md#memory-accounting) + cover the full VSZ-vs-RSS guide and tuning knobs + (`--memory-arenas-cap`, `mimalloc-alt`). +- **RDB hash-field TTL trailer** is RDB v2 only — older v1 readers stop + after the hash body and silently drop per-field TTLs. Pin storage + format to `v1` (the umbrella covering v2/v3 sub-formats) per + [`docs/STORAGE-FORMAT-V1.md`](docs/STORAGE-FORMAT-V1.md). + +### Roadmap + +| Milestone | Focus | Status | +|------------------|---------------------------------------------------------------------------------------------|-------------| +| **v0.2.0** (next) | Multi-node clustering soak (PSYNC2 + atomic slot migration); PITR P3c; CDC push (`SUBSCRIBE`) | alpha | +| **v0.2.x** | GPU vector acceleration (`gpu-cuda`); operator runbooks; full SLO lock-in (`PERF-01..05`) | planned | +| **v1.0** | Every [`PRODUCTION-CONTRACT.md`](docs/PRODUCTION-CONTRACT.md) GA exit-criteria box ticked | gate | + +What's already in `main` (v0.1.0 → v0.2.0-alpha, 14 months of work): + +> Hash-field TTL and PITR + CDC ship in **v0.2.0** — currently alpha on +> `main`, no tagged release yet. **v0.1.12** is the latest tag and does +> NOT include them. Everything else below is in `v0.1.12` GA. + +**Shipped in v0.1.12 (latest tag, single-node production-grade):** + +- Forkless persistence (RDB v2 + per-shard WAL v3 + multi-part AOF). +- Tiered disk offload (RAM → NVMe) with 100 % crash recovery. +- In-process vector search (HNSW + TurboQuant 1–8-bit) — `FT.*` surface. +- BM25 full-text search + three-way RRF hybrid fusion (BM25 + dense + sparse). +- Property graph engine with Cypher subset (14 `GRAPH.*` commands). +- Cross-store ACID (`TXN.BEGIN` / `COMMIT` / `ABORT`) across KV, vector, + graph. +- Workspaces, durable message queues, bi-temporal MVCC. +- Web console (7-view React app, embedded in binary). +- Thread-per-core dispatch optimization (5.11M GET/s on x86_64). + +**Added in v0.2.0-alpha (on `main`, untagged):** + +- Hash-field TTL — Valkey 9.0 / 9.1 parity, O(1) HGET / HLEN fast path + (`HEXPIRE`, `HEXPIREAT`, `HPEXPIRE`, `HPEXPIREAT`, `HEXPIRETIME`, + `HPEXPIRETIME`, `HTTL`, `HPTTL`, `HPERSIST`, `HGETEX`, `HGETDEL`). +- PITR — `--recovery-target-lsn` deterministic WAL replay-to-LSN. +- CDC — `CDC.READ` pull-mode change stream (push-mode planned for v0.2.0 + GA). +- Multi-node clustering soak (PSYNC2 + atomic slot migration). + +## Production Readiness Contract + +Moon's v1.0 promises — per-command-class SLOs, durability modes, supported +platforms, security guarantees, and a machine-checkable GA exit-criteria +checklist — live in +**[`docs/PRODUCTION-CONTRACT.md`](docs/PRODUCTION-CONTRACT.md)**. Every +v0.1.3+ hardening phase ticks off items on that checklist; nothing +promotes to `v1.0-rc1` until every box is green. The contract is the +single source of truth for what Moon owes you in production. ## Credits diff --git a/docs/runbooks/multi-shard-aof-rewrite.md b/docs/runbooks/multi-shard-aof-rewrite.md new file mode 100644 index 00000000..9c33b609 --- /dev/null +++ b/docs/runbooks/multi-shard-aof-rewrite.md @@ -0,0 +1,97 @@ +# Runbook — multi-shard + AOF refused at startup + +**Status:** active (Moon ≥ v0.1.13). Lifted in v2.0 once multi-shard +AOF replay walks every shard's segment manifest on recovery. + +## What you saw + +### At startup + +``` +REFUSING TO START: --shards 2 + --appendonly yes has a known data-loss +bug on SIGKILL (~50 % loss verified 2026-05-26). Fix: use --shards 1, +or pass --appendonly no for cache-only deployments, or pass +--unsafe-multishard-aof to acknowledge the risk and start anyway. See +docs/runbooks/multi-shard-aof-rewrite.md. +``` + +### Or at command time (defence-in-depth, fires under any escape-hatch) + +``` +> BGREWRITEAOF +(error) ERR BGREWRITEAOF is unsafe with --shards >= 2 + --disk-offload enable + + --appendonly yes (known data-loss bug; see + docs/runbooks/multi-shard-aof-rewrite.md). Use --shards 1, set + --disk-offload disable, or wait for v2.0 multi-part AOF replay. +``` + +## Why this gate exists + +Verified on `main` at commit `6e49050` (2026-05-26), reproducers in +[`tmp/p0-no-rewrite.sh`](../../tmp/p0-no-rewrite.sh), +[`tmp/p0-always.sh`](../../tmp/p0-always.sh), +[`tmp/p0-multishard-no-offload.sh`](../../tmp/p0-multishard-no-offload.sh): + +| Configuration | Result | +|----------------------------------------------------------------------------|------------------------------| +| `--shards 1 --appendonly yes --appendfsync always` (control) | ✅ Recovers 5000 / 5000 | +| `--shards 1 --disk-offload enable --appendonly yes` (control) | ✅ Recovers 12 714 / 12 714 | +| `--shards 2 --disk-offload enable --appendonly yes --appendfsync everysec` | ❌ Loses 38 % (12 662 → 7 892) | +| `--shards 2 --disk-offload enable --appendonly yes --appendfsync always` | ❌ Loses 50 % (5000 → 2474) | +| `--shards 2 --disk-offload disable --appendonly yes --appendfsync always` | ❌ Loses 50 % (5000 → 2453) | + +**The bug is in the multi-shard AOF durability path itself**, not the +rewrite path and not the disk-offload tier. `--appendfsync always` and +`--disk-offload disable` do not save you — only `--shards 1` does. + +The rewrite-specific gate (the `BGREWRITEAOF` error above) is still +shipped as defence-in-depth for anyone who passes +`--unsafe-multishard-aof`. + +## How to recover from a triggered loss (if you hit this on < v0.1.13) + +1. If a recent RDB snapshot exists in `--dir`, stop the server, move + `appendonlydir/` aside, and let recovery rebuild from RDB + + surviving per-shard WAL only. RPO equals the time since the RDB + snapshot. +2. If replication was running, promote a non-affected replica + (`REPLICAOF NO ONE`) and re-sync the affected node. +3. If neither: data is lost. File a `P0` with the AOF manifest + contents (`appendonlydir/moon.aof.manifest`) and per-shard WAL + sizes. + +## How to avoid the gate + +Pick whichever matches your deployment: + +| Option | Trade-off | +|----------------------------------------------------|------------------------------------------------------------------------------------------| +| `--shards 1` | **Recommended.** Best throughput on non-pipelined workloads; gives up multi-shard fan-out | +| `--appendonly no` | Cache-only deployments; durability falls back to `save` rules + RDB recovery | +| `--unsafe-multishard-aof` | **Discouraged.** Acknowledges ~50 % loss on crash; suitable only for ephemeral caches | + +The first option also clears the `BGREWRITEAOF` defence-in-depth gate. + +## When will this be removed? + +v2.0 ships multi-shard AOF replay that walks every shard's segment +manifest on recovery. Both gates (startup refusal + `BGREWRITEAOF` +error) are removed at the same time. Track progress at +[`tmp/SHIP-PLAN-v1.0-rc1-single-node.md`](../../tmp/SHIP-PLAN-v1.0-rc1-single-node.md) +§ Track B. + +## Telemetry + +When `--unsafe-multishard-aof` is passed AND the suspect config is set, +the BGREWRITEAOF-specific gate also logs at startup at `WARN`: + +``` +BGREWRITEAOF gated for this config (known data-loss path; see +docs/runbooks/multi-shard-aof-rewrite.md). Use --shards 1 or +--disk-offload disable to re-enable rewrite. +``` + +Each gated `BGREWRITEAOF` invocation also returns the documented `ERR` +line at the wire, so any operator dashboard tailing `slowlog` or +client-side error counters will surface the refusal immediately. A +dedicated Prometheus gauge for both gates is on the v1.0-rc1 backlog. diff --git a/src/command/persistence.rs b/src/command/persistence.rs index 8ca5b48e..10980fe3 100644 --- a/src/command/persistence.rs +++ b/src/command/persistence.rs @@ -7,9 +7,7 @@ use std::sync::atomic::{AtomicBool, AtomicU64, Ordering}; use bytes::Bytes; use tracing::{error, info}; -use crate::runtime::channel; - -use crate::persistence::aof::AofMessage; +use crate::persistence::aof::{AofMessage, AofPoolSendError, AofWriterPool}; use crate::persistence::rdb; use crate::protocol::Frame; use crate::storage::Database; @@ -33,6 +31,21 @@ pub static BGSAVE_SHARDS_REMAINING: AtomicU64 = AtomicU64::new(0); /// Whether the last BGSAVE completed successfully. pub static BGSAVE_LAST_STATUS: AtomicBool = AtomicBool::new(true); +/// Process-wide gate set at startup when the configuration combination +/// `shards >= 2 + --disk-offload enable + --appendonly yes` is selected. +/// +/// `BGREWRITEAOF` under this combination silently truncates the WAL of every +/// shard except the rewriter's own shard while the consolidated multi-part AOF +/// base RDB written by the rewrite is **not** consumed on restart (verified +/// 2026-05-26 against HEAD `6e49050`: 38 % data loss reproducible). Until the +/// v2.0 multi-part AOF replay walks every shard's segment manifest, the only +/// safe behavior is to refuse the command in this config and point operators +/// at the runbook. +/// +/// Set once in `main.rs` after CLI parsing; never cleared. Checked by +/// `bgrewriteaof_start_sharded` before dispatching the rewrite message. +pub static MULTI_SHARD_AOF_REWRITE_UNSAFE: AtomicBool = AtomicBool::new(false); + /// Global flag indicating whether a BGREWRITEAOF rewrite is currently in progress. /// /// Set to `true` when `bgrewriteaof_start` or `bgrewriteaof_start_sharded` dispatches @@ -208,14 +221,31 @@ pub fn bgsave_shard_done(success: bool) { } } +/// Translate an `AofWriterPool` send failure into a user-facing RESP error. +/// Under PerShard layout, `pool.try_send_rewrite` returns +/// `RewriteUnsupportedInPerShard` — the per-shard rewrite path lands in +/// step 6 of the per-shard AOF RFC. Until then BGREWRITEAOF refuses with +/// a stable error rather than silently no-op'ing. +fn rewrite_pool_error_frame(err: AofPoolSendError) -> Frame { + match err { + AofPoolSendError::RewriteUnsupportedInPerShard => Frame::Error(Bytes::from_static( + b"ERR BGREWRITEAOF is not yet supported under per-shard AOF layout; per-shard rewrite ships in step 6 of the per-shard AOF migration", + )), + AofPoolSendError::SendFailed => Frame::Error(Bytes::from_static( + b"ERR Background AOF rewrite failed to start", + )), + } +} + /// Start a background AOF rewrite (BGREWRITEAOF command). /// -/// Sends a Rewrite message to the AOF writer task, which will generate -/// synthetic commands from current database state and replace the AOF file. +/// Submits a Rewrite message through the writer pool, which generates +/// synthetic commands from current database state and replaces the AOF +/// file. /// /// Uses CAS to set `AOF_REWRITE_IN_PROGRESS`: if a rewrite is already running, /// returns an error immediately without corrupting the in-flight rewrite state. -pub fn bgrewriteaof_start(aof_tx: &channel::MpscSender, db: SharedDatabases) -> Frame { +pub fn bgrewriteaof_start(pool: &AofWriterPool, db: SharedDatabases) -> Frame { // CAS: only proceed if currently false; prevents a second caller from // clearing the flag while the first rewrite is still in progress. if AOF_REWRITE_IN_PROGRESS @@ -226,16 +256,15 @@ pub fn bgrewriteaof_start(aof_tx: &channel::MpscSender, db: SharedDa b"ERR Background AOF rewrite already in progress", )); } - match aof_tx.try_send(AofMessage::Rewrite(db)) { + match pool.try_send_rewrite(AofMessage::Rewrite(db)) { Ok(()) => Frame::SimpleString(Bytes::from_static( b"Background append only file rewriting started", )), - Err(_) => { - // Channel send failed — rewrite never started; we set the flag so we clear it. + Err(e) => { + // Send failed (channel full) or PerShard rejection — rewrite never + // started, so clear the in-progress flag we just set. AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst); - Frame::Error(Bytes::from_static( - b"ERR Background AOF rewrite failed to start", - )) + rewrite_pool_error_frame(e) } } } @@ -245,9 +274,18 @@ pub fn bgrewriteaof_start(aof_tx: &channel::MpscSender, db: SharedDa /// Uses CAS to set `AOF_REWRITE_IN_PROGRESS`: if a rewrite is already running, /// returns an error immediately without corrupting the in-flight rewrite state. pub fn bgrewriteaof_start_sharded( - aof_tx: &channel::MpscSender, + pool: &AofWriterPool, shard_databases: std::sync::Arc, ) -> Frame { + // Refuse the rewrite under the known-unsafe config combo (see the + // MULTI_SHARD_AOF_REWRITE_UNSAFE doc comment). This is the + // single-node v1.0-rc1 gate; the v2.0 multi-part AOF replay fix lifts + // it. + if MULTI_SHARD_AOF_REWRITE_UNSAFE.load(Ordering::Relaxed) { + return Frame::Error(Bytes::from_static( + b"ERR BGREWRITEAOF is unsafe with --shards >= 2 + --disk-offload enable + --appendonly yes (known data-loss bug; see docs/runbooks/multi-shard-aof-rewrite.md). Use --shards 1, set --disk-offload disable, or wait for v2.0 multi-part AOF replay.", + )); + } // CAS: only proceed if currently false; prevents a second caller from // clearing the flag while the first rewrite is still in progress. if AOF_REWRITE_IN_PROGRESS @@ -258,16 +296,15 @@ pub fn bgrewriteaof_start_sharded( b"ERR Background AOF rewrite already in progress", )); } - match aof_tx.try_send(AofMessage::RewriteSharded(shard_databases)) { + match pool.try_send_rewrite(AofMessage::RewriteSharded(shard_databases)) { Ok(()) => Frame::SimpleString(Bytes::from_static( b"Background append only file rewriting started", )), - Err(_) => { - // Channel send failed — rewrite never started; we set the flag so we clear it. + Err(e) => { + // Send failed (channel full) or PerShard rejection — rewrite never + // started, so clear the in-progress flag we just set. AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst); - Frame::Error(Bytes::from_static( - b"ERR Background AOF rewrite failed to start", - )) + rewrite_pool_error_frame(e) } } } @@ -358,4 +395,66 @@ mod tests { // Note: verify the static exists and is accessible let _ = BGSAVE_LAST_STATUS.load(Ordering::Relaxed); } + + // Serialize the multi-shard gate test against any other test mutating + // the gate or AOF_REWRITE_IN_PROGRESS (parallel test runner otherwise + // races on the process-wide AtomicBools). + static GATE_TEST_LOCK: parking_lot::Mutex<()> = parking_lot::Mutex::new(()); + + #[test] + fn test_bgrewriteaof_sharded_refuses_under_unsafe_config() { + let _guard = GATE_TEST_LOCK.lock(); + // Use a small bounded channel so the test does not need an AOF + // writer task; the gate must fire BEFORE try_send is reached. + // Wrap as a TopLevel pool to match the post-2e-β helper signature. + let (tx, _rx) = crate::runtime::channel::mpsc_bounded::(1); + let pool = AofWriterPool::top_level(tx); + let shard_dbs = crate::shard::shared_databases::ShardDatabases::new( + vec![vec![crate::storage::Database::new()]], + ); + + // Snapshot prior state so the test is order-independent. + let prior = MULTI_SHARD_AOF_REWRITE_UNSAFE.load(Ordering::Relaxed); + let prior_in_progress = AOF_REWRITE_IN_PROGRESS.load(Ordering::SeqCst); + AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst); + + // Gate ON → must refuse with the documented ERR (and must NOT flip + // AOF_REWRITE_IN_PROGRESS, otherwise a normal rewrite gets blocked). + MULTI_SHARD_AOF_REWRITE_UNSAFE.store(true, Ordering::Relaxed); + let frame = bgrewriteaof_start_sharded(&pool, shard_dbs.clone()); + match frame { + Frame::Error(msg) => { + let s = std::str::from_utf8(&msg).unwrap(); + assert!( + s.contains("BGREWRITEAOF is unsafe") + && s.contains("multi-shard-aof-rewrite.md"), + "unexpected error: {s}" + ); + } + other => panic!("expected Frame::Error, got {other:?}"), + } + assert!( + !AOF_REWRITE_IN_PROGRESS.load(Ordering::SeqCst), + "gate must not set AOF_REWRITE_IN_PROGRESS" + ); + + // Gate OFF → the gate error must NOT fire. (Without an AOF writer + // task draining the channel, the second call may succeed or return + // "failed to start" depending on buffer state; the contract under + // test here is only that the gate error is gone.) + MULTI_SHARD_AOF_REWRITE_UNSAFE.store(false, Ordering::Relaxed); + AOF_REWRITE_IN_PROGRESS.store(false, Ordering::SeqCst); + let frame2 = bgrewriteaof_start_sharded(&pool, shard_dbs); + if let Frame::Error(msg) = &frame2 { + let s = std::str::from_utf8(msg).unwrap(); + assert!( + !s.contains("BGREWRITEAOF is unsafe"), + "gate error fired with gate off: {s}" + ); + } + + // Restore prior state. + AOF_REWRITE_IN_PROGRESS.store(prior_in_progress, Ordering::SeqCst); + MULTI_SHARD_AOF_REWRITE_UNSAFE.store(prior, Ordering::Relaxed); + } } diff --git a/src/config.rs b/src/config.rs index b384a818..8ec55ad4 100644 --- a/src/config.rs +++ b/src/config.rs @@ -102,6 +102,17 @@ pub struct ServerConfig { #[arg(long, default_value = "yes")] pub appendonly: String, + /// Acknowledge the known multi-shard AOF durability bug and start + /// anyway. Verified 2026-05-26 on HEAD `6e49050`: + /// `--shards >= 2 + --appendonly yes` loses ~50 % of writes on SIGKILL + /// regardless of `--appendfsync` or `--disk-offload` settings. Until + /// the v2.0 multi-shard AOF replay lands, Moon refuses this config at + /// startup; pass this flag to override (e.g. cache-only deployments + /// where the loss is acceptable). See + /// `docs/runbooks/multi-shard-aof-rewrite.md`. + #[arg(long, default_value_t = false)] + pub unsafe_multishard_aof: bool, + /// AOF fsync policy (always/everysec/no) #[arg(long, default_value = "everysec")] pub appendfsync: String, diff --git a/src/main.rs b/src/main.rs index 8bff34ac..0e71efc5 100644 --- a/src/main.rs +++ b/src/main.rs @@ -44,7 +44,7 @@ use std::path::PathBuf; use clap::Parser; use moon::config::ServerConfig; -use moon::persistence::aof::{self, AofMessage, FsyncPolicy}; +use moon::persistence::aof::{self, AofMessage, AofWriterPool, FsyncPolicy}; use moon::runtime::cancel::CancellationToken; use moon::runtime::channel; use moon::runtime::{RuntimeFactoryImpl, traits::RuntimeFactory}; @@ -270,6 +270,24 @@ fn main() -> anyhow::Result<()> { info!("Starting with {} shards", num_shards); + // P0-FIX-01b: refuse to start under the known durability bug + // (`shards >= 2 + appendonly yes` loses ~50 % of writes on SIGKILL, + // verified 2026-05-26 on HEAD `6e49050`; reproducer in + // `tmp/p0-no-rewrite.sh` and `tmp/p0-always.sh`). The bug is + // independent of `--appendfsync` and `--disk-offload` settings. An + // operator can override via `--unsafe-multishard-aof` if the + // deployment is cache-only and the loss window is acceptable. + if num_shards >= 2 && config.appendonly == "yes" && !config.unsafe_multishard_aof { + eprintln!( + "REFUSING TO START: --shards {num_shards} + --appendonly yes has a known data-loss \ + bug on SIGKILL (~50 % loss verified 2026-05-26). Fix: use --shards 1, or pass \ + --appendonly no for cache-only deployments, or pass --unsafe-multishard-aof to \ + acknowledge the risk and start anyway. See \ + docs/runbooks/multi-shard-aof-rewrite.md." + ); + std::process::exit(2); + } + // T1.1: warn when maxclients < 25 × shards (undersubscription footgun). // Suppressed by MOON_NO_UNDERSUBSCRIPTION_WARN=1. if let Some(msg) = should_warn_undersubscription(config.maxclients, num_shards) @@ -287,30 +305,115 @@ fn main() -> anyhow::Result<()> { // Collect connection senders for the listener before spawning shard threads let conn_txs: Vec<_> = (0..num_shards).map(|i| mesh.conn_tx(i)).collect(); - // Set up AOF channel: single writer, all shards send to it via mpsc::Sender clones. - // The AOF writer task will be spawned on the listener runtime. - let aof_tx: Option> = if config.appendonly == "yes" { - let (tx, rx) = channel::mpsc_bounded::(10_000); - let aof_token = cancel_token.child_token(); - let fsync = FsyncPolicy::from_str(&config.appendfsync); - let aof_file_path = PathBuf::from(&config.dir).join(&config.appendfilename); - // AOF writer task will be spawned on the listener runtime (see below) - // We store rx to spawn later since listener_rt hasn't been created yet. - // Instead, spawn on a dedicated thread so it's available before listener starts. - std::thread::Builder::new() - .name("aof-writer".to_string()) - .spawn(move || { - RuntimeFactoryImpl::block_on_local( - "aof-writer".to_string(), - aof::aof_writer_task(rx, aof_file_path, fsync, aof_token), + // Set up AOF writer channel(s) + `AofWriterPool` (step 2f-β: layout-aware). + // + // Logic: + // 1. If `appendonly == "yes"` and an existing on-disk manifest is found, + // verify its shard count matches `--shards` (RFC § 3 refusal — a + // mismatch silently maps shards to the wrong AOF files and is fatal). + // 2. If the manifest's layout is `PerShard` AND `num_shards >= 2`, + // spawn one writer per shard (`aof-writer-{N}` threads) and emit a + // `AofWriterPool::per_shard(senders)`. + // 3. Otherwise spawn the single legacy writer and emit + // `AofWriterPool::top_level(tx)`. This includes: + // - no manifest yet (fresh install — `initialize()` only writes + // TopLevel today; fresh-install PerShard creation lands later + // in the RFC sequence) + // - existing TopLevel manifest (legacy v1 or single-shard v2) + // - `num_shards == 1` (always TopLevel; per-shard fan-out has no + // meaning when there is one shard) + // + // A *corrupt* manifest is fatal — `AofManifest::load` returning `Err(_)` + // must NOT silently fall back to TopLevel, because the next write would + // create a fresh manifest overwriting the reference to the real base RDB + // and lose data. This mirrors the replay block at L514–526. + // + // Note: nothing today constructs a `layout == PerShard` manifest on disk + // (initialize() hardcodes TopLevel, migrate_top_level_to_per_shard is not + // yet wired into boot). The PerShard branch is reachable only by a + // hand-crafted manifest until step 9 lifts the multi-shard gate. Runtime + // behavior under default configurations stays byte-identical to step 2f-α. + use moon::persistence::aof_manifest::{AofLayout, AofManifest}; + let existing_manifest: Option = if config.appendonly == "yes" { + let base_dir = PathBuf::from(&config.dir); + match AofManifest::load(&base_dir) { + Ok(opt) => opt, + Err(e) => { + eprintln!( + "REFUSING TO START: AOF manifest at {}/appendonlydir/ is corrupt: {}. \ + Inspect manually before deleting; overwriting silently loses data.", + base_dir.display(), + e ); - }) - .expect("failed to spawn AOF writer thread"); - info!( - "AOF enabled with fsync policy: {:?}", - FsyncPolicy::from_str(&config.appendfsync) - ); - Some(tx) + std::process::exit(2); + } + } + } else { + None + }; + if let Some(ref m) = existing_manifest + && let Err(e) = m.verify_shard_count(num_shards as u16) + { + eprintln!("REFUSING TO START: {e}"); + std::process::exit(2); + } + + let aof_pool: Option> = if config.appendonly == "yes" { + let fsync = FsyncPolicy::from_str(&config.appendfsync); + let use_per_shard = matches!( + existing_manifest.as_ref().map(|m| m.layout), + Some(AofLayout::PerShard) + ) && num_shards >= 2; + + if use_per_shard { + let base_dir = PathBuf::from(&config.dir); + let mut senders = Vec::with_capacity(num_shards); + for sid in 0..num_shards { + let (tx, rx) = channel::mpsc_bounded::(10_000); + let aof_token = cancel_token.child_token(); + let base_dir = base_dir.clone(); + let thread_name = format!("aof-writer-{sid}"); + let thread_name_inner = thread_name.clone(); + std::thread::Builder::new() + .name(thread_name) + .spawn(move || { + RuntimeFactoryImpl::block_on_local( + thread_name_inner, + aof::per_shard_aof_writer_task( + rx, + base_dir, + sid as u16, + fsync, + aof_token, + ), + ); + }) + .expect("failed to spawn per-shard AOF writer thread"); + senders.push(tx); + } + info!( + "AOF enabled (PerShard, {} writers, fsync: {:?})", + num_shards, fsync + ); + Some(AofWriterPool::per_shard(senders)) + } else { + let (tx, rx) = channel::mpsc_bounded::(10_000); + let aof_token = cancel_token.child_token(); + let aof_file_path = PathBuf::from(&config.dir).join(&config.appendfilename); + // Legacy single-writer thread. Each shard clones the outer + // `aof_pool` Arc; sender lifetime is governed by the pool's Drop. + std::thread::Builder::new() + .name("aof-writer".to_string()) + .spawn(move || { + RuntimeFactoryImpl::block_on_local( + "aof-writer".to_string(), + aof::aof_writer_task(rx, aof_file_path, fsync, aof_token), + ); + }) + .expect("failed to spawn AOF writer thread"); + info!("AOF enabled (TopLevel, fsync: {:?})", fsync); + Some(AofWriterPool::top_level(tx)) + } } else { None }; @@ -318,6 +421,23 @@ fn main() -> anyhow::Result<()> { // Compute bind address for SO_REUSEPORT per-shard listeners (Linux io_uring path). let bind_addr = format!("{}:{}", config.bind, config.port); + // P0-FIX-01: gate BGREWRITEAOF under the known data-loss config combo + // (multi-shard + disk-offload enabled + appendonly). Verified 2026-05-26: + // the rewrite truncates non-rewriter shards' WALs and the consolidated + // multi-part AOF base RDB is not consumed on restart, losing ~38 % of + // keys. v2.0 multi-part AOF replay lifts this; until then we refuse the + // command at dispatch time. See docs/runbooks/multi-shard-aof-rewrite.md. + if num_shards >= 2 && config.disk_offload_enabled() && config.appendonly == "yes" { + moon::command::persistence::MULTI_SHARD_AOF_REWRITE_UNSAFE + .store(true, std::sync::atomic::Ordering::Relaxed); + tracing::warn!( + shards = num_shards, + disk_offload = %config.disk_offload, + appendonly = %config.appendonly, + "BGREWRITEAOF gated for this config (known data-loss path; see docs/runbooks/multi-shard-aof-rewrite.md). Use --shards 1 or --disk-offload disable to re-enable rewrite." + ); + } + // Create watch channel for snapshot triggers (auto-save and BGSAVE) let (snapshot_trigger_tx, snapshot_trigger_rx) = moon::runtime::channel::watch(0u64); @@ -644,7 +764,7 @@ fn main() -> anyhow::Result<()> { } let conn_rx = mesh.take_conn_rx(id); let shard_cancel = cancel_token.clone(); - let shard_aof_tx = aof_tx.clone(); + let shard_aof_pool = aof_pool.clone(); let shard_bind_addr = bind_addr.clone(); let shard_persistence_dir = persistence_dir.clone(); let shard_snap_rx = snapshot_trigger_rx.clone(); @@ -676,7 +796,7 @@ fn main() -> anyhow::Result<()> { consumers, producers, shard_cancel, - shard_aof_tx, + shard_aof_pool, // Only pass bind_addr for per-shard SO_REUSEPORT when tokio // with io_uring is active. monoio uses central listener MPSC. #[cfg(feature = "runtime-tokio")] @@ -862,9 +982,11 @@ fn main() -> anyhow::Result<()> { }); } - // After listener exits, send AOF shutdown and cancel all shards - if let Some(ref tx) = aof_tx { - let _ = tx.send(AofMessage::Shutdown); + // After listener exits, send AOF shutdown to every writer and cancel all shards. + // Under TopLevel this is one send; under PerShard (step 2f-β) this fans out to + // every per-shard writer thread via `broadcast_shutdown`. + if let Some(ref pool) = aof_pool { + pool.broadcast_shutdown(); } cancel_token.cancel(); for handle in shard_handles { diff --git a/src/persistence/aof.rs b/src/persistence/aof.rs index a11bb7c5..e97c2910 100644 --- a/src/persistence/aof.rs +++ b/src/persistence/aof.rs @@ -57,8 +57,23 @@ impl FsyncPolicy { /// Messages sent to the AOF writer task via mpsc channel. pub enum AofMessage { - /// Append serialized RESP command bytes to the AOF file. - Append(Bytes), + /// Append serialized RESP command bytes to the AOF file, tagged with the + /// LSN that was issued for this write (`ReplicationState::issue_lsn`). + /// + /// `lsn` semantics by writer task: + /// - **TopLevel** (`aof_writer_task`): `lsn` is **ignored**; the legacy + /// v1 disk format is plain RESP bytes with no per-entry framing. + /// - **PerShard** (`per_shard_aof_writer_task`): `lsn` is **written** as + /// a u64 header per RFC § 2 Rule 1. Disk format per entry: + /// `[u64 lsn LE][u32 len LE][RESP bytes of length len]`. + /// Recovery reads `(lsn, cmd)` pairs and merges cross-shard + /// `OrderedAcrossShards` writes by LSN (RFC § 2 Rule 2). + /// + /// Construction sites that issue a real LSN call + /// `ReplicationState::issue_lsn(shard_id, bytes.len() as u64)` and pass + /// the returned value. Sites with no replication state available pass 0 + /// (TopLevel ignores it; PerShard treats 0 as "no ordering hint"). + Append { lsn: u64, bytes: Bytes }, /// Trigger a full AOF rewrite (compaction) using current database state. Rewrite(SharedDatabases), /// Trigger AOF rewrite in sharded mode (all shards' databases). @@ -67,6 +82,298 @@ pub enum AofMessage { Shutdown, } +/// Reasons a pool send may be refused without queueing. +#[derive(Debug, Clone, PartialEq, Eq)] +pub enum AofPoolSendError { + /// `Rewrite`/`RewriteSharded` sent to a `PerShard` pool. BGREWRITEAOF must + /// be issued per shard in the per-shard layout; the legacy single-writer + /// rewrite path is not applicable. + RewriteUnsupportedInPerShard, + /// Underlying channel send failed (writer task dead or channel full). + SendFailed, +} + +/// Bundle of per-shard AOF writer senders. +/// +/// The pool keeps the call-site API uniform regardless of layout: +/// - **TopLevel** (legacy v1, single-shard, also used for `--shards 1` v2): +/// exactly one writer thread; every `sender(shard_id)` returns the same +/// sender so all shards multiplex onto one file. +/// - **PerShard** (v2 multi-shard): one writer per shard; `sender(shard_id)` +/// returns the writer that owns `appendonlydir/shard-{shard_id}/`. +/// +/// Step 2a is additive — this type is defined here but no call site is wired +/// to it yet. Step 2c performs the type plumbing in `conn_state` and +/// `conn/core`; steps 2d/2e/2f update the call sites and spawn paths. +#[derive(Clone)] +pub struct AofWriterPool { + senders: Vec>, + layout: crate::persistence::aof_manifest::AofLayout, +} + +impl AofWriterPool { + /// Build a TopLevel pool from a single existing writer sender. Used for + /// legacy v1 deployments and `--shards 1` v2 deployments where one writer + /// thread services every shard. + pub fn top_level(sender: channel::MpscSender) -> Arc { + Arc::new(Self { + senders: vec![sender], + layout: crate::persistence::aof_manifest::AofLayout::TopLevel, + }) + } + + /// Build a PerShard pool from N senders. `senders[i]` MUST be the writer + /// task that owns `appendonlydir/shard-{i}/`. The vector's length is the + /// shard count; passing a length-1 vector here is a bug — use + /// [`AofWriterPool::top_level`] instead. + pub fn per_shard(senders: Vec>) -> Arc { + debug_assert!( + senders.len() >= 2, + "per_shard pool needs >=2 writers; use top_level for single-writer" + ); + Arc::new(Self { + senders, + layout: crate::persistence::aof_manifest::AofLayout::PerShard, + }) + } + + /// Return the writer sender that owns the given shard's AOF file. + /// + /// For TopLevel pools, `shard_id` is ignored — all shards multiplex onto + /// the single sender. For PerShard pools, `shard_id` MUST be in range + /// `[0, num_writers())`; an out-of-range id is a programmer error and + /// panics in debug builds. + #[inline] + pub fn sender(&self, shard_id: usize) -> &channel::MpscSender { + use crate::persistence::aof_manifest::AofLayout; + match self.layout { + AofLayout::TopLevel => &self.senders[0], + AofLayout::PerShard => { + debug_assert!( + shard_id < self.senders.len(), + "shard_id {} out of range for per-shard pool of size {}", + shard_id, + self.senders.len() + ); + &self.senders[shard_id] + } + } + } + + /// Fire-and-forget append for the given shard, tagged with the LSN that + /// was issued for this write (see [`AofMessage::Append`] docs for LSN + /// semantics per layout). Call sites must source `lsn` from + /// `ReplicationState::issue_lsn(shard_id, bytes.len() as u64)` for writes + /// that participate in replication ordering; sites without a + /// replication-state handle pass 0. + #[inline] + pub fn try_send_append(&self, shard_id: usize, lsn: u64, bytes: Bytes) { + let _ = self + .sender(shard_id) + .try_send(AofMessage::Append { lsn, bytes }); + } + + /// Issue an LSN for an AOF append at every call site that has the + /// `Option>>` shape. Wraps + /// `ReplicationState::issue_lsn` so handler call sites collapse to a + /// single line. + /// + /// Returns 0 when: + /// - `repl_state` is None (test fixtures or shutdown paths) + /// - the `RwLock` is poisoned (shouldn't happen in production — + /// ReplicationState is only `write()`-locked under known-safe paths) + /// + /// 0 is a sentinel meaning "no replication ordering for this write". + /// TopLevel writers ignore the LSN entirely so 0 is harmless there; + /// PerShard writers treat 0 the same as any other LSN (per-shard order + /// is preserved by write order, not by LSN value). The LSN only matters + /// for the cross-shard `OrderedAcrossShards` merge in RFC step 5. + #[inline] + pub fn issue_append_lsn( + repl_state: &Option>>, + shard_id: usize, + delta: usize, + ) -> u64 { + repl_state + .as_ref() + .and_then(|rs| { + rs.read() + .ok() + .map(|g| g.issue_lsn(shard_id, delta as u64)) + }) + .unwrap_or(0) + } + + /// Submit a Rewrite/RewriteSharded message. Only legal for TopLevel pools; + /// PerShard rewrites are per-shard operations and must be initiated by + /// the BGREWRITEAOF code path in step 6, not via this enum variant. + pub fn try_send_rewrite(&self, msg: AofMessage) -> Result<(), AofPoolSendError> { + use crate::persistence::aof_manifest::AofLayout; + debug_assert!( + matches!(msg, AofMessage::Rewrite(_) | AofMessage::RewriteSharded(_)), + "try_send_rewrite called with a non-Rewrite variant", + ); + if self.layout == AofLayout::PerShard { + return Err(AofPoolSendError::RewriteUnsupportedInPerShard); + } + self.senders[0] + .try_send(msg) + .map_err(|_| AofPoolSendError::SendFailed) + } + + /// Broadcast `Shutdown` to every writer. Used by orchestrated shutdown + /// paths in `main.rs`/`embedded.rs`. Each writer drains its channel and + /// fsyncs before exiting. + pub fn broadcast_shutdown(&self) { + for s in &self.senders { + let _ = s.try_send(AofMessage::Shutdown); + } + } + + /// Number of underlying writer senders. 1 for TopLevel, num_shards for + /// PerShard. + #[inline] + pub fn num_writers(&self) -> usize { + self.senders.len() + } + + /// Reports the pool's layout. Useful for places that need to refuse + /// PerShard-incompatible legacy code paths with a clear error. + #[inline] + pub fn layout(&self) -> crate::persistence::aof_manifest::AofLayout { + self.layout + } +} + +#[cfg(test)] +mod pool_tests { + use super::*; + use crate::persistence::aof_manifest::AofLayout; + use crate::runtime::channel; + + #[test] + fn top_level_pool_routes_all_shards_to_writer_zero() { + let (tx, rx) = channel::mpsc_bounded::(8); + let pool = AofWriterPool::top_level(tx); + assert_eq!(pool.num_writers(), 1); + assert_eq!(pool.layout(), AofLayout::TopLevel); + + pool.try_send_append(0, 0, Bytes::from_static(b"a")); + pool.try_send_append(7, 0, Bytes::from_static(b"b")); + pool.try_send_append(42, 0, Bytes::from_static(b"c")); + + let mut seen = 0; + while rx.try_recv().is_ok() { + seen += 1; + } + assert_eq!(seen, 3, "all 3 appends should land on writer 0"); + } + + #[test] + fn per_shard_pool_routes_each_shard_to_its_own_writer() { + let (tx0, rx0) = channel::mpsc_bounded::(8); + let (tx1, rx1) = channel::mpsc_bounded::(8); + let (tx2, rx2) = channel::mpsc_bounded::(8); + let pool = AofWriterPool::per_shard(vec![tx0, tx1, tx2]); + assert_eq!(pool.num_writers(), 3); + assert_eq!(pool.layout(), AofLayout::PerShard); + + pool.try_send_append(0, 100, Bytes::from_static(b"shard0")); + pool.try_send_append(1, 200, Bytes::from_static(b"shard1a")); + pool.try_send_append(1, 300, Bytes::from_static(b"shard1b")); + pool.try_send_append(2, 400, Bytes::from_static(b"shard2")); + + let count = |rx: &channel::MpscReceiver| -> usize { + let mut n = 0; + while rx.try_recv().is_ok() { + n += 1; + } + n + }; + assert_eq!(count(&rx0), 1, "shard 0 writer should receive exactly 1"); + assert_eq!(count(&rx1), 2, "shard 1 writer should receive exactly 2"); + assert_eq!(count(&rx2), 1, "shard 2 writer should receive exactly 1"); + } + + #[test] + fn per_shard_pool_rejects_rewrite_with_explicit_error() { + let (tx0, _rx0) = channel::mpsc_bounded::(8); + let (tx1, _rx1) = channel::mpsc_bounded::(8); + let pool = AofWriterPool::per_shard(vec![tx0, tx1]); + + let dummies: SharedDatabases = Arc::new(vec![]); + let err = pool.try_send_rewrite(AofMessage::Rewrite(dummies)).unwrap_err(); + assert_eq!(err, AofPoolSendError::RewriteUnsupportedInPerShard); + } + + #[test] + fn top_level_pool_accepts_rewrite() { + let (tx, rx) = channel::mpsc_bounded::(8); + let pool = AofWriterPool::top_level(tx); + + let dummies: SharedDatabases = Arc::new(vec![]); + pool.try_send_rewrite(AofMessage::Rewrite(dummies)).unwrap(); + assert!(matches!(rx.try_recv(), Ok(AofMessage::Rewrite(_)))); + } + + #[test] + fn per_shard_pool_threads_lsn_field_to_each_writer() { + // Step 3 wire-format contract: try_send_append carries the issued LSN + // through to the writer task, which writes it as the per-entry header + // under PerShard layout. This unit test pins the channel-side contract + // (the disk-side framing is covered by writer-task integration). + let (tx0, rx0) = channel::mpsc_bounded::(4); + let (tx1, rx1) = channel::mpsc_bounded::(4); + let pool = AofWriterPool::per_shard(vec![tx0, tx1]); + + pool.try_send_append(0, 42, Bytes::from_static(b"set foo 1")); + pool.try_send_append(1, 43, Bytes::from_static(b"set bar 2")); + pool.try_send_append(0, 44, Bytes::from_static(b"del foo")); + + // Shard 0 should see (42, "set foo 1") then (44, "del foo"). + match rx0.try_recv() { + Ok(AofMessage::Append { lsn, bytes }) => { + assert_eq!(lsn, 42, "shard 0 first entry lsn"); + assert_eq!(bytes.as_ref(), b"set foo 1"); + } + other => panic!("shard 0 first recv expected Append, got {:?}", other.is_ok()), + } + match rx0.try_recv() { + Ok(AofMessage::Append { lsn, bytes }) => { + assert_eq!(lsn, 44, "shard 0 second entry lsn"); + assert_eq!(bytes.as_ref(), b"del foo"); + } + other => panic!("shard 0 second recv expected Append, got {:?}", other.is_ok()), + } + // Shard 1 should see (43, "set bar 2") only. + match rx1.try_recv() { + Ok(AofMessage::Append { lsn, bytes }) => { + assert_eq!(lsn, 43, "shard 1 entry lsn"); + assert_eq!(bytes.as_ref(), b"set bar 2"); + } + other => panic!("shard 1 recv expected Append, got {:?}", other.is_ok()), + } + } + + #[test] + fn broadcast_shutdown_reaches_every_writer() { + let (tx0, rx0) = channel::mpsc_bounded::(2); + let (tx1, rx1) = channel::mpsc_bounded::(2); + let (tx2, rx2) = channel::mpsc_bounded::(2); + let pool = AofWriterPool::per_shard(vec![tx0, tx1, tx2]); + + pool.broadcast_shutdown(); + + for (i, rx) in [&rx0, &rx1, &rx2].iter().enumerate() { + assert!( + matches!(rx.try_recv(), Ok(AofMessage::Shutdown)), + "writer {} did not receive Shutdown", + i + ); + } + } +} + /// Serialize a Frame into RESP wire format bytes. pub fn serialize_command(frame: &Frame) -> Bytes { let mut buf = BytesMut::with_capacity(64); @@ -194,7 +501,10 @@ pub async fn aof_writer_task( loop { match rx.recv() { - Ok(AofMessage::Append(data)) => { + // TopLevel writer: legacy v1 disk format is plain RESP. The + // LSN is ignored — TopLevel is single-shard so per-shard merge + // by LSN is moot. + Ok(AofMessage::Append { bytes: data, lsn: _ }) => { if write_error { continue; // Drop appends after persistent I/O failure } @@ -287,7 +597,8 @@ pub async fn aof_writer_task( tokio::select! { msg = rx.recv_async() => { match msg { - Ok(AofMessage::Append(data)) => { + // TopLevel writer (tokio): legacy v1 plain RESP, lsn ignored. + Ok(AofMessage::Append { bytes: data, lsn: _ }) => { if let Err(e) = writer.write_all(&data).await { error!("AOF write error: {}", e); continue; @@ -369,6 +680,376 @@ pub async fn aof_writer_task( } } +/// Background per-shard AOF writer task (Option B step 2b). +/// +/// One instance is spawned per shard in `PerShard` layout. Each instance owns +/// `appendonlydir/shard-{shard_id}/moon.aof.{seq}.incr.aof` exclusively — no +/// other writer touches that file, so there is no per-file locking. +/// +/// Differences from [`aof_writer_task`] (TopLevel): +/// - Opens `manifest.shard_incr_path(shard_id)` instead of `manifest.incr_path()`. +/// - `Rewrite`/`RewriteSharded` variants are rejected (logged + dropped). +/// The legacy single-writer rewrite enum has no meaning when each shard +/// owns its own files; per-shard BGREWRITEAOF lands in RFC step 6. +/// - Refuses to start if the loaded manifest's layout is `TopLevel` — the +/// spawn site (step 2f) must only invoke this task body for `PerShard` +/// layouts. Mismatch is a programmer error. +/// +/// Wait/timeout/corruption semantics for manifest loading match the existing +/// `aof_writer_task` (60s bounded wait, hard fail on corrupt manifest). +pub async fn per_shard_aof_writer_task( + rx: channel::MpscReceiver, + base_dir: PathBuf, + shard_id: u16, + fsync: FsyncPolicy, + cancel: CancellationToken, +) { + #[cfg(feature = "runtime-tokio")] + { + use crate::persistence::aof_manifest::{AofLayout, AofManifest}; + use tokio::io::AsyncWriteExt; + + // Wait for main.rs recovery to create/load the manifest. + let manifest_wait_start = Instant::now(); + const MANIFEST_TIMEOUT: std::time::Duration = std::time::Duration::from_secs(60); + let manifest = loop { + if cancel.is_cancelled() { + info!( + "AOF writer shard {}: cancelled while waiting for manifest", + shard_id + ); + return; + } + if manifest_wait_start.elapsed() > MANIFEST_TIMEOUT { + error!( + "AOF writer shard {}: manifest not found at {} after {:?}. Writer exiting.", + shard_id, + base_dir.display(), + MANIFEST_TIMEOUT, + ); + return; + } + match AofManifest::load(&base_dir) { + Ok(Some(m)) => break m, + Ok(None) => { + tokio::time::sleep(std::time::Duration::from_millis(50)).await; + } + Err(e) => { + error!( + "AOF writer shard {}: manifest corrupt at {}: {}. Persistence disabled.", + shard_id, + base_dir.display(), + e + ); + return; + } + } + }; + + if manifest.layout != AofLayout::PerShard { + error!( + "AOF writer shard {}: layout is {:?}, expected PerShard. \ + per_shard_aof_writer_task should only be spawned for PerShard layouts. \ + Writer exiting.", + shard_id, manifest.layout + ); + return; + } + if (shard_id as usize) >= manifest.shards.len() { + error!( + "AOF writer shard {}: out of range for manifest with {} shards. Writer exiting.", + shard_id, + manifest.shards.len() + ); + return; + } + + let incr_path = manifest.shard_incr_path(shard_id); + // Ensure shard-{N}/ exists. The manifest constructor for PerShard + // already creates these, but be defensive — a manual deletion or + // a manifest written by an older binary could leave them missing. + if let Some(parent) = incr_path.parent() { + if let Err(e) = tokio::fs::create_dir_all(parent).await { + error!( + "AOF writer shard {}: failed to create dir {}: {}", + shard_id, + parent.display(), + e + ); + return; + } + } + let file: tokio::fs::File = match tokio::fs::OpenOptions::new() + .create(true) + .append(true) + .open(&incr_path) + .await + { + Ok(f) => f, + Err(e) => { + error!( + "AOF writer shard {}: failed to open incr {}: {}", + shard_id, + incr_path.display(), + e + ); + return; + } + }; + info!( + "AOF writer shard {}: seq {}, incr={}", + shard_id, + manifest.seq, + incr_path.display() + ); + + let mut writer = tokio::io::BufWriter::new(file); + let mut last_fsync = Instant::now(); + let mut interval = tokio::time::interval(std::time::Duration::from_secs(1)); + interval.tick().await; + + loop { + tokio::select! { + msg = rx.recv_async() => { + match msg { + // PerShard writer (tokio): per RFC § 2 Rule 1 the on-disk + // format is `[u64 lsn LE][u32 len LE][RESP bytes]`. Header + // is written sequentially with the body — both calls land + // in the same BufWriter so this is one syscall under load. + Ok(AofMessage::Append { lsn, bytes: data }) => { + let mut header = [0u8; 12]; + header[..8].copy_from_slice(&lsn.to_le_bytes()); + header[8..].copy_from_slice(&(data.len() as u32).to_le_bytes()); + if let Err(e) = writer.write_all(&header).await { + error!("AOF header write error shard {}: {}", shard_id, e); + continue; + } + if let Err(e) = writer.write_all(&data).await { + error!("AOF write error shard {}: {}", shard_id, e); + continue; + } + if matches!(fsync, FsyncPolicy::Always) { + let _ = writer.flush().await; + let _ = writer.get_ref().sync_data().await; + } + } + Ok(AofMessage::Rewrite(_)) | Ok(AofMessage::RewriteSharded(_)) => { + warn!( + "AOF writer shard {}: received Rewrite/RewriteSharded — \ + not supported in PerShard layout, dropped. \ + Per-shard BGREWRITEAOF lands in RFC step 6.", + shard_id + ); + } + Ok(AofMessage::Shutdown) | Err(_) => { + let _ = writer.flush().await; + let _ = writer.get_ref().sync_data().await; + info!("AOF writer shard {} shutting down", shard_id); + break; + } + } + } + _ = interval.tick(), if fsync == FsyncPolicy::EverySec => { + if last_fsync.elapsed() >= std::time::Duration::from_secs(1) { + let _ = writer.flush().await; + let _ = writer.get_ref().sync_data().await; + last_fsync = Instant::now(); + } + } + _ = cancel.cancelled() => { + let _ = writer.flush().await; + let _ = writer.get_ref().sync_data().await; + info!("AOF writer shard {} cancelled", shard_id); + break; + } + } + } + } + + #[cfg(feature = "runtime-monoio")] + { + use crate::persistence::aof_manifest::{AofLayout, AofManifest}; + use std::io::Write; + + let manifest_wait_start = Instant::now(); + const MANIFEST_TIMEOUT: std::time::Duration = std::time::Duration::from_secs(60); + let manifest = loop { + if cancel.is_cancelled() { + info!( + "AOF writer shard {}: cancelled while waiting for manifest", + shard_id + ); + return; + } + if manifest_wait_start.elapsed() > MANIFEST_TIMEOUT { + error!( + "AOF writer shard {}: manifest not found at {} after {:?}. Writer exiting.", + shard_id, + base_dir.display(), + MANIFEST_TIMEOUT, + ); + return; + } + match AofManifest::load(&base_dir) { + Ok(Some(m)) => break m, + Ok(None) => { + std::thread::sleep(std::time::Duration::from_millis(50)); + } + Err(e) => { + error!( + "AOF writer shard {}: manifest corrupt at {}: {}. Persistence disabled.", + shard_id, + base_dir.display(), + e + ); + return; + } + } + }; + + if manifest.layout != AofLayout::PerShard { + error!( + "AOF writer shard {}: layout is {:?}, expected PerShard. Writer exiting.", + shard_id, manifest.layout + ); + return; + } + if (shard_id as usize) >= manifest.shards.len() { + error!( + "AOF writer shard {}: out of range for manifest with {} shards. Writer exiting.", + shard_id, + manifest.shards.len() + ); + return; + } + + let incr_path = manifest.shard_incr_path(shard_id); + if let Some(parent) = incr_path.parent() { + if let Err(e) = std::fs::create_dir_all(parent) { + error!( + "AOF writer shard {}: failed to create dir {}: {}", + shard_id, + parent.display(), + e + ); + return; + } + } + let mut file = match std::fs::OpenOptions::new() + .create(true) + .append(true) + .open(&incr_path) + { + Ok(f) => f, + Err(e) => { + error!( + "AOF writer shard {}: failed to open incr {}: {}", + shard_id, + incr_path.display(), + e + ); + return; + } + }; + info!( + "AOF writer shard {}: seq {}, incr={}", + shard_id, + manifest.seq, + incr_path.display() + ); + + let mut last_fsync = Instant::now(); + let mut write_error = false; + + loop { + match rx.recv() { + // PerShard writer (monoio): framed `[u64 lsn LE][u32 len LE][RESP]`. + // See the tokio twin above for format rationale. + Ok(AofMessage::Append { lsn, bytes: data }) => { + if write_error { + continue; + } + let mut header = [0u8; 12]; + header[..8].copy_from_slice(&lsn.to_le_bytes()); + header[8..].copy_from_slice(&(data.len() as u32).to_le_bytes()); + if let Err(e) = file.write_all(&header) { + error!( + "AOF header write failed shard {} (seq {}): {}. Persistence degraded.", + shard_id, manifest.seq, e + ); + write_error = true; + continue; + } + if let Err(e) = file.write_all(&data) { + error!( + "AOF write failed shard {} (seq {}): {}. Persistence degraded.", + shard_id, manifest.seq, e + ); + write_error = true; + continue; + } + match fsync { + FsyncPolicy::Always => { + let t = Instant::now(); + if let Err(e) = file.flush().and_then(|_| file.sync_data()) { + error!( + "AOF sync failed shard {} (seq {}, always): {}", + shard_id, manifest.seq, e + ); + write_error = true; + } else { + crate::admin::metrics_setup::record_aof_fsync( + t.elapsed().as_micros() as u64, + ); + } + } + FsyncPolicy::EverySec => { + if last_fsync.elapsed() >= std::time::Duration::from_secs(1) { + let t = Instant::now(); + if let Err(e) = file.flush().and_then(|_| file.sync_data()) { + error!( + "AOF sync failed shard {} (seq {}, everysec): {}", + shard_id, manifest.seq, e + ); + } else { + crate::admin::metrics_setup::record_aof_fsync( + t.elapsed().as_micros() as u64, + ); + last_fsync = Instant::now(); + } + } + } + FsyncPolicy::No => {} + } + } + Ok(AofMessage::Rewrite(_)) | Ok(AofMessage::RewriteSharded(_)) => { + warn!( + "AOF writer shard {}: received Rewrite/RewriteSharded — \ + not supported in PerShard layout, dropped. \ + Per-shard BGREWRITEAOF lands in RFC step 6.", + shard_id + ); + } + Ok(AofMessage::Shutdown) | Err(_) => { + if !write_error { + if let Err(e) = file.flush().and_then(|_| file.sync_data()) { + error!( + "AOF final sync failed shard {} (seq {}): {}", + shard_id, manifest.seq, e + ); + } + } + info!( + "AOF writer shard {} shutting down (monoio, seq {})", + shard_id, manifest.seq + ); + break; + } + } + } + } +} + /// Replay an AOF file by parsing RESP commands and dispatching them. /// /// Returns the number of commands successfully replayed. @@ -789,7 +1470,9 @@ fn drain_pending_appends( let mut outcome = DrainOutcome::default(); while let Ok(msg) = rx.try_recv() { match msg { - AofMessage::Append(data) => { + // BGREWRITEAOF drain runs on the TopLevel writer (monoio) only; + // PerShard rewrite is RFC step 6. Legacy v1 disk format → ignore lsn. + AofMessage::Append { bytes: data, lsn: _ } => { file.write_all(&data).map_err(|e| AofError::Io { path: PathBuf::from(""), source: e, diff --git a/src/persistence/aof_manifest.rs b/src/persistence/aof_manifest.rs index ce1efb3b..c6109d87 100644 --- a/src/persistence/aof_manifest.rs +++ b/src/persistence/aof_manifest.rs @@ -5,17 +5,34 @@ //! manifest framing is the canonical on-disk marker; the human-readable //! "v1" umbrella also covers WAL v3 and RDB v2 sub-formats. //! -//! Implements the same directory-based AOF format as Redis 7+: +//! Two on-disk layouts coexist (selected at manifest creation time, never mixed +//! within one directory): +//! +//! **TopLevel (manifest v1, single-shard / legacy):** //! ```text //! appendonlydir/ //! moon.aof.1.base.rdb # RDB snapshot base //! moon.aof.1.incr.aof # Incremental RESP since base -//! moon.aof.manifest # This file +//! moon.aof.manifest # v1 text format +//! ``` +//! +//! **PerShard (manifest v2, multi-shard durability):** +//! ```text +//! appendonlydir/ +//! moon.aof.manifest # v2 text format (carries shard count + max_lsn) +//! shard-0/ +//! moon.aof.1.base.rdb +//! moon.aof.1.incr.aof +//! shard-1/ +//! moon.aof.1.base.rdb +//! moon.aof.1.incr.aof +//! … //! ``` //! -//! The manifest is a simple text file listing the active base and incremental -//! files with their sequence numbers. On BGREWRITEAOF, the sequence increments, -//! a new base + incr pair is created, and old files are deleted. +//! The manifest text format is line-prefix based. v1 manifests have no +//! `version` line; v2 manifests begin with `version 2`. On BGREWRITEAOF the +//! sequence increments, a new base + incr pair is created per shard (PerShard) +//! or at top level (TopLevel), and old files are deleted. use std::io::Write; use std::path::{Path, PathBuf}; @@ -25,13 +42,44 @@ use tracing::{error, info, warn}; const MANIFEST_NAME: &str = "moon.aof.manifest"; const AOF_DIR_NAME: &str = "appendonlydir"; +/// On-disk layout discriminator. +/// +/// `TopLevel` is the legacy single-shard layout from manifest v1. `PerShard` +/// is the multi-shard layout introduced with manifest v2 — used whenever +/// `num_shards >= 2`. A `--shards 1` deployment with an existing v1 manifest +/// stays TopLevel until explicitly migrated. +#[derive(Debug, Clone, Copy, PartialEq, Eq)] +pub enum AofLayout { + /// Legacy single-shard layout: `appendonlydir/moon.aof.{seq}.{base|incr}.*`. + TopLevel, + /// Per-shard layout: `appendonlydir/shard-{N}/moon.aof.{seq}.{base|incr}.*`. + PerShard, +} + +/// Per-shard manifest entry. One per shard in `PerShard` layout. +#[derive(Debug, Clone, PartialEq, Eq)] +pub struct ShardManifest { + /// Shard ID (0..num_shards). + pub shard_id: u16, + /// Max LSN persisted to this shard's incr file so far. Semantics defined + /// by step 3 (LSN tagging) of the per-shard AOF RFC — until then this is + /// 0 and recovery does not use it. Once step 3 ships, recovery seeds + /// `master_repl_offset = max(shards[*].max_lsn)` before accepting writes. + pub max_lsn: u64, +} + /// Active AOF file set tracked by the manifest. #[derive(Debug, Clone)] pub struct AofManifest { /// Base directory (parent of `appendonlydir/`) pub dir: PathBuf, - /// Current sequence number (incremented on each rewrite) + /// Current sequence number (incremented on each rewrite). pub seq: u64, + /// On-disk layout. Determines path computation for base/incr files. + pub layout: AofLayout, + /// Per-shard metadata. Length is 1 for `TopLevel`, `num_shards` for + /// `PerShard`. Indexed by `shard_id`. + pub shards: Vec, } impl AofManifest { @@ -46,25 +94,76 @@ impl AofManifest { } /// Path to the base RDB file for the current sequence. + /// + /// Layout-aware: TopLevel returns `appendonlydir/moon.aof.{seq}.base.rdb`; + /// PerShard routes to `appendonlydir/shard-0/moon.aof.{seq}.base.rdb`. + /// This single-file helper is meaningful only when there is one shard + /// (post-migration `--shards 1`); a multi-shard PerShard manifest has N + /// base files and the caller must use [`Self::shard_base_path`] instead. + /// In debug builds, calling this on a multi-shard PerShard manifest + /// asserts; in release it returns the shard-0 path so production stays + /// recoverable rather than panicking on a stale call site. pub fn base_path(&self) -> PathBuf { - self.aof_dir() - .join(format!("moon.aof.{}.base.rdb", self.seq)) + match self.layout { + AofLayout::TopLevel => self + .aof_dir() + .join(format!("moon.aof.{}.base.rdb", self.seq)), + AofLayout::PerShard => { + debug_assert!( + self.shards.len() == 1, + "base_path() called on multi-shard PerShard manifest; use shard_base_path(shard_id)", + ); + self.shard_base_path_seq(0, self.seq) + } + } } /// Path to the incremental RESP file for the current sequence. + /// + /// Layout-aware — see [`Self::base_path`] for the same routing rules. pub fn incr_path(&self) -> PathBuf { - self.aof_dir() - .join(format!("moon.aof.{}.incr.aof", self.seq)) + match self.layout { + AofLayout::TopLevel => self + .aof_dir() + .join(format!("moon.aof.{}.incr.aof", self.seq)), + AofLayout::PerShard => { + debug_assert!( + self.shards.len() == 1, + "incr_path() called on multi-shard PerShard manifest; use shard_incr_path(shard_id)", + ); + self.shard_incr_path_seq(0, self.seq) + } + } } - /// Path to the base RDB file for a given sequence. + /// Path to the base RDB file for a given sequence. Layout-aware — see + /// [`Self::base_path`]. pub fn base_path_seq(&self, seq: u64) -> PathBuf { - self.aof_dir().join(format!("moon.aof.{}.base.rdb", seq)) + match self.layout { + AofLayout::TopLevel => self.aof_dir().join(format!("moon.aof.{}.base.rdb", seq)), + AofLayout::PerShard => { + debug_assert!( + self.shards.len() == 1, + "base_path_seq() called on multi-shard PerShard manifest; use shard_base_path_seq(shard_id, seq)", + ); + self.shard_base_path_seq(0, seq) + } + } } - /// Path to the incremental RESP file for a given sequence. + /// Path to the incremental RESP file for a given sequence. Layout-aware — + /// see [`Self::base_path`]. pub fn incr_path_seq(&self, seq: u64) -> PathBuf { - self.aof_dir().join(format!("moon.aof.{}.incr.aof", seq)) + match self.layout { + AofLayout::TopLevel => self.aof_dir().join(format!("moon.aof.{}.incr.aof", seq)), + AofLayout::PerShard => { + debug_assert!( + self.shards.len() == 1, + "incr_path_seq() called on multi-shard PerShard manifest; use shard_incr_path_seq(shard_id, seq)", + ); + self.shard_incr_path_seq(0, seq) + } + } } /// Create the `appendonlydir/` and write the initial manifest. @@ -82,6 +181,11 @@ impl AofManifest { let manifest = Self { dir: dir.to_path_buf(), seq: 1, + layout: AofLayout::TopLevel, + shards: vec![ShardManifest { + shard_id: 0, + max_lsn: 0, + }], }; std::fs::create_dir_all(manifest.aof_dir())?; @@ -119,6 +223,11 @@ impl AofManifest { let manifest = Self { dir: dir.to_path_buf(), seq: 1, + layout: AofLayout::TopLevel, + shards: vec![ShardManifest { + shard_id: 0, + max_lsn: 0, + }], }; std::fs::create_dir_all(manifest.aof_dir())?; @@ -157,6 +266,53 @@ impl AofManifest { let content = std::fs::read_to_string(&manifest_path)?; + // Detect format version. v1 manifests have no `version` line and use + // line prefixes `seq`/`base`/`incr`. v2 manifests start with `version 2` + // and carry per-shard records. + let mut format_version: u8 = 1; + for line in content.lines() { + let line = line.trim(); + if let Some(val) = line.strip_prefix("version ") { + if let Ok(v) = val.parse::() { + format_version = v; + } + break; + } + if !line.is_empty() { + // First non-blank line is not a version header → v1. + break; + } + } + + let manifest = match format_version { + 1 => Self::parse_v1(&content, dir, &manifest_path)?, + 2 => Self::parse_v2(&content, dir, &manifest_path)?, + other => { + return Err(std::io::Error::new( + std::io::ErrorKind::InvalidData, + format!( + "AOF manifest at {} has unsupported format version {} (max supported: 2)", + manifest_path.display(), + other, + ), + )); + } + }; + + // Best-effort orphan cleanup: delete stray base/incr files from aborted + // rewrites. A crash between advance() steps 1-3 leaves a new base RDB on + // disk that the active manifest never references. Without this sweep, + // repeated crashes during rewrite can fill the disk with zombie files. + // + // Safe to call here: parse_* verified the manifest has all required + // records, so cleanup_orphans won't delete the active files. + manifest.cleanup_orphans(); + + Ok(Some(manifest)) + } + + /// Parse a v1 (TopLevel, single-shard) manifest. + fn parse_v1(content: &str, dir: &Path, manifest_path: &Path) -> std::io::Result { let mut seq = 0u64; let mut has_base_record = false; let mut has_incr_record = false; @@ -183,10 +339,6 @@ impl AofManifest { )); } - // A valid manifest must have all three records (seq, base, incr). - // A truncated manifest with only "seq N" but no base/incr lines could - // trigger orphan cleanup that deletes the real base RDB referenced by - // the previous valid manifest. Require all records before proceeding. if !has_base_record || !has_incr_record { return Err(std::io::Error::new( std::io::ErrorKind::InvalidData, @@ -200,21 +352,182 @@ impl AofManifest { )); } - let manifest = Self { + Ok(Self { dir: dir.to_path_buf(), seq, - }; + layout: AofLayout::TopLevel, + shards: vec![ShardManifest { + shard_id: 0, + max_lsn: 0, + }], + }) + } - // Best-effort orphan cleanup: delete stray base/incr files from aborted - // rewrites. A crash between advance() steps 1-3 leaves a new base RDB on - // disk that the active manifest never references. Without this sweep, - // repeated crashes during rewrite can fill the disk with zombie files. - // - // Safe to call here: we verified the manifest has all three records - // (seq, base, incr), so cleanup_orphans won't delete the active files. - manifest.cleanup_orphans(); + /// Parse a v2 (PerShard, multi-shard) manifest. + /// + /// Expected line format: + /// ```text + /// version 2 + /// seq N + /// shards K + /// shard 0 max_lsn LSN0 + /// shard 1 max_lsn LSN1 + /// ... + /// ``` + /// + /// Per-shard `base`/`incr` paths are derived from `shard-{N}/moon.aof.{seq}.*` + /// rather than stored explicitly — the layout is canonical, so storing + /// paths invites drift between the stored value and the computed one. + fn parse_v2(content: &str, dir: &Path, manifest_path: &Path) -> std::io::Result { + let mut seq = 0u64; + let mut num_shards: Option = None; + let mut shards: Vec = Vec::new(); - Ok(Some(manifest)) + for line in content.lines() { + let line = line.trim(); + if line.is_empty() || line.starts_with('#') { + continue; + } + if line == "version 2" { + continue; + } else if let Some(val) = line.strip_prefix("seq ") { + seq = val.parse::().map_err(|e| { + std::io::Error::new( + std::io::ErrorKind::InvalidData, + format!( + "AOF manifest at {} has invalid seq line `{}`: {}", + manifest_path.display(), + line, + e, + ), + ) + })?; + } else if let Some(val) = line.strip_prefix("shards ") { + num_shards = Some(val.parse::().map_err(|e| { + std::io::Error::new( + std::io::ErrorKind::InvalidData, + format!( + "AOF manifest at {} has invalid shards line `{}`: {}", + manifest_path.display(), + line, + e, + ), + ) + })?); + } else if let Some(rest) = line.strip_prefix("shard ") { + // Format: `shard max_lsn ` + let mut it = rest.split_whitespace(); + let id_str = it.next().ok_or_else(|| { + std::io::Error::new( + std::io::ErrorKind::InvalidData, + format!( + "AOF manifest at {} has shard line missing id: `{}`", + manifest_path.display(), + line, + ), + ) + })?; + let id: u16 = id_str.parse().map_err(|e| { + std::io::Error::new( + std::io::ErrorKind::InvalidData, + format!( + "AOF manifest at {} has shard line invalid id `{}`: {}", + manifest_path.display(), + id_str, + e, + ), + ) + })?; + // Expect `max_lsn `. + let label = it.next().unwrap_or(""); + let val_str = it.next().unwrap_or("0"); + if label != "max_lsn" { + return Err(std::io::Error::new( + std::io::ErrorKind::InvalidData, + format!( + "AOF manifest at {} shard {} expected `max_lsn`, got `{}`", + manifest_path.display(), + id, + label, + ), + )); + } + let max_lsn: u64 = val_str.parse().map_err(|e| { + std::io::Error::new( + std::io::ErrorKind::InvalidData, + format!( + "AOF manifest at {} shard {} invalid max_lsn `{}`: {}", + manifest_path.display(), + id, + val_str, + e, + ), + ) + })?; + shards.push(ShardManifest { + shard_id: id, + max_lsn, + }); + } + // Unknown lines are tolerated (forward-compat). Strict parsers can + // be added at v3 if needed. + } + + if seq == 0 { + return Err(std::io::Error::new( + std::io::ErrorKind::InvalidData, + format!( + "AOF manifest at {} has no valid sequence number", + manifest_path.display() + ), + )); + } + + let expected = num_shards.ok_or_else(|| { + std::io::Error::new( + std::io::ErrorKind::InvalidData, + format!( + "AOF manifest at {} is missing required `shards N` line", + manifest_path.display() + ), + ) + })?; + + if shards.len() != expected as usize { + return Err(std::io::Error::new( + std::io::ErrorKind::InvalidData, + format!( + "AOF manifest at {} declares shards={} but has {} shard records", + manifest_path.display(), + expected, + shards.len(), + ), + )); + } + + // Sort by shard_id and verify contiguous range [0, expected). + shards.sort_by_key(|s| s.shard_id); + for (i, s) in shards.iter().enumerate() { + if s.shard_id as usize != i { + return Err(std::io::Error::new( + std::io::ErrorKind::InvalidData, + format!( + "AOF manifest at {} has non-contiguous shard ids (expected {} at position {}, got {})", + manifest_path.display(), + i, + i, + s.shard_id, + ), + )); + } + } + + Ok(Self { + dir: dir.to_path_buf(), + seq, + layout: AofLayout::PerShard, + shards, + }) } /// Delete any base/incr files in `appendonlydir/` that do not match the @@ -258,14 +571,33 @@ impl AofManifest { } /// Write the manifest file atomically (write tmp + rename). + /// + /// Emits v1 format for `TopLevel` and v2 for `PerShard`. The format is + /// selected by `self.layout`, never by callers — preserving the invariant + /// that one directory holds one layout. pub fn write_manifest(&self) -> std::io::Result<()> { let manifest_path = self.manifest_path(); let tmp_path = manifest_path.with_extension("tmp"); - let content = format!( - "seq {}\nbase moon.aof.{}.base.rdb\nincr moon.aof.{}.incr.aof\n", - self.seq, self.seq, self.seq - ); + let content = match self.layout { + AofLayout::TopLevel => format!( + "seq {}\nbase moon.aof.{}.base.rdb\nincr moon.aof.{}.incr.aof\n", + self.seq, self.seq, self.seq + ), + AofLayout::PerShard => { + let mut s = String::with_capacity(64 + self.shards.len() * 40); + s.push_str("version 2\n"); + s.push_str(&format!("seq {}\n", self.seq)); + s.push_str(&format!("shards {}\n", self.shards.len())); + for shard in &self.shards { + s.push_str(&format!( + "shard {} max_lsn {}\n", + shard.shard_id, shard.max_lsn + )); + } + s + } + }; let mut f = std::fs::File::create(&tmp_path)?; f.write_all(content.as_bytes())?; @@ -274,6 +606,291 @@ impl AofManifest { Ok(()) } + // ------------------------------------------------------------------ + // Per-shard layout helpers + // ------------------------------------------------------------------ + + /// Directory holding a shard's AOF files. + /// + /// - `TopLevel`: `appendonlydir/` (the shard_id argument is asserted to be 0). + /// - `PerShard`: `appendonlydir/shard-{shard_id}/`. + pub fn shard_dir(&self, shard_id: u16) -> PathBuf { + match self.layout { + AofLayout::TopLevel => { + debug_assert_eq!(shard_id, 0, "TopLevel layout only has shard 0"); + self.aof_dir() + } + AofLayout::PerShard => self.aof_dir().join(format!("shard-{}", shard_id)), + } + } + + /// Path to a shard's base RDB file for the current sequence. + pub fn shard_base_path(&self, shard_id: u16) -> PathBuf { + self.shard_dir(shard_id) + .join(format!("moon.aof.{}.base.rdb", self.seq)) + } + + /// Path to a shard's incremental RESP file for the current sequence. + pub fn shard_incr_path(&self, shard_id: u16) -> PathBuf { + self.shard_dir(shard_id) + .join(format!("moon.aof.{}.incr.aof", self.seq)) + } + + /// Path to a shard's base RDB file for a given sequence. + pub fn shard_base_path_seq(&self, shard_id: u16, seq: u64) -> PathBuf { + self.shard_dir(shard_id) + .join(format!("moon.aof.{}.base.rdb", seq)) + } + + /// Path to a shard's incremental RESP file for a given sequence. + pub fn shard_incr_path_seq(&self, shard_id: u16, seq: u64) -> PathBuf { + self.shard_dir(shard_id) + .join(format!("moon.aof.{}.incr.aof", seq)) + } + + /// Maximum LSN persisted across all shards. + /// + /// Computed (not stored) so the stored value can never drift from + /// the per-shard records. Returns 0 if `shards` is empty (defensive; + /// constructors guarantee at least one shard). + pub fn global_max_lsn(&self) -> u64 { + self.shards.iter().map(|s| s.max_lsn).max().unwrap_or(0) + } + + /// Verify that the manifest matches the runtime shard count. + /// + /// Returns the verbatim error from RFC § 3 if the shard count differs, + /// for operator-facing consistency. Callers (typically `main.rs` boot) + /// should treat this as fatal: continuing with a mismatched shard count + /// silently drops data from shards that no longer exist or replays a + /// shard's data into the wrong DashTable. + pub fn verify_shard_count(&self, expected: u16) -> Result<(), String> { + let actual = self.shards.len() as u16; + if actual != expected { + return Err(format!( + "ERR shard count changed (manifest={}, config={}); refusing to start to avoid data loss. See docs/runbooks/shard-count-change.md", + actual, expected + )); + } + Ok(()) + } + + /// Returns true if the on-disk layout under `appendonlydir/` matches the + /// legacy TopLevel format (files at top level, no `shard-N/` subdirs). + /// + /// Used by callers to detect when a v1 single-shard deployment is being + /// upgraded to v2 multi-shard and needs explicit migration. Does NOT + /// migrate — separate from `migrate_top_level_to_per_shard` so the side + /// effect is opt-in, not hidden in a load path. + pub fn is_legacy_top_level_layout(dir: &Path) -> bool { + let aof_dir = dir.join(AOF_DIR_NAME); + if !aof_dir.exists() { + return false; + } + let entries = match std::fs::read_dir(&aof_dir) { + Ok(e) => e, + Err(_) => return false, + }; + for entry in entries.flatten() { + let name = entry.file_name(); + let Some(name_str) = name.to_str() else { + continue; + }; + if name_str.starts_with("moon.aof.") + && (name_str.ends_with(".base.rdb") || name_str.ends_with(".incr.aof")) + { + return true; + } + } + false + } + + /// Migrate a single-shard TopLevel layout in place to a single-shard + /// PerShard layout. + /// + /// Moves `appendonlydir/moon.aof.{seq}.{base.rdb,incr.aof}` into + /// `appendonlydir/shard-0/`, then rewrites the manifest as v2 with + /// `shards 1`. Idempotent: a second call on an already-PerShard manifest + /// returns Ok with no filesystem changes. + /// + /// This is the RFC § 5 case 1 migration — zero data movement (rename only), + /// safe to run on first boot after upgrading from v0.1.x. Multi-shard + /// migrations from legacy AOF (case 2) use the `moon migrate-aof` + /// subcommand and are NOT handled here. + pub fn migrate_top_level_to_per_shard(&mut self) -> std::io::Result<()> { + if self.layout == AofLayout::PerShard { + return Ok(()); + } + if self.shards.len() != 1 { + return Err(std::io::Error::new( + std::io::ErrorKind::InvalidInput, + format!( + "migrate_top_level_to_per_shard called with {} shards; \ + only single-shard TopLevel can be migrated in place", + self.shards.len() + ), + )); + } + + // Compute paths up front. shard_dir/shard_*_path_seq for a single- + // shard target are pure path computations and do NOT depend on + // self.layout, so it is safe to derive them while layout is still + // TopLevel. + let old_base = self.aof_dir().join(format!("moon.aof.{}.base.rdb", self.seq)); + let old_incr = self.aof_dir().join(format!("moon.aof.{}.incr.aof", self.seq)); + let new_dir = self.aof_dir().join("shard-0"); + let new_base = new_dir.join(format!("moon.aof.{}.base.rdb", self.seq)); + let new_incr = new_dir.join(format!("moon.aof.{}.incr.aof", self.seq)); + + if !old_base.exists() { + // Pre-flight check: nothing moved yet, no rollback needed. + return Err(std::io::Error::new( + std::io::ErrorKind::NotFound, + format!( + "TopLevel→PerShard migration: source base {} not found", + old_base.display() + ), + )); + } + std::fs::create_dir_all(&new_dir)?; + + // Move base. If this fails, no on-disk mutation happened yet — bail + // without rollback. Layout stays TopLevel until commit at the bottom. + std::fs::rename(&old_base, &new_base)?; + + // Base is now in shard-0/. Any subsequent error must restore it. + let moved_incr: bool; + let created_incr: bool; + if old_incr.exists() { + if let Err(e) = std::fs::rename(&old_incr, &new_incr) { + if let Err(re) = std::fs::rename(&new_base, &old_base) { + error!( + "Migration rollback: failed to restore base {} → {}: {}", + new_base.display(), + old_base.display(), + re + ); + } + return Err(e); + } + moved_incr = true; + created_incr = false; + } else { + match std::fs::File::create(&new_incr) { + Ok(_) => { + moved_incr = false; + created_incr = true; + } + Err(e) => { + if let Err(re) = std::fs::rename(&new_base, &old_base) { + error!( + "Migration rollback: failed to restore base {} → {}: {}", + new_base.display(), + old_base.display(), + re + ); + } + return Err(e); + } + } + } + + // Commit: flip layout, persist as v2. If write_manifest fails, undo + // every filesystem mutation and restore layout so the next boot still + // sees a valid v1 TopLevel deployment. + self.layout = AofLayout::PerShard; + if let Err(e) = self.write_manifest() { + self.layout = AofLayout::TopLevel; + if moved_incr { + if let Err(re) = std::fs::rename(&new_incr, &old_incr) { + error!( + "Migration rollback: failed to restore incr {} → {}: {}", + new_incr.display(), + old_incr.display(), + re + ); + } + } else if created_incr { + if let Err(re) = std::fs::remove_file(&new_incr) { + warn!( + "Migration rollback: failed to remove freshly created incr {}: {}", + new_incr.display(), + re + ); + } + } + if let Err(re) = std::fs::rename(&new_base, &old_base) { + error!( + "Migration rollback: failed to restore base {} → {}: {}. \ + Manifest dir {} may be in an inconsistent state.", + new_base.display(), + old_base.display(), + re, + self.dir.display() + ); + } + return Err(e); + } + + info!( + "AOF migrated: TopLevel → PerShard (single shard) at {}", + self.aof_dir().display() + ); + Ok(()) + } + + /// Create the `appendonlydir/` and write an initial v2 manifest for the + /// given shard count. + /// + /// Each shard gets its own `shard-{N}/` subdirectory with an empty base + /// RDB and an empty incr file. Mirrors `initialize()` semantics: the + /// `(base + incr)` invariant holds from the first boot, so recovery can + /// replay incr-only state without complaint. + pub fn initialize_multi(dir: &Path, num_shards: u16) -> std::io::Result { + if num_shards == 0 { + return Err(std::io::Error::new( + std::io::ErrorKind::InvalidInput, + "initialize_multi requires num_shards >= 1", + )); + } + let manifest = Self { + dir: dir.to_path_buf(), + seq: 1, + layout: AofLayout::PerShard, + shards: (0..num_shards) + .map(|id| ShardManifest { + shard_id: id, + max_lsn: 0, + }) + .collect(), + }; + std::fs::create_dir_all(manifest.aof_dir())?; + + // Per-shard empty RDB. Single Database::default() inside a 1-element + // slice matches `initialize()`'s empty-RDB shape for each shard. + let empty_dbs: [crate::storage::Database; 0] = []; + let empty_rdb = crate::persistence::rdb::save_to_bytes(&empty_dbs) + .map_err(|e| std::io::Error::other(format!("empty RDB serialize: {e}")))?; + + for shard_id in 0..num_shards { + let shard_dir = manifest.shard_dir(shard_id); + std::fs::create_dir_all(&shard_dir)?; + + let base_path = manifest.shard_base_path(shard_id); + let tmp_path = base_path.with_extension("rdb.tmp"); + { + let mut f = std::fs::File::create(&tmp_path)?; + f.write_all(&empty_rdb)?; + f.sync_data()?; + } + std::fs::rename(&tmp_path, &base_path)?; + std::fs::File::create(manifest.shard_incr_path(shard_id))?; + } + + manifest.write_manifest()?; + Ok(manifest) + } + /// Advance to the next sequence: write new base RDB, create new incr file, /// update manifest, delete old files. /// @@ -517,3 +1134,301 @@ fn replay_incr_resp( Ok(count) } + +#[cfg(test)] +mod tests_v2 { + //! Unit tests for the v2 (PerShard) manifest format. + //! + //! Covers the Step 1 deliverable of the per-shard AOF RFC: + //! - v1 manifests continue to load as TopLevel (single-shard, shard_id=0) + //! - v2 round-trip: write → load → equivalent struct shape + //! - shard count mismatch produces the verbatim RFC § 3 error + //! - migrate_top_level_to_per_shard performs in-place rename and rewrites + //! the manifest as v2 + //! - global_max_lsn computes max across shards + //! - is_legacy_top_level_layout detects top-level files + + use super::*; + use std::fs; + + fn temp_dir() -> PathBuf { + let d = std::env::temp_dir().join(format!( + "moon-aof-manifest-test-{}-{}", + std::process::id(), + std::time::SystemTime::now() + .duration_since(std::time::UNIX_EPOCH) + .map(|d| d.as_nanos()) + .unwrap_or(0) + )); + fs::create_dir_all(&d).expect("temp dir create"); + d + } + + #[test] + fn v1_manifest_loads_as_top_level_single_shard() { + let dir = temp_dir(); + let m = AofManifest::initialize(&dir).expect("initialize v1"); + + assert_eq!(m.layout, AofLayout::TopLevel); + assert_eq!(m.shards.len(), 1); + assert_eq!(m.shards[0].shard_id, 0); + assert_eq!(m.shards[0].max_lsn, 0); + + // Reload from disk + let reloaded = AofManifest::load(&dir).expect("load").expect("present"); + assert_eq!(reloaded.layout, AofLayout::TopLevel); + assert_eq!(reloaded.shards.len(), 1); + assert_eq!(reloaded.seq, m.seq); + + fs::remove_dir_all(&dir).ok(); + } + + #[test] + fn v2_manifest_round_trips() { + let dir = temp_dir(); + let m = AofManifest::initialize_multi(&dir, 4).expect("initialize_multi"); + + assert_eq!(m.layout, AofLayout::PerShard); + assert_eq!(m.shards.len(), 4); + for (i, s) in m.shards.iter().enumerate() { + assert_eq!(s.shard_id, i as u16); + assert_eq!(s.max_lsn, 0); + } + + // Per-shard subdirs were created with empty base + incr. + for i in 0..4u16 { + assert!(m.shard_dir(i).exists(), "shard-{} dir exists", i); + assert!(m.shard_base_path(i).exists(), "shard-{} base exists", i); + assert!(m.shard_incr_path(i).exists(), "shard-{} incr exists", i); + } + + let reloaded = AofManifest::load(&dir).expect("load").expect("present"); + assert_eq!(reloaded.layout, AofLayout::PerShard); + assert_eq!(reloaded.shards.len(), 4); + assert_eq!(reloaded.seq, m.seq); + for (i, s) in reloaded.shards.iter().enumerate() { + assert_eq!(s.shard_id, i as u16); + } + + fs::remove_dir_all(&dir).ok(); + } + + #[test] + fn verify_shard_count_emits_rfc_error_verbatim() { + let m = AofManifest { + dir: PathBuf::from("/tmp/nowhere"), + seq: 1, + layout: AofLayout::PerShard, + shards: vec![ + ShardManifest { shard_id: 0, max_lsn: 0 }, + ShardManifest { shard_id: 1, max_lsn: 0 }, + ], + }; + let err = m.verify_shard_count(4).expect_err("should mismatch"); + assert_eq!( + err, + "ERR shard count changed (manifest=2, config=4); refusing to start to avoid data loss. See docs/runbooks/shard-count-change.md" + ); + + // Matching count succeeds. + m.verify_shard_count(2).expect("match"); + } + + #[test] + fn migrate_top_level_to_per_shard_moves_files_and_rewrites_manifest() { + let dir = temp_dir(); + let mut m = AofManifest::initialize(&dir).expect("initialize v1"); + + // Write a marker into the incr file so we can prove the contents + // survive the rename. + let original_incr = m.aof_dir().join(format!("moon.aof.{}.incr.aof", m.seq)); + fs::write(&original_incr, b"MARKER").expect("write incr marker"); + + m.migrate_top_level_to_per_shard().expect("migrate"); + + assert_eq!(m.layout, AofLayout::PerShard); + assert!(!original_incr.exists(), "old incr removed by rename"); + let new_incr = m.shard_incr_path(0); + assert!(new_incr.exists(), "new shard-0 incr exists"); + let contents = fs::read(&new_incr).expect("read new incr"); + assert_eq!(contents, b"MARKER", "incr contents preserved"); + + // Reloaded manifest is v2. + let reloaded = AofManifest::load(&dir).expect("load").expect("present"); + assert_eq!(reloaded.layout, AofLayout::PerShard); + assert_eq!(reloaded.shards.len(), 1); + + // Idempotency: second call is a no-op. + m.migrate_top_level_to_per_shard().expect("idempotent"); + assert_eq!(m.layout, AofLayout::PerShard); + + fs::remove_dir_all(&dir).ok(); + } + + #[test] + fn global_max_lsn_returns_max_across_shards() { + let m = AofManifest { + dir: PathBuf::from("/tmp/nowhere"), + seq: 1, + layout: AofLayout::PerShard, + shards: vec![ + ShardManifest { shard_id: 0, max_lsn: 100 }, + ShardManifest { shard_id: 1, max_lsn: 500 }, + ShardManifest { shard_id: 2, max_lsn: 250 }, + ], + }; + assert_eq!(m.global_max_lsn(), 500); + } + + #[test] + fn is_legacy_top_level_layout_detects_v1_files() { + let dir = temp_dir(); + // No appendonlydir yet → false. + assert!(!AofManifest::is_legacy_top_level_layout(&dir)); + + // After v1 initialize, top-level files present → true. + let _m = AofManifest::initialize(&dir).expect("init v1"); + assert!(AofManifest::is_legacy_top_level_layout(&dir)); + + fs::remove_dir_all(&dir).ok(); + } + + #[test] + fn is_legacy_top_level_layout_returns_false_for_v2() { + let dir = temp_dir(); + let _m = AofManifest::initialize_multi(&dir, 2).expect("init v2"); + assert!( + !AofManifest::is_legacy_top_level_layout(&dir), + "v2 layout has no top-level moon.aof.* files" + ); + + fs::remove_dir_all(&dir).ok(); + } + + #[test] + fn parse_v2_rejects_shard_count_mismatch_in_file() { + let dir = temp_dir(); + let aof = dir.join(AOF_DIR_NAME); + fs::create_dir_all(&aof).unwrap(); + // Manifest claims shards 3 but only declares two shard records. + fs::write( + aof.join(MANIFEST_NAME), + "version 2\nseq 1\nshards 3\nshard 0 max_lsn 0\nshard 1 max_lsn 0\n", + ) + .unwrap(); + + let err = AofManifest::load(&dir).expect_err("should reject"); + let msg = err.to_string(); + assert!( + msg.contains("declares shards=3 but has 2 shard records"), + "got: {}", + msg + ); + + fs::remove_dir_all(&dir).ok(); + } + + #[test] + fn parse_v2_rejects_non_contiguous_shard_ids() { + let dir = temp_dir(); + let aof = dir.join(AOF_DIR_NAME); + fs::create_dir_all(&aof).unwrap(); + // shards=2 but ids are {0, 2} not {0, 1}. + fs::write( + aof.join(MANIFEST_NAME), + "version 2\nseq 1\nshards 2\nshard 0 max_lsn 0\nshard 2 max_lsn 0\n", + ) + .unwrap(); + + let err = AofManifest::load(&dir).expect_err("should reject"); + let msg = err.to_string(); + assert!(msg.contains("non-contiguous shard ids"), "got: {}", msg); + + fs::remove_dir_all(&dir).ok(); + } + + // ------------------------------------------------------------------ + // Reviewer-flagged fixes: layout-aware path helpers + migration + // rollback. See the "Verify findings against current code" review + // comment on aof_manifest.rs:669-775 and :688-717. + // ------------------------------------------------------------------ + + #[test] + fn base_incr_paths_route_to_shard_zero_after_migration() { + let dir = temp_dir(); + let mut m = AofManifest::initialize(&dir).expect("init v1"); + // Pre-migration: TopLevel paths under appendonlydir/ directly. + assert_eq!(m.base_path(), m.aof_dir().join("moon.aof.1.base.rdb")); + assert_eq!(m.incr_path(), m.aof_dir().join("moon.aof.1.incr.aof")); + + m.migrate_top_level_to_per_shard().expect("migrate"); + + // Post-migration: single-file helpers must route to shard-0/ so + // replay_multi_part and advance() find the actual files. This is + // the bug the reviewer flagged for aof_manifest.rs:669-775. + let shard0 = m.aof_dir().join("shard-0"); + assert_eq!(m.base_path(), shard0.join("moon.aof.1.base.rdb")); + assert_eq!(m.incr_path(), shard0.join("moon.aof.1.incr.aof")); + assert_eq!(m.base_path_seq(7), shard0.join("moon.aof.7.base.rdb")); + assert_eq!(m.incr_path_seq(7), shard0.join("moon.aof.7.incr.aof")); + // The path the helper returns must be where the file actually lives. + assert!(m.base_path().exists(), "base file at returned path"); + assert!(m.incr_path().exists(), "incr file at returned path"); + + fs::remove_dir_all(&dir).ok(); + } + + #[test] + fn migrate_rolls_back_filesystem_when_incr_rename_fails() { + // Simulate the rename(old_incr → new_incr) failure path by making + // the destination already exist as a directory (rename onto a + // non-empty directory is an error on every supported OS). + let dir = temp_dir(); + let mut m = AofManifest::initialize(&dir).expect("init v1"); + let original_base = m.aof_dir().join("moon.aof.1.base.rdb"); + let original_incr = m.aof_dir().join("moon.aof.1.incr.aof"); + fs::write(&original_incr, b"INCR_MARKER").expect("seed incr"); + let base_bytes_before = fs::read(&original_base).expect("read base"); + + // Pre-create shard-0/moon.aof.1.incr.aof as a DIRECTORY so the + // rename fails after the base rename has already succeeded. + let shard0 = m.aof_dir().join("shard-0"); + fs::create_dir_all(shard0.join("moon.aof.1.incr.aof")).expect("seed blocker"); + + let err = m + .migrate_top_level_to_per_shard() + .expect_err("incr rename should fail"); + let _ = err; // exact error kind depends on OS + + // Rollback invariants: + // 1. Layout stays TopLevel in memory. + // 2. base file restored to its original TopLevel path. + // 3. base file contents unchanged. + // 4. on-disk manifest is still v1 (load returns layout TopLevel). + assert_eq!(m.layout, AofLayout::TopLevel, "in-memory layout reverted"); + assert!(original_base.exists(), "base restored to top-level"); + let base_bytes_after = fs::read(&original_base).expect("read base"); + assert_eq!(base_bytes_after, base_bytes_before, "base contents intact"); + let reloaded = AofManifest::load(&dir).expect("load").expect("present"); + assert_eq!(reloaded.layout, AofLayout::TopLevel, "on-disk manifest v1"); + + fs::remove_dir_all(&dir).ok(); + } + + #[test] + fn migrate_does_not_mutate_on_missing_base() { + let dir = temp_dir(); + let mut m = AofManifest::initialize(&dir).expect("init v1"); + let base = m.aof_dir().join("moon.aof.1.base.rdb"); + fs::remove_file(&base).expect("remove base"); + + let err = m + .migrate_top_level_to_per_shard() + .expect_err("missing base should fail"); + assert_eq!(err.kind(), std::io::ErrorKind::NotFound); + // Layout never flipped, no rollback needed. + assert_eq!(m.layout, AofLayout::TopLevel); + + fs::remove_dir_all(&dir).ok(); + } +} diff --git a/src/replication/state.rs b/src/replication/state.rs index e5e357c0..1a596faa 100644 --- a/src/replication/state.rs +++ b/src/replication/state.rs @@ -113,10 +113,34 @@ impl ReplicationState { /// Increment the offset for the given shard by delta bytes. /// Also adds delta to master_repl_offset. pub fn increment_shard_offset(&self, shard_id: usize, delta: u64) { - if shard_id < self.shard_offsets.len() { - self.shard_offsets[shard_id].fetch_add(delta, Ordering::Relaxed); - self.master_repl_offset.fetch_add(delta, Ordering::Relaxed); + let _ = self.issue_lsn(shard_id, delta); + } + + /// Atomically issue an LSN for a write and advance per-shard + + /// master replication offsets by `delta`. + /// + /// Returns the LSN that uniquely identifies this write — equal to the + /// value of `master_repl_offset` BEFORE the increment, mirroring Redis's + /// `+ delta - delta` semantics. The same LSN MUST tag the corresponding + /// `AofMessage::Append` entry and the replication backlog entry for that + /// write so per-shard AOF replay can rebuild a globally consistent log + /// (per-shard AOF RFC § 2 Rule 3). + /// + /// Atomicity caveat: the per-shard offset advance and the master offset + /// advance are TWO separate `fetch_add`s, not one composite op. Concurrent + /// callers across shards observe a brief window where the master sum + /// disagrees with the sum of shard offsets. Acceptable today because the + /// only `total_offset()` consumer is INFO replication, which tolerates + /// transient skew. Do not promote to a hard invariant without redesign. + /// + /// Returns 0 if `shard_id` is out of range (defensive; production callers + /// must pass a valid id). + pub fn issue_lsn(&self, shard_id: usize, delta: u64) -> u64 { + if shard_id >= self.shard_offsets.len() { + return 0; } + self.shard_offsets[shard_id].fetch_add(delta, Ordering::Relaxed); + self.master_repl_offset.fetch_add(delta, Ordering::Relaxed) } /// Returns sum of all per-shard offsets. diff --git a/src/server/conn/blocking.rs b/src/server/conn/blocking.rs index 8846d0c8..7cf92727 100644 --- a/src/server/conn/blocking.rs +++ b/src/server/conn/blocking.rs @@ -1135,7 +1135,10 @@ pub(crate) fn try_inline_dispatch( shard_databases: &std::sync::Arc, shard_id: usize, selected_db: usize, - aof_tx: &Option>, + aof_pool: &Option>, + repl_state: &Option< + std::sync::Arc>, + >, now_ms: u64, num_shards: usize, can_inline_writes: bool, @@ -1346,8 +1349,20 @@ pub(crate) fn try_inline_dispatch( } // AOF: reuse the frozen RESP bytes directly (Arc clone, zero-copy). - if let Some(tx) = aof_tx { - let _ = tx.try_send(crate::persistence::aof::AofMessage::Append(frozen)); + // This path is monoio inline GET/SET — the writer for the local shard + // (shard_id) owns the AOF record; under PerShard layout that routes + // to shard_id's writer. LSN must be sourced from `repl_state` so the + // inline path's writes share an LSN namespace with the non-inline path + // — otherwise per-shard replay merge in RFC step 5 would see two + // disjoint LSN streams per shard. Cost: one extra read-lock acquire + // (uncontended) + one atomic fetch_add per inline SET. + if let Some(pool) = aof_pool { + let lsn = crate::persistence::aof::AofWriterPool::issue_append_lsn( + repl_state, + shard_id, + frozen.len(), + ); + pool.try_send_append(shard_id, lsn, frozen); } write_buf.extend_from_slice(b"+OK\r\n"); @@ -1363,7 +1378,10 @@ pub(crate) fn try_inline_dispatch_loop( shard_databases: &std::sync::Arc, shard_id: usize, selected_db: usize, - aof_tx: &Option>, + aof_pool: &Option>, + repl_state: &Option< + std::sync::Arc>, + >, now_ms: u64, num_shards: usize, can_inline_writes: bool, @@ -1377,7 +1395,8 @@ pub(crate) fn try_inline_dispatch_loop( shard_databases, shard_id, selected_db, - aof_tx, + aof_pool, + repl_state, now_ms, num_shards, can_inline_writes, diff --git a/src/server/conn/core.rs b/src/server/conn/core.rs index 43cef452..dc2e2365 100644 --- a/src/server/conn/core.rs +++ b/src/server/conn/core.rs @@ -20,7 +20,7 @@ use std::sync::Arc; use crate::acl::{AclLog, AclTable}; use crate::blocking::BlockingRegistry; use crate::config::{RuntimeConfig, ServerConfig}; -use crate::persistence::aof::AofMessage; +use crate::persistence::aof::AofWriterPool; use crate::protocol::Frame; use crate::pubsub::PubSubRegistry; use crate::runtime::channel; @@ -47,7 +47,13 @@ pub(crate) struct ConnectionContext { pub pubsub_registry: Arc>, pub blocking_registry: Rc>, pub requirepass: Option, - pub aof_tx: Option>, + /// AOF writer pool — the **sole AOF interface** after the 2d/2e migration + /// sequence. Built by spawn sites in `shard/conn_accept.rs` from the + /// on-disk manifest layout: TopLevel wraps a single shared writer, + /// PerShard owns one sender per shard. `try_send_append(shard_id, bytes)` + /// routes to the owning shard; `try_send_rewrite(msg)` rejects under + /// PerShard until per-shard rewrite ships (step 6 of the RFC). + pub aof_pool: Option>, pub tracking_table: Rc>, pub repl_state: Option>>, /// Lock-free mirror of `repl_state.role == Replica { .. }`. @@ -96,7 +102,7 @@ impl ConnectionContext { pubsub_registry: Arc>, blocking_registry: Rc>, requirepass: Option, - aof_tx: Option>, + aof_pool: Option>, tracking_table: Rc>, repl_state: Option>>, cluster_state: Option>>, @@ -136,7 +142,7 @@ impl ConnectionContext { pubsub_registry, blocking_registry, requirepass, - aof_tx, + aof_pool, tracking_table, repl_state, is_replica_mirror, diff --git a/src/server/conn/handler_monoio/dispatch.rs b/src/server/conn/handler_monoio/dispatch.rs index 7182cd68..b4e5485f 100644 --- a/src/server/conn/handler_monoio/dispatch.rs +++ b/src/server/conn/handler_monoio/dispatch.rs @@ -978,9 +978,9 @@ pub(super) fn try_handle_persistence( return true; } if cmd.eq_ignore_ascii_case(b"BGREWRITEAOF") { - if let Some(ref tx) = ctx.aof_tx { + if let Some(ref pool) = ctx.aof_pool { responses.push(crate::command::persistence::bgrewriteaof_start_sharded( - tx, + pool, ctx.shard_databases.clone(), )); } else { diff --git a/src/server/conn/handler_monoio/mod.rs b/src/server/conn/handler_monoio/mod.rs index 72eb1413..f2d8debb 100644 --- a/src/server/conn/handler_monoio/mod.rs +++ b/src/server/conn/handler_monoio/mod.rs @@ -20,7 +20,7 @@ use std::rc::Rc; use crate::command::metadata; use crate::command::{DispatchResult, dispatch, dispatch_read, is_dispatch_read_supported}; -use crate::persistence::aof::{self, AofMessage}; +use crate::persistence::aof; use crate::protocol::Frame; use crate::shard::dispatch::key_to_shard; use crate::shard::mesh::ChannelMesh; @@ -483,7 +483,8 @@ pub(crate) async fn handle_connection_sharded_monoio< &ctx.shard_databases, ctx.shard_id, conn.selected_db, - &ctx.aof_tx, + &ctx.aof_pool, + &ctx.repl_state, ctx.cached_clock.ms(), ctx.num_shards, can_inline_writes, @@ -1066,7 +1067,7 @@ pub(crate) async fn handle_connection_sharded_monoio< } // Pre-classify write commands for AOF + tracking - let is_write = if ctx.aof_tx.is_some() || conn.tracking_state.enabled { + let is_write = if ctx.aof_pool.is_some() || conn.tracking_state.enabled { metadata::is_write(cmd) } else { false @@ -1121,9 +1122,14 @@ pub(crate) async fn handle_connection_sharded_monoio< }; // AOF only on actual success (:1). Matches handler_single. if matches!(response, Frame::Integer(1)) { - if let Some(ref tx) = ctx.aof_tx { + if let Some(ref pool) = ctx.aof_pool { let serialized = aof::serialize_command(&frame); - let _ = tx.try_send(AofMessage::Append(serialized)); + let lsn = aof::AofWriterPool::issue_append_lsn( + &ctx.repl_state, + ctx.shard_id, + serialized.len(), + ); + pool.try_send_append(ctx.shard_id, lsn, serialized); } } responses.push(response); @@ -1186,9 +1192,14 @@ pub(crate) async fn handle_connection_sharded_monoio< // AOF only on actual success (:1). Matches handler_single // — `:0` (key absent / dst exists w/o REPLACE) is a no-op. if matches!(response, Frame::Integer(1)) { - if let Some(ref tx) = ctx.aof_tx { + if let Some(ref pool) = ctx.aof_pool { let serialized = aof::serialize_command(&frame); - let _ = tx.try_send(AofMessage::Append(serialized)); + let lsn = aof::AofWriterPool::issue_append_lsn( + &ctx.repl_state, + ctx.shard_id, + serialized.len(), + ); + pool.try_send_append(ctx.shard_id, lsn, serialized); } } responses.push(response); @@ -1535,9 +1546,14 @@ pub(crate) async fn handle_connection_sharded_monoio< // AOF logging for successful local writes if !matches!(response, Frame::Error(_)) && is_write { - if let Some(ref tx) = ctx.aof_tx { + if let Some(ref pool) = ctx.aof_pool { let serialized = aof::serialize_command(&frame); - let _ = tx.try_send(AofMessage::Append(serialized)); + let lsn = aof::AofWriterPool::issue_append_lsn( + &ctx.repl_state, + ctx.shard_id, + serialized.len(), + ); + pool.try_send_append(ctx.shard_id, lsn, serialized); } } @@ -1768,7 +1784,7 @@ pub(crate) async fn handle_connection_sharded_monoio< let resp_idx = responses.len(); responses.push(Frame::Null); // placeholder, filled after batch dispatch // Pre-compute AOF bytes before moving frame into Arc - let aof_bytes = if ctx.aof_tx.is_some() && metadata::is_write(cmd) { + let aof_bytes = if ctx.aof_pool.is_some() && metadata::is_write(cmd) { Some(aof::serialize_command(&dispatch_frame)) } else { None @@ -1837,7 +1853,11 @@ pub(crate) async fn handle_connection_sharded_monoio< if !remote_groups.is_empty() { reply_futures.clear(); + // Capture `target` per batch so the cross-shard AOF write at the bottom + // of the loop can route to the owning shard's pool (not ctx.shard_id — + // mirrors the load-bearing fix at handler_sharded/mod.rs:1651). let mut oneshot_futures: Vec<( + usize, // target shard — owner for AOF append Vec<(usize, Option, Bytes)>, channel::OneshotReceiver>, )> = Vec::new(); @@ -1881,7 +1901,7 @@ pub(crate) async fn handle_connection_sharded_monoio< } } } - oneshot_futures.push((meta, reply_rx)); + oneshot_futures.push((target, meta, reply_rx)); } // Poll all shard responses via pending_wakers relay (monoio cross-thread waker fix). @@ -1889,7 +1909,7 @@ pub(crate) async fn handle_connection_sharded_monoio< // Instead, the connection task registers its waker in pending_wakers; the event // loop drains and wakes them after every SPSC cycle (~1ms). On wake, try_recv() // checks if the response arrived; if not, re-register and yield again. - for (meta, reply_rx) in oneshot_futures.drain(..) { + for (target, meta, reply_rx) in oneshot_futures.drain(..) { tracing::trace!( "Shard {}: awaiting cross-shard response via pending_wakers", ctx.shard_id @@ -1931,11 +1951,24 @@ pub(crate) async fn handle_connection_sharded_monoio< }; for ((resp_idx, aof_bytes, cmd_name), resp) in meta.into_iter().zip(shard_responses) { - // AOF logging for successful remote writes + // AOF logging for successful remote writes. + // Owner shard is `target` (NOT ctx.shard_id) — under PerShard + // layout the write must land in the target shard's AOF file + // since that shard owns the mutated data. Mirrors the + // load-bearing fix at handler_sharded/mod.rs:1651. if let Some(bytes) = aof_bytes { if !matches!(resp, Frame::Error(_)) { - if let Some(ref tx) = ctx.aof_tx { - let _ = tx.try_send(AofMessage::Append(bytes)); + if let Some(ref pool) = ctx.aof_pool { + // Cross-shard write: LSN must be sourced + // using `target`'s shard_id so the + // per-shard offset increment lands on the + // shard that owns the mutated data. + let lsn = aof::AofWriterPool::issue_append_lsn( + &ctx.repl_state, + target, + bytes.len(), + ); + pool.try_send_append(target, lsn, bytes); } } } diff --git a/src/server/conn/handler_sharded/dispatch.rs b/src/server/conn/handler_sharded/dispatch.rs index e73b77f2..bcd4e98a 100644 --- a/src/server/conn/handler_sharded/dispatch.rs +++ b/src/server/conn/handler_sharded/dispatch.rs @@ -353,9 +353,9 @@ pub(super) fn try_handle_persistence( return true; } if cmd.eq_ignore_ascii_case(b"BGREWRITEAOF") { - if let Some(ref tx) = ctx.aof_tx { + if let Some(ref pool) = ctx.aof_pool { responses.push(crate::command::persistence::bgrewriteaof_start_sharded( - tx, + pool, ctx.shard_databases.clone(), )); } else { diff --git a/src/server/conn/handler_sharded/mod.rs b/src/server/conn/handler_sharded/mod.rs index 83256ca8..7f41d655 100644 --- a/src/server/conn/handler_sharded/mod.rs +++ b/src/server/conn/handler_sharded/mod.rs @@ -15,7 +15,7 @@ use std::collections::HashMap; use crate::command::connection as conn_cmd; use crate::command::metadata; use crate::command::{DispatchResult, dispatch, dispatch_read}; -use crate::persistence::aof::{self, AofMessage}; +use crate::persistence::aof; use crate::protocol::Frame; use crate::shard::dispatch::{ShardMessage, key_to_shard}; use crate::shard::mesh::ChannelMesh; @@ -1119,8 +1119,8 @@ pub(crate) async fn handle_connection_sharded_inner< } } - let is_write = if ctx.aof_tx.is_some() || conn.tracking_state.enabled { metadata::is_write(cmd) } else { false }; - let aof_bytes = if is_write && ctx.aof_tx.is_some() { Some(aof::serialize_command(&frame)) } else { None }; + let is_write = if ctx.aof_pool.is_some() || conn.tracking_state.enabled { metadata::is_write(cmd) } else { false }; + let aof_bytes = if is_write && ctx.aof_pool.is_some() { Some(aof::serialize_command(&frame)) } else { None }; if is_local { // LOCAL PATH: split into read/write to avoid exclusive lock on reads. @@ -1172,7 +1172,10 @@ pub(crate) async fn handle_connection_sharded_inner< // — `:0` (key absent) is a no-op and must not log. if matches!(response, Frame::Integer(1)) { if let Some(ref bytes) = aof_bytes { - if let Some(ref tx) = ctx.aof_tx { let _ = tx.try_send(AofMessage::Append(bytes.clone())); } + if let Some(ref pool) = ctx.aof_pool { + let lsn = aof::AofWriterPool::issue_append_lsn(&ctx.repl_state, ctx.shard_id, bytes.len()); + pool.try_send_append(ctx.shard_id, lsn, bytes.clone()); + } } } responses.push(response); @@ -1216,7 +1219,10 @@ pub(crate) async fn handle_connection_sharded_inner< // — `:0` (key absent / dst exists w/o REPLACE) is a no-op. if matches!(response, Frame::Integer(1)) { if let Some(ref bytes) = aof_bytes { - if let Some(ref tx) = ctx.aof_tx { let _ = tx.try_send(AofMessage::Append(bytes.clone())); } + if let Some(ref pool) = ctx.aof_pool { + let lsn = aof::AofWriterPool::issue_append_lsn(&ctx.repl_state, ctx.shard_id, bytes.len()); + pool.try_send_append(ctx.shard_id, lsn, bytes.clone()); + } } } responses.push(response); @@ -1427,7 +1433,10 @@ pub(crate) async fn handle_connection_sharded_inner< } if let Some(bytes) = aof_bytes { if !matches!(response, Frame::Error(_)) { - if let Some(ref tx) = ctx.aof_tx { let _ = tx.try_send(AofMessage::Append(bytes)); } + if let Some(ref pool) = ctx.aof_pool { + let lsn = aof::AofWriterPool::issue_append_lsn(&ctx.repl_state, ctx.shard_id, bytes.len()); + pool.try_send_append(ctx.shard_id, lsn, bytes); + } } } if conn.tracking_state.enabled && !matches!(response, Frame::Error(_)) { @@ -1646,9 +1655,20 @@ pub(crate) async fn handle_connection_sharded_inner< for (meta, target) in reply_futures { let shard_responses = response_pool.future_for(target).await; for ((resp_idx, aof_bytes, cmd_name), resp) in meta.into_iter().zip(shard_responses) { + // AOF logging for successful remote writes. + // Owner shard is `target` (NOT ctx.shard_id) — under PerShard + // layout the write must land in the target shard's AOF file + // since that shard owns the mutated data. This was the + // pre-existing routing bug that motivated the per-shard AOF + // RFC (Option B): under TopLevel a single writer absorbed + // every cross-shard append, masking the wrong-owner write. if let Some(bytes) = aof_bytes { if !matches!(resp, Frame::Error(_)) { - if let Some(ref tx) = ctx.aof_tx { let _ = tx.try_send(AofMessage::Append(bytes)); } + if let Some(ref pool) = ctx.aof_pool { + // Cross-shard: LSN sourced for `target`. + let lsn = aof::AofWriterPool::issue_append_lsn(&ctx.repl_state, target, bytes.len()); + pool.try_send_append(target, lsn, bytes); + } } } responses[resp_idx] = apply_resp3_conversion(&cmd_name, resp, proto_ver); diff --git a/src/server/conn/handler_single.rs b/src/server/conn/handler_single.rs index 4a626cc7..6aa4f3ad 100644 --- a/src/server/conn/handler_single.rs +++ b/src/server/conn/handler_single.rs @@ -42,7 +42,10 @@ use crate::server::codec::RespCodec; /// When `requirepass` is set, clients must authenticate via AUTH before any other /// commands are accepted (except QUIT). /// -/// When `aof_tx` is provided, write commands are logged to the AOF file. +/// When `aof_pool` is provided, write commands are logged via the per-shard +/// AOF writer pool. handler_single is single-shard mode by definition +/// (num_shards = 1, shard_id = 0), so the pool is always a TopLevel layout +/// wrapping a single writer sender — see `AofWriterPool::top_level`. /// When `change_counter` is provided, write commands increment the counter for auto-save. /// /// Supports Pub/Sub subscriber mode: when a client subscribes to channels/patterns, @@ -61,7 +64,7 @@ pub async fn handle_connection( shutdown: CancellationToken, requirepass: Option, config: Arc, - aof_tx: Option>, + aof_pool: Option>, change_counter: Option>, pubsub_registry: Arc>, runtime_config: Arc>, @@ -608,8 +611,8 @@ pub async fn handle_connection( } // BGREWRITEAOF if cmd.eq_ignore_ascii_case(b"BGREWRITEAOF") { - let response = if let Some(ref tx) = aof_tx { - crate::command::persistence::bgrewriteaof_start(tx, db.clone()) + let response = if let Some(ref pool) = aof_pool { + crate::command::persistence::bgrewriteaof_start(pool, db.clone()) } else { Frame::Error(Bytes::from_static(b"ERR AOF is not enabled")) }; @@ -655,7 +658,11 @@ pub async fn handle_connection( // WAL must be durable BEFORE the swap (no rollback // path for SWAPDB). Try-send first; on failure return // an error and leave both DBs untouched. - let wal_ok = if let Some(ref tx) = aof_tx { + // Drop down to the pool sender so we can still observe + // try_send's Result (the fire-and-forget + // pool.try_send_append loses the SendFailed signal we + // need to abort the swap cleanly). + let wal_ok = if let Some(ref pool) = aof_pool { let mut a_buf = itoa::Buffer::new(); let mut b_buf = itoa::Buffer::new(); let wal_frame = Frame::Array(crate::framevec![ @@ -671,12 +678,16 @@ pub async fn handle_connection( crate::persistence::aof::serialize_command( &wal_frame, ); - tx.try_send( - crate::persistence::aof::AofMessage::Append( - serialized, - ), - ) - .is_ok() + // Single-shard mode — shard_id = 0. + let lsn = crate::persistence::aof::AofWriterPool::issue_append_lsn(&repl_state, 0, serialized.len()); + pool.sender(0) + .try_send( + crate::persistence::aof::AofMessage::Append { + lsn, + bytes: serialized, + }, + ) + .is_ok() } else { true // persistence disabled — no durability requirement }; @@ -878,8 +889,16 @@ pub async fn handle_connection( } // Send AOF entries accumulated so far for bytes in aof_entries.drain(..) { - if let Some(ref tx) = aof_tx { - let _ = tx.send_async(AofMessage::Append(bytes)).await; + if let Some(ref pool) = aof_pool { + // Single-shard mode (shard_id = 0). send_async + // preserves back-pressure semantics from the + // pre-pool code; the pool's TopLevel layout + // routes to the same single writer. + let lsn = crate::persistence::aof::AofWriterPool::issue_append_lsn(&repl_state, 0, bytes.len()); + let _ = pool + .sender(0) + .send_async(AofMessage::Append { lsn, bytes }) + .await; } if let Some(ref counter) = change_counter { counter.fetch_add(1, Ordering::Relaxed); @@ -1510,8 +1529,14 @@ pub async fn handle_connection( (resp, records) }; for record in wal_records { - if let Some(ref tx) = aof_tx { - let _ = tx.send_async(AofMessage::Append(bytes::Bytes::from(record))).await; + if let Some(ref pool) = aof_pool { + // Single-shard mode (shard_id = 0). + let bytes = bytes::Bytes::from(record); + let lsn = crate::persistence::aof::AofWriterPool::issue_append_lsn(&repl_state, 0, bytes.len()); + let _ = pool + .sender(0) + .send_async(AofMessage::Append { lsn, bytes }) + .await; } if let Some(ref counter) = change_counter { counter.fetch_add(1, std::sync::atomic::Ordering::Relaxed); @@ -1528,7 +1553,7 @@ pub async fn handle_connection( let is_write = metadata::is_write(cmd); // Serialize for AOF before dispatch - let aof_bytes = if is_write && aof_tx.is_some() { + let aof_bytes = if is_write && aof_pool.is_some() { let mut buf = BytesMut::new(); crate::protocol::serialize::serialize(&frame, &mut buf); Some(buf.freeze()) @@ -2232,8 +2257,14 @@ pub async fn handle_connection( // --- Send AOF entries OUTSIDE the lock --- for bytes in aof_entries { - if let Some(ref tx) = aof_tx { - let _ = tx.send_async(AofMessage::Append(bytes)).await; + if let Some(ref pool) = aof_pool { + // Single-shard mode (shard_id = 0). send_async preserves + // back-pressure semantics from the pre-pool code. + let lsn = crate::persistence::aof::AofWriterPool::issue_append_lsn(&repl_state, 0, bytes.len()); + let _ = pool + .sender(0) + .send_async(AofMessage::Append { lsn, bytes }) + .await; } if let Some(ref counter) = change_counter { counter.fetch_add(1, Ordering::Relaxed); diff --git a/src/server/conn/tests.rs b/src/server/conn/tests.rs index 2f6c74e1..8a175eaf 100644 --- a/src/server/conn/tests.rs +++ b/src/server/conn/tests.rs @@ -31,7 +31,7 @@ fn test_inline_get_hit() { } let mut read_buf = BytesMut::from(&b"*2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n"[..]); let mut write_buf = BytesMut::new(); - let aof_tx: Option> = None; + let aof_pool: Option> = None; let rt_config = make_rt_config(); let result = try_inline_dispatch( @@ -40,7 +40,8 @@ fn test_inline_get_hit() { &dbs, 0, 0, - &aof_tx, + &aof_pool, + &None, 0, 1, false, @@ -56,7 +57,7 @@ fn test_inline_get_miss() { let dbs = make_dbs(); let mut read_buf = BytesMut::from(&b"*2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n"[..]); let mut write_buf = BytesMut::new(); - let aof_tx: Option> = None; + let aof_pool: Option> = None; let rt_config = make_rt_config(); let result = try_inline_dispatch( @@ -65,7 +66,8 @@ fn test_inline_get_miss() { &dbs, 0, 0, - &aof_tx, + &aof_pool, + &None, 0, 1, false, @@ -84,7 +86,7 @@ fn test_inline_set_falls_through_when_writes_disabled() { let mut read_buf = BytesMut::from(&cmd[..]); let original_len = read_buf.len(); let mut write_buf = BytesMut::new(); - let aof_tx: Option> = None; + let aof_pool: Option> = None; let rt_config = make_rt_config(); let result = try_inline_dispatch( @@ -93,7 +95,8 @@ fn test_inline_set_falls_through_when_writes_disabled() { &dbs, 0, 0, - &aof_tx, + &aof_pool, + &None, 0, 1, false, @@ -111,7 +114,7 @@ fn test_inline_set_executes_when_writes_enabled() { let cmd = b"*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n"; let mut read_buf = BytesMut::from(&cmd[..]); let mut write_buf = BytesMut::new(); - let aof_tx: Option> = None; + let aof_pool: Option> = None; let rt_config = make_rt_config(); let result = try_inline_dispatch( @@ -120,7 +123,8 @@ fn test_inline_set_executes_when_writes_enabled() { &dbs, 0, 0, - &aof_tx, + &aof_pool, + &None, 0, 1, true, @@ -150,7 +154,7 @@ fn test_inline_set_with_options_falls_through() { let mut read_buf = BytesMut::from(&cmd[..]); let original_len = read_buf.len(); let mut write_buf = BytesMut::new(); - let aof_tx: Option> = None; + let aof_pool: Option> = None; let rt_config = make_rt_config(); let result = try_inline_dispatch( @@ -159,7 +163,8 @@ fn test_inline_set_with_options_falls_through() { &dbs, 0, 0, - &aof_tx, + &aof_pool, + &None, 0, 1, true, @@ -176,7 +181,7 @@ fn test_inline_fallthrough() { let mut read_buf = BytesMut::from(&ping_cmd[..]); let original_len = read_buf.len(); let mut write_buf = BytesMut::new(); - let aof_tx: Option> = None; + let aof_pool: Option> = None; let rt_config = make_rt_config(); let result = try_inline_dispatch( @@ -185,7 +190,8 @@ fn test_inline_fallthrough() { &dbs, 0, 0, - &aof_tx, + &aof_pool, + &None, 0, 1, false, @@ -211,7 +217,7 @@ fn test_inline_mixed_batch() { read_buf.extend_from_slice(b"*2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n"); read_buf.extend_from_slice(b"*1\r\n$4\r\nPING\r\n"); let mut write_buf = BytesMut::new(); - let aof_tx: Option> = None; + let aof_pool: Option> = None; let rt_config = make_rt_config(); // Inline loop should process GET but leave PING @@ -221,7 +227,8 @@ fn test_inline_mixed_batch() { &dbs, 0, 0, - &aof_tx, + &aof_pool, + &None, 0, 1, false, @@ -244,7 +251,7 @@ fn test_inline_case_insensitive() { } let mut read_buf = BytesMut::from(&b"*2\r\n$3\r\nget\r\n$3\r\nfoo\r\n"[..]); let mut write_buf = BytesMut::new(); - let aof_tx: Option> = None; + let aof_pool: Option> = None; let rt_config = make_rt_config(); let result = try_inline_dispatch( @@ -253,7 +260,8 @@ fn test_inline_case_insensitive() { &dbs, 0, 0, - &aof_tx, + &aof_pool, + &None, 0, 1, false, @@ -271,7 +279,7 @@ fn test_inline_partial() { let mut read_buf = BytesMut::from(&b"*2\r\n$3\r\nGET\r\n$3\r\n"[..]); let original_len = read_buf.len(); let mut write_buf = BytesMut::new(); - let aof_tx: Option> = None; + let aof_pool: Option> = None; let rt_config = make_rt_config(); let result = try_inline_dispatch( @@ -280,7 +288,8 @@ fn test_inline_partial() { &dbs, 0, 0, - &aof_tx, + &aof_pool, + &None, 0, 1, false, @@ -296,7 +305,8 @@ fn test_inline_set_with_aof_falls_through_when_writes_disabled() { // SET falls through when can_inline_writes=false even with AOF. let dbs = make_dbs(); let (aof_sender, _aof_receiver) = channel::mpsc_bounded::(16); - let aof_tx: Option> = Some(aof_sender); + let aof_pool: Option> = + Some(crate::persistence::aof::AofWriterPool::top_level(aof_sender)); let cmd = b"*3\r\n$3\r\nSET\r\n$3\r\nfoo\r\n$3\r\nbar\r\n"; let mut read_buf = BytesMut::from(&cmd[..]); let original_len = read_buf.len(); @@ -310,7 +320,8 @@ fn test_inline_set_with_aof_falls_through_when_writes_disabled() { &dbs, 0, 0, - &aof_tx, + &aof_pool, + &None, 0, 1, false, @@ -342,7 +353,7 @@ fn test_inline_multiple_gets() { read_buf.extend_from_slice(b"*2\r\n$3\r\nGET\r\n$1\r\na\r\n"); read_buf.extend_from_slice(b"*2\r\n$3\r\nGET\r\n$1\r\nb\r\n"); let mut write_buf = BytesMut::new(); - let aof_tx: Option> = None; + let aof_pool: Option> = None; let rt_config = make_rt_config(); let total = try_inline_dispatch_loop( @@ -351,7 +362,8 @@ fn test_inline_multiple_gets() { &dbs, 0, 0, - &aof_tx, + &aof_pool, + &None, 0, 1, false, diff --git a/src/server/conn_state.rs b/src/server/conn_state.rs index 674857db..d2946f36 100644 --- a/src/server/conn_state.rs +++ b/src/server/conn_state.rs @@ -8,7 +8,7 @@ use ringbuf::HeapProd; use crate::blocking::BlockingRegistry; use crate::cluster::ClusterState; use crate::config::{RuntimeConfig, ServerConfig}; -use crate::persistence::aof::AofMessage; +use crate::persistence::aof::AofWriterPool; use crate::protocol::Frame; use crate::pubsub::PubSubRegistry; use crate::replication::state::ReplicationState; @@ -38,7 +38,12 @@ pub struct ConnectionContext { pub blocking_registry: Rc>, pub shutdown: CancellationToken, pub requirepass: Option, - pub aof_tx: Option>, + /// Per-shard AOF writer pool. **Sole AOF interface** after the + /// 2d/2e migration sequence. Built by spawn sites in `shard/conn_accept.rs` + /// from the manifest layout — TopLevel wraps a single sender, + /// PerShard owns one sender per shard. Step 2f flips spawn sites + /// to construct PerShard pools when the on-disk manifest demands it. + pub aof_pool: Option>, pub tracking_table: Rc>, pub repl_state: Option>>, pub cluster_state: Option>>, diff --git a/src/server/embedded.rs b/src/server/embedded.rs index 55a2b3fd..e1620fa3 100644 --- a/src/server/embedded.rs +++ b/src/server/embedded.rs @@ -48,7 +48,7 @@ use parking_lot::RwLock; use tracing::info; use crate::config::ServerConfig; -use crate::persistence::aof::{self, AofMessage, FsyncPolicy}; +use crate::persistence::aof::{self, AofMessage, AofWriterPool, FsyncPolicy}; use crate::runtime::cancel::CancellationToken; use crate::runtime::channel; use crate::runtime::{RuntimeFactoryImpl, traits::RuntimeFactory}; @@ -113,9 +113,12 @@ pub async fn run_embedded( // AOF writer: dedicated std::thread (matches main.rs lifetime model). // We retain the JoinHandle so shutdown can wait for the writer to finish // flushing — dropping it would race the process exit and risk losing the - // final fsync (CodeRabbit #1). - let (aof_tx, aof_join): ( - Option>, + // final fsync (CodeRabbit #1). Step 2f-α: wrap the sender in a TopLevel + // `AofWriterPool` so every shard receives an Arc clone with a uniform + // API. Channel close still drives writer termination — dropping the last + // Arc drops the pool, which drops the underlying senders. + let (aof_pool, aof_join): ( + Option>, Option>, ) = if config.appendonly == "yes" { let (tx, rx) = channel::mpsc_bounded::(10_000); @@ -132,7 +135,7 @@ pub async fn run_embedded( }) .context("embedded moon: failed to spawn AOF writer thread")?; info!("embedded moon: AOF enabled (fsync: {:?})", fsync); - (Some(tx), Some(handle)) + (Some(AofWriterPool::top_level(tx)), Some(handle)) } else { (None, None) }; @@ -263,7 +266,7 @@ pub async fn run_embedded( let consumers = mesh.take_consumers(id); let conn_rx = mesh.take_conn_rx(id); let shard_cancel = cancel.clone(); - let shard_aof_tx = aof_tx.clone(); + let shard_aof_pool = aof_pool.clone(); let shard_bind_addr = bind_addr.clone(); let shard_persistence_dir = persistence_dir.clone(); let shard_snap_rx = snap_rx.clone(); @@ -293,7 +296,7 @@ pub async fn run_embedded( consumers, producers, shard_cancel, - shard_aof_tx, + shard_aof_pool, Some(shard_bind_addr), shard_persistence_dir, shard_snap_rx, @@ -354,10 +357,12 @@ pub async fn run_embedded( // Listener exited (cancel fired or fatal error). Shutdown ordering: // 1. cancel.cancel() — stops shard accept loops + producers. - // 2. Join shard threads — drops every shard-held `aof_tx` clone. - // 3. Drop our outer `aof_tx` — last sender goes away, the AOF writer's - // `recv_async()` returns `Err(_)` and the task flushes + fsyncs - // before exiting (see `aof::aof_writer_task` Err arm). + // 2. Join shard threads — drops every shard-held `Arc` + // clone (each `ConnectionContext` and each `shard_aof_pool` capture). + // 3. Drop our outer `aof_pool` Arc — the last reference goes away, the + // pool's `Drop` runs, dropping the underlying `Vec`. The + // AOF writer's `recv_async()` returns `Err(_)` and the task flushes + // + fsyncs before exiting (see `aof::aof_writer_task` Err arm). // 4. Join the AOF thread. // // This sequencing fixes Qodo bug #5: sending `AofMessage::Shutdown` before @@ -381,8 +386,9 @@ pub async fn run_embedded( } } - // Drop the last AOF sender so the writer's recv loop sees channel close. - drop(aof_tx); + // Drop the last `Arc` so the underlying senders close, + // triggering the writer's recv loop to drain + fsync + exit. + drop(aof_pool); let aof_panic = if let Some(handle) = aof_join { match handle.join() { diff --git a/src/server/listener.rs b/src/server/listener.rs index 607aef54..37300cf2 100644 --- a/src/server/listener.rs +++ b/src/server/listener.rs @@ -13,7 +13,7 @@ use tracing::{debug, error, info}; use crate::command::connection as conn_cmd; use crate::config::ServerConfig; #[cfg(feature = "runtime-tokio")] -use crate::persistence::aof::{self, AofMessage, FsyncPolicy}; +use crate::persistence::aof::{self, AofMessage, AofWriterPool, FsyncPolicy}; #[cfg(feature = "runtime-tokio")] use crate::persistence::rdb; #[cfg(feature = "runtime-tokio")] @@ -115,15 +115,17 @@ pub async fn run_with_shutdown( let config = Arc::new(config); - // Set up AOF writer task if appendonly is enabled - let aof_tx: Option> = if config.appendonly == "yes" { + // Set up AOF writer task + wrap the sender as a TopLevel `AofWriterPool` + // (step 2f-α). Single-shard tokio path: `handler_single` receives the pool + // directly; per-shard layout is unreachable from this listener. + let aof_pool: Option> = if config.appendonly == "yes" { let (tx, rx) = channel::mpsc_bounded::(10_000); let aof_token = token.child_token(); let fsync = FsyncPolicy::from_str(&config.appendfsync); let aof_file_path = PathBuf::from(&config.dir).join(&config.appendfilename); tokio::spawn(aof::aof_writer_task(rx, aof_file_path, fsync, aof_token)); info!("AOF enabled with fsync policy: {:?}", fsync); - Some(tx) + Some(AofWriterPool::top_level(tx)) } else { None }; @@ -235,7 +237,7 @@ pub async fn run_with_shutdown( let conn_token = token.child_token(); let requirepass = config.requirepass.clone(); let config = config.clone(); - let aof_tx = aof_tx.clone(); + let aof_pool_conn = aof_pool.clone(); let change_counter = Some(change_counter.clone()); let pubsub = pubsub_registry.clone(); let rt_config = runtime_config.clone(); @@ -249,7 +251,7 @@ pub async fn run_with_shutdown( let gs = graph_store.clone(); tokio::spawn(connection::handle_connection( stream, db, conn_token, requirepass, config, - aof_tx, change_counter, pubsub, rt_config, + aof_pool_conn, change_counter, pubsub, rt_config, tracking, cid, Some(rs), acl, Some(vs), Some(ts), #[cfg(feature = "graph")] @@ -263,9 +265,12 @@ pub async fn run_with_shutdown( } _ = token.cancelled() => { info!("Server shutting down"); - // Send shutdown to AOF writer - if let Some(ref tx) = aof_tx { - let _ = tx.send_async(AofMessage::Shutdown).await; + // Fan out shutdown to every AOF writer (single writer under TopLevel, + // one-per-shard under PerShard — step 2f-β). `broadcast_shutdown` + // is `try_send`-based so the writer must be draining; under tokio + // listener it always is. + if let Some(ref pool) = aof_pool { + pool.broadcast_shutdown(); } break; } diff --git a/src/shard/conn_accept.rs b/src/shard/conn_accept.rs index 432d65aa..006fc885 100644 --- a/src/shard/conn_accept.rs +++ b/src/shard/conn_accept.rs @@ -104,7 +104,7 @@ pub(crate) fn spawn_tokio_connection( pubsub_arc: &Arc>, blocking_rc: &Rc>, shutdown: &CancellationToken, - aof_tx: &Option>, + aof_pool: &Option>, tracking_rc: &Rc>, lua_rc: &Rc>>>, script_cache_rc: &Rc>, @@ -134,7 +134,6 @@ pub(crate) fn spawn_tokio_connection( let psr = pubsub_arc.clone(); let blk = blocking_rc.clone(); let sd = shutdown.clone(); - let aof = aof_tx.clone(); let trk = tracking_rc.clone(); let cid = conn_cmd::next_client_id(); let rs = repl_state.clone(); @@ -170,7 +169,10 @@ pub(crate) fn spawn_tokio_connection( set_tcp_keepalive(tcp_stream.as_raw_fd(), tcp_keepalive_secs); } - // Construct ConnectionContext from cloned shared state + // Construct ConnectionContext from cloned shared state. The pool is + // already built by the spawn site (main.rs / listener.rs / embedded.rs) + // and threaded through `Shard::run` — we just clone the Arc once here. + let pool_for_ctx = aof_pool.as_ref().map(Arc::clone); let conn_ctx = crate::server::conn::ConnectionContext::new( sdbs, shard_id, @@ -178,7 +180,7 @@ pub(crate) fn spawn_tokio_connection( psr, blk, reqpass, - aof, + pool_for_ctx, trk, rs, cs, @@ -272,7 +274,7 @@ pub(crate) fn spawn_migrated_tokio_connection( pubsub_arc: &Arc>, blocking_rc: &Rc>, shutdown: &CancellationToken, - aof_tx: &Option>, + aof_pool: &Option>, tracking_rc: &Rc>, lua_rc: &Rc>>>, script_cache_rc: &Rc>, @@ -326,7 +328,6 @@ pub(crate) fn spawn_migrated_tokio_connection( let psr = pubsub_arc.clone(); let blk = blocking_rc.clone(); let sd = shutdown.clone(); - let aof = aof_tx.clone(); let trk = tracking_rc.clone(); let cid = state.client_id; let rs = repl_state.clone(); @@ -354,6 +355,8 @@ pub(crate) fn spawn_migrated_tokio_connection( let migration_buf = take_migration_read_buf(&mut state); + // Pool is built by the spawn site and threaded through here. + let pool_for_ctx = aof_pool.as_ref().map(Arc::clone); let conn_ctx = crate::server::conn::ConnectionContext::new( sdbs, shard_id, @@ -361,7 +364,7 @@ pub(crate) fn spawn_migrated_tokio_connection( psr, blk, None, // requirepass: None = pre-authenticated - aof, + pool_for_ctx, trk, rs, cs, @@ -421,7 +424,7 @@ pub(crate) fn spawn_monoio_connection( pubsub_arc: &Arc>, blocking_rc: &Rc>, shutdown: &CancellationToken, - aof_tx: &Option>, + aof_pool: &Option>, tracking_rc: &Rc>, lua_rc: &Rc>>>, script_cache_rc: &Rc>, @@ -464,7 +467,6 @@ pub(crate) fn spawn_monoio_connection( let psr = pubsub_arc.clone(); let blk = blocking_rc.clone(); let sd = shutdown.clone(); - let aof = aof_tx.clone(); let trk = tracking_rc.clone(); let cid = conn_cmd::next_client_id(); let rs = repl_state.clone(); @@ -500,12 +502,14 @@ pub(crate) fn spawn_monoio_connection( .map(|a| a.to_string()) .unwrap_or_else(|_| "unknown".to_string()); - // Construct ConnectionContext from cloned shared state + // Construct ConnectionContext from cloned shared state. Pool is + // built by the spawn site and threaded through here. let reqpass = rtcfg.read().requirepass.clone(); + let pool_for_ctx = aof_pool.as_ref().map(Arc::clone); let conn_ctx = crate::server::conn::ConnectionContext::new( - sdbs, shard_id, num_shards, psr, blk, reqpass, aof, trk, rs, cs, lua, sc, cp, acl, - rtcfg, scfg, dtx, notifiers, snap_tx, clk, rsm, all_regs, all_rsm, aff, spill_tx, - spill_fid, do_dir, + sdbs, shard_id, num_shards, psr, blk, reqpass, pool_for_ctx, trk, rs, cs, lua, + sc, cp, acl, rtcfg, scfg, dtx, notifiers, snap_tx, clk, rsm, all_regs, all_rsm, + aff, spill_tx, spill_fid, do_dir, ); let maxclients = conn_ctx.runtime_config.read().maxclients; @@ -713,7 +717,7 @@ pub(crate) fn spawn_migrated_monoio_connection( pubsub_arc: &Arc>, blocking_rc: &Rc>, shutdown: &CancellationToken, - aof_tx: &Option>, + aof_pool: &Option>, tracking_rc: &Rc>, lua_rc: &Rc>>>, script_cache_rc: &Rc>, @@ -766,7 +770,6 @@ pub(crate) fn spawn_migrated_monoio_connection( let psr = pubsub_arc.clone(); let blk = blocking_rc.clone(); let sd = shutdown.clone(); - let aof = aof_tx.clone(); let trk = tracking_rc.clone(); let cid = state.client_id; let rs = repl_state.clone(); @@ -802,11 +805,13 @@ pub(crate) fn spawn_migrated_monoio_connection( let migration_buf = take_migration_read_buf(&mut state); + // Pool is built by the spawn site and threaded through here. + let pool_for_ctx = aof_pool.as_ref().map(Arc::clone); let conn_ctx = crate::server::conn::ConnectionContext::new( sdbs, shard_id, num_shards, psr, blk, None, // requirepass: None = pre-authenticated - aof, trk, rs, cs, lua, sc, cp, acl, rtcfg, scfg, dtx, notifiers, snap_tx, clk, rsm, - all_regs, all_rsm, aff, spill_tx, spill_fid, do_dir, + pool_for_ctx, trk, rs, cs, lua, sc, cp, acl, rtcfg, scfg, dtx, notifiers, + snap_tx, clk, rsm, all_regs, all_rsm, aff, spill_tx, spill_fid, do_dir, ); monoio::spawn(async move { diff --git a/src/shard/event_loop.rs b/src/shard/event_loop.rs index 2b9de27d..5198d930 100644 --- a/src/shard/event_loop.rs +++ b/src/shard/event_loop.rs @@ -56,7 +56,7 @@ impl super::Shard { mut consumers: Vec>, producers: Vec>, shutdown: CancellationToken, - aof_tx: Option>, + aof_pool: Option>, bind_addr: Option, persistence_dir: Option, snapshot_trigger_rx: channel::WatchReceiver, @@ -1047,7 +1047,7 @@ impl super::Shard { conn_accept::spawn_tokio_connection( tcp_stream, false, &tls_config, &shard_databases, &dispatch_tx, &pubsub_arc, &blocking_rc, - &shutdown, &aof_tx, &tracking_rc, &lua_rc, &script_cache_rc, + &shutdown, &aof_pool, &tracking_rc, &lua_rc, &script_cache_rc, &acl_table, &runtime_config, &server_config, &all_notifiers, &snapshot_trigger_tx, &repl_state, &cluster_state, &cached_clock, &remote_sub_map_arc, &all_pubsub_registries, @@ -1097,7 +1097,7 @@ impl super::Shard { conn_accept::spawn_tokio_connection( tcp_stream, is_tls, &tls_config, &shard_databases, &dispatch_tx, &pubsub_arc, &blocking_rc, - &shutdown, &aof_tx, &tracking_rc, &lua_rc, &script_cache_rc, + &shutdown, &aof_pool, &tracking_rc, &lua_rc, &script_cache_rc, &acl_table, &runtime_config, &server_config, &all_notifiers, &snapshot_trigger_tx, &repl_state, &cluster_state, &cached_clock, &remote_sub_map_arc, &all_pubsub_registries, @@ -1180,7 +1180,7 @@ impl super::Shard { conn_accept::spawn_migrated_tokio_connection( fd, state, &shard_databases, &dispatch_tx, &pubsub_arc, &blocking_rc, - &shutdown, &aof_tx, &tracking_rc, &lua_rc, &script_cache_rc, + &shutdown, &aof_pool, &tracking_rc, &lua_rc, &script_cache_rc, &acl_table, &runtime_config, &server_config, &all_notifiers, &snapshot_trigger_tx, &repl_state, &cluster_state, &cached_clock, &remote_sub_map_arc, &all_pubsub_registries, @@ -1193,7 +1193,7 @@ impl super::Shard { conn_accept::spawn_migrated_monoio_connection( fd, state, &shard_databases, &dispatch_tx, &pubsub_arc, &blocking_rc, - &shutdown, &aof_tx, &tracking_rc, &lua_rc, &script_cache_rc, + &shutdown, &aof_pool, &tracking_rc, &lua_rc, &script_cache_rc, &acl_table, &runtime_config, &server_config, &all_notifiers, &snapshot_trigger_tx, &repl_state, &cluster_state, &cached_clock, &remote_sub_map_arc, &all_pubsub_registries, @@ -1277,7 +1277,7 @@ impl super::Shard { conn_accept::spawn_migrated_tokio_connection( fd, state, &shard_databases, &dispatch_tx, &pubsub_arc, &blocking_rc, - &shutdown, &aof_tx, &tracking_rc, &lua_rc, &script_cache_rc, + &shutdown, &aof_pool, &tracking_rc, &lua_rc, &script_cache_rc, &acl_table, &runtime_config, &server_config, &all_notifiers, &snapshot_trigger_tx, &repl_state, &cluster_state, &cached_clock, &remote_sub_map_arc, &all_pubsub_registries, @@ -1290,7 +1290,7 @@ impl super::Shard { conn_accept::spawn_migrated_monoio_connection( fd, state, &shard_databases, &dispatch_tx, &pubsub_arc, &blocking_rc, - &shutdown, &aof_tx, &tracking_rc, &lua_rc, &script_cache_rc, + &shutdown, &aof_pool, &tracking_rc, &lua_rc, &script_cache_rc, &acl_table, &runtime_config, &server_config, &all_notifiers, &snapshot_trigger_tx, &repl_state, &cluster_state, &cached_clock, &remote_sub_map_arc, &all_pubsub_registries, @@ -1697,7 +1697,7 @@ impl super::Shard { &pubsub_arc, &blocking_rc, &shutdown, - &aof_tx, + &aof_pool, &tracking_rc, &lua_rc, &script_cache_rc, @@ -1739,7 +1739,7 @@ impl super::Shard { &pubsub_arc, &blocking_rc, &shutdown, - &aof_tx, + &aof_pool, &tracking_rc, &lua_rc, &script_cache_rc, @@ -1921,7 +1921,7 @@ impl super::Shard { &pubsub_arc, &blocking_rc, &shutdown, - &aof_tx, + &aof_pool, &tracking_rc, &lua_rc, &script_cache_rc, diff --git a/tests/ft_search_multi_shard_as_of.rs b/tests/ft_search_multi_shard_as_of.rs index c8f8fef1..c2f93f11 100644 --- a/tests/ft_search_multi_shard_as_of.rs +++ b/tests/ft_search_multi_shard_as_of.rs @@ -46,6 +46,7 @@ fn build_config(port: u16, num_shards: usize) -> ServerConfig { databases: 16, requirepass: None, appendonly: "no".to_string(), + unsafe_multishard_aof: false, appendfsync: "everysec".to_string(), save: None, dir: ".".to_string(), diff --git a/tests/ft_search_temporal_parity.rs b/tests/ft_search_temporal_parity.rs index 065de5f9..1165f81a 100644 --- a/tests/ft_search_temporal_parity.rs +++ b/tests/ft_search_temporal_parity.rs @@ -58,6 +58,7 @@ fn build_config(port: u16, num_shards: usize) -> ServerConfig { databases: 16, requirepass: None, appendonly: "no".to_string(), + unsafe_multishard_aof: false, appendfsync: "everysec".to_string(), save: None, dir: ".".to_string(), diff --git a/tests/integration.rs b/tests/integration.rs index 1e6b51c8..d3c920ee 100644 --- a/tests/integration.rs +++ b/tests/integration.rs @@ -32,6 +32,7 @@ async fn start_server() -> (u16, CancellationToken) { databases: 16, requirepass: None, appendonly: "no".to_string(), + unsafe_multishard_aof: false, appendfsync: "everysec".to_string(), save: None, dir: ".".to_string(), @@ -131,6 +132,7 @@ async fn start_server_with_pass(password: &str) -> (u16, CancellationToken) { databases: 16, requirepass: Some(password.to_string()), appendonly: "no".to_string(), + unsafe_multishard_aof: false, appendfsync: "everysec".to_string(), save: None, dir: ".".to_string(), @@ -1302,6 +1304,7 @@ async fn start_server_with_persistence( databases: 16, requirepass: None, appendonly: appendonly.to_string(), + unsafe_multishard_aof: false, appendfsync: appendfsync.to_string(), save: None, dir: dir.to_string_lossy().to_string(), @@ -2185,6 +2188,7 @@ async fn start_server_with_maxmemory(maxmemory: usize, policy: &str) -> (u16, Ca databases: 16, requirepass: None, appendonly: "no".to_string(), + unsafe_multishard_aof: false, appendfsync: "everysec".to_string(), save: None, dir: ".".to_string(), @@ -2595,6 +2599,7 @@ async fn start_sharded_server(num_shards: usize) -> (u16, CancellationToken) { databases: 16, requirepass: None, appendonly: "no".to_string(), + unsafe_multishard_aof: false, appendfsync: "everysec".to_string(), save: None, dir: ".".to_string(), @@ -3785,6 +3790,7 @@ async fn start_cluster_server() -> (u16, CancellationToken) { databases: 16, requirepass: None, appendonly: "no".to_string(), + unsafe_multishard_aof: false, appendfsync: "everysec".to_string(), save: None, dir: ".".to_string(), @@ -4446,6 +4452,7 @@ async fn start_server_with_aclfile(acl_path: &str) -> (u16, CancellationToken) { databases: 16, requirepass: None, appendonly: "no".to_string(), + unsafe_multishard_aof: false, appendfsync: "everysec".to_string(), save: None, dir: ".".to_string(), diff --git a/tests/kill_snapshot.rs b/tests/kill_snapshot.rs index 0a96c81f..4141b13c 100644 --- a/tests/kill_snapshot.rs +++ b/tests/kill_snapshot.rs @@ -28,6 +28,7 @@ fn base_config(port: u16, num_shards: usize) -> ServerConfig { databases: 1, requirepass: None, appendonly: "no".to_string(), + unsafe_multishard_aof: false, appendfsync: "no".to_string(), save: None, dir: ".".to_string(), diff --git a/tests/mq_integration.rs b/tests/mq_integration.rs index 6423a8e8..0ed42b5d 100644 --- a/tests/mq_integration.rs +++ b/tests/mq_integration.rs @@ -44,6 +44,7 @@ async fn start_mq_server(num_shards: usize) -> (u16, CancellationToken) { databases: 16, requirepass: None, appendonly: "no".to_string(), + unsafe_multishard_aof: false, appendfsync: "everysec".to_string(), save: None, dir: ".".to_string(), diff --git a/tests/replication_test.rs b/tests/replication_test.rs index 3e36686c..634ee733 100644 --- a/tests/replication_test.rs +++ b/tests/replication_test.rs @@ -30,6 +30,7 @@ async fn start_server() -> (u16, CancellationToken) { databases: 16, requirepass: None, appendonly: "no".to_string(), + unsafe_multishard_aof: false, appendfsync: "everysec".to_string(), save: None, dir: dir_path, diff --git a/tests/txn_ft_search_snapshot.rs b/tests/txn_ft_search_snapshot.rs index f442afe1..edbdc710 100644 --- a/tests/txn_ft_search_snapshot.rs +++ b/tests/txn_ft_search_snapshot.rs @@ -46,6 +46,7 @@ fn build_config(port: u16, num_shards: usize) -> ServerConfig { databases: 16, requirepass: None, appendonly: "no".to_string(), + unsafe_multishard_aof: false, appendfsync: "everysec".to_string(), save: None, dir: ".".to_string(), diff --git a/tests/txn_kv_wiring.rs b/tests/txn_kv_wiring.rs index 4a7ec97c..00921f8e 100644 --- a/tests/txn_kv_wiring.rs +++ b/tests/txn_kv_wiring.rs @@ -49,6 +49,7 @@ async fn start_txn_server(num_shards: usize, persistence_dir: &str) -> (u16, Can databases: 16, requirepass: None, appendonly, + unsafe_multishard_aof: false, appendfsync: "everysec".to_string(), save: None, dir, diff --git a/tests/vacuum_commands.rs b/tests/vacuum_commands.rs index 4b37fc59..2d861220 100644 --- a/tests/vacuum_commands.rs +++ b/tests/vacuum_commands.rs @@ -35,6 +35,7 @@ fn base_config(port: u16) -> ServerConfig { databases: 1, requirepass: None, appendonly: "no".to_string(), + unsafe_multishard_aof: false, appendfsync: "no".to_string(), save: None, dir: ".".to_string(), diff --git a/tests/workspace_integration.rs b/tests/workspace_integration.rs index fb833edd..08e1e0c2 100644 --- a/tests/workspace_integration.rs +++ b/tests/workspace_integration.rs @@ -37,6 +37,7 @@ async fn start_workspace_server(num_shards: usize) -> (u16, CancellationToken) { databases: 16, requirepass: None, appendonly: "no".to_string(), + unsafe_multishard_aof: false, appendfsync: "everysec".to_string(), save: None, dir: ".".to_string(), @@ -254,6 +255,7 @@ async fn start_workspace_server_with_auth( databases: 16, requirepass: Some(password), appendonly: "no".to_string(), + unsafe_multishard_aof: false, appendfsync: "everysec".to_string(), save: None, dir: ".".to_string(),