Skip to content

fix: implement gateway fallback for empty ring in put, update and subscribe operations#4635

Open
netsirius wants to merge 1 commit into
mainfrom
fix/4365-put-update-bootstrap
Open

fix: implement gateway fallback for empty ring in put, update and subscribe operations#4635
netsirius wants to merge 1 commit into
mainfrom
fix/4365-put-update-bootstrap

Conversation

@netsirius

Copy link
Copy Markdown
Collaborator

Problem

A freshly started node has an empty ring (connections_by_location is empty) for its first minutes but already holds a live transient connection to its gateway. PUT/UPDATE/SUBSCRIBE peer-selection consulted only the ring, so during that window they ignored the gateway. #4364 fixed this for GET; the #4365 audit asks the other three to meet the same bar — in the bootstrap window an op must either work or fail with an explicit, retryable error, never silently misbehave.

Behavior on an empty ring before this change:

  • PUT — the relay stored the contract locally and finalized locally, so the client got a successful LocalCompletion while the contract never reached the network (no re-replication on this path; see fix: PUT finalizes far from contract location on sparse bootstrap topology, with no re-replication toward the contract neighborhood #4363). A false success.
  • UPDATE — if hosting, applied locally and broadcast to an empty ring (reached no one); if not hosting, returned NoHostingPeers. The hosting case has PUT's no-propagation gap.
  • SUBSCRIBE — already fail-fasts with NoHostingPeers, but its fallback scans only ring connections, so a node with a live transient gateway connection still gave up.

Note: the fix is not in drive_client_put_inner (its own_location() is only retry bookkeeping); the real empty-ring decision point is the relay's next-hop selection.

Solution

When connection_count() == 0, each op's real decision point falls back to a configured gateway instead of finalizing locally / failing — the same bounded fallback #4364 added to GET.

  • Shared helperselect_bootstrap_gateway (pure) and bootstrap_gateway_target (OpManager wrapper) lifted out of GET into operations/bootstrap.rs; all four ops now share one source of truth. GET behavior unchanged.
  • PUTdrive_relay_put / drive_relay_put_streaming forward via the gateway on an empty ring. The local store already ran, so the originator keeps its replica and the gateway relays the PUT toward the contract location. Bypasses the fix: PUT finalizes far from contract location on sparse bootstrap topology, with no re-replication toward the contract neighborhood #4363 terminus guard (no overshoot to guard against on an empty ring).
  • UPDATEdrive_client_update uses the gateway as target. Hosting → applies locally and forwards (propagates); not-hosting → the existing missing-params auto-fetch + retry primes the store from the gateway.
  • SUBSCRIBEprepare_initial_request tries the gateway before NoHostingPeers.

Wire delivery needs no new plumbing: OutboundMessageWithTarget resolves targets by socket addr against the transport map and re-dials configured gateways, independent of the message variant.

Boundary: a node that hasn't yet learned its own address (phase 1) keeps its current behavior; the fallback targets the dominant phase-2 window (address known, ring still empty). PUT's current_target stays own_location() there — unlike GET it is only driver-side telemetry, so the divergence is cosmetic (documented in code).

Testing

  • Selection logic (executed): unit tests in operations/bootstrap.rs assert empty-ring activation, self-skip, tried/visited exclusion, inactive-when-ring-non-empty, and no-configured-gateways → None.
  • Call sites (source pins): one per op, asserting both the call and the correct exclusion predicate (PUT new_skip_list.contains, UPDATE addr == sender_addr, SUBSCRIBE visited.probably_visited), so a no-op predicate trips the build.
  • No regression: for a lone gateway the fallback is a strict no-op (no non-self gateway → select_bootstrap_gateway returns None); isolated_node_regression.rs covers this on CI.
  • Delivery premise verified by code inspection (op-agnostic OutboundMessageWithTarget). cargo fmt, clippy -D warnings, and touched-module suites clean.
  • Adversarial audit: every empty-ring decision point across the four ops was mapped and skeptically reviewed; it confirmed the routing is correct and complete for the originator scope, that the connection_count() == 0 gate keys only on promoted ring connections (transient gateway connections are a disjoint map), and surfaced the residuals documented below.

Deferred (follow-up): an end-to-end behavioral test where a non-gateway node in its empty-ring window PUTs and a second node GETs the contract. A probe confirmed a small static sim can't reproduce the state — a lone node promotes its gateway into the ring (connection_count() settles at 1, not 0). The realistic shape is a multi-node simulation diagnostic with empty-ring windows occurring naturally, with a revert-to-verify gate — the way #4364's GET diagnostic worked.

Out of scope / follow-ups

The fix targets the originator bootstrap window. An adversarial audit of every empty-ring decision point across the four ops confirmed the routing is correct and complete in scope — and that adding the fallback to any other site would be either harmful (cross-gateway bounce / duplicate store / placement regression) or a separate concern. The residuals (all already fail explicitly in scope — none silently misbehave):

Fixes

Closes #4365

@netsirius netsirius requested a review from iduartgomez June 29, 2026 14:54
@github-actions

Copy link
Copy Markdown
Contributor

Rule Review: No blocking issues — one commit-message style note

Rules checked: git-workflow.md, code-style.md, testing.md, operations.md
Files reviewed: 8

Warnings

None.

Info

  • commits.txt:1 — Commit subject is 86 characters (fix: implement gateway fallback for empty ring in put, update and subscribe operations), exceeding the 72-character limit in git-workflow.md. (rule: git-workflow.md → "Is subject under 72 characters?")

All other rule-applicable items check out:

  • Module layout: operations/bootstrap.rs uses the new-style foo.rs layout, not mod.rs. pub(crate) mod bootstrap; declared in operations.rs in correct position. ✓
  • Error handling: No .unwrap() in production code. All unwrap/expect calls are confined to #[cfg(test)] blocks. ✓
  • fix: regression tests: The PR includes 5 behavioral unit tests in bootstrap.rs (covering empty ring, non-empty ring, self-exclusion, tried-gateway exclusion, and no-gateways case) plus source-grep pin tests for each newly wired call site in PUT, UPDATE, and SUBSCRIBE. The bootstrap_fallback_selects_gateway_when_ring_empty test is explicitly marked "Regression for fix: nightly test_get_reliability_diagnostic at 18.3% — ring is empty for most nodes during the op window, so no GET ever traverses the network (unmasked by #4352) #4361" and would catch the bug if select_bootstrap_gateway were broken. ✓
  • Edge-case coverage: Boundary values (ring_count=0 vs 1), error path (no configured gateways), and exclusion combinatorics (self, tried, all-tried) are all tested. ✓
  • No fire-and-forget spawns, no retry loops, no biased; selects, no push-before-send violations introduced. ✓
  • Import ordering in bootstrap.rs: std::crate:: — correct. ✓
  • Operations invariants: The fallback is applied at next-hop selection time (before any send), and the forward-before-caching ordering is not disturbed by the new None arm handling. ✓

Rule review against .claude/rules/. WARNING findings block merge.

@iduartgomez iduartgomez left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comprehensive PR Review: #4635

Summary

  • PR Title: fix: implement gateway fallback for empty ring in put, update and subscribe operations
  • Type: fix
  • CI Status: passing (all checks green on HEAD 9e0acf013)
  • Linked Issues: Closes #4365 (maintainer-authored audit naming PUT/UPDATE/SUBSCRIBE)
  • Review tier: Full (touches operations state machines + routing — high-risk surfaces)
  • Reviewers run: code-first, skeptical, testing, big-picture, plus a routing-loop/wire-compat lens. External model (codex) was unavailable (refresh_token_reused — needs re-login; gemini not installed), so per policy a diverse-Claude-lens fallback was substituted: 5 independent lenses, each blind to the others. Findings below are reconciled and each cited line was opened and verified.

Code-First Analysis

Independent Understanding: Generalizes GET's empty-ring gateway fallback (#4364) to PUT/UPDATE/SUBSCRIBE. New operations/bootstrap.rs holds select_bootstrap_gateway (pure) + bootstrap_gateway_target (OpManager wrapper), lifted verbatim from GET. When connection_count() == 0 each op forwards to a configured gateway instead of finalizing locally / failing.
Stated Intent: Matches the code faithfully — the description even pre-discloses the htl-gating asymmetry, the SUBSCRIBE ordering, the cosmetic current_target divergence, and the out-of-scope residuals.
Alignment: Matches, in scope. Closes #4365 within its stated originator scope.
Gaps: Two under-statements in the description (non-blocking): (1) SUBSCRIBE now routes to the gateway before the has_contract → LocallyComplete branch too, not just before NoHostingPeers (subscribe.rs:288 precedes :299/:308); (2) the apps/freenet-ping/Cargo.lock churn (+76/-2, ed25519/der/const-oid) is incidental fallout from the merged baked-key commits (b0a1b0683), unrelated to this fix.


Testing Assessment

Coverage Level: Selection core well covered; one call-site pin is defective.

Test Type Status Notes
Unit 5 pure tests in bootstrap.rs:120-191 — real fail-without-fix coverage of the selector (empty-ring activation, self-skip, tried-exclusion, inactive-when-non-empty, no-gateways→None)
Source-pin (per op) ⚠️ SUBSCRIBE + UPDATE + GET pins sound; PUT pin broken (see Must Fix #1)
Integration N/A
Simulation/E2E ❌ deferred end-to-end "empty-ring node PUTs, second node GETs" deferred — static sim can't reproduce (lone node self-promotes its gateway → connection_count() settles at 1, not 0)

Regression Test: Present for the selector; the per-op call-site guard is defective for the streaming PUT path (the single highest-value regression — false PUT success on empty ring — has no working guard at that site).
Missing Tests: A behavioral test driving drive_relay_put / drive_relay_put_streaming next-hop selection below the sim with connection_count()==0 + a configured gateway, asserting the gateway is chosen (covering both htl arms).


Skeptical Findings

Risk Level: Low (no correctness bug introduced); one test-integrity defect.

Concern Severity Location Details
Streaming-PUT source-pin guards nothing Must Fix put/op_ctx_task.rs:3292 \nasync fn body-end scan finds no unindented async fn after drive_relay_put_streaming<CB> (:2388), so "body" runs to EOF and swallows mod tests incl. the pin's own assertion literal (:3321). Verified empirically: neutering the streaming call site (:2469) to None leaves the test passing. Non-streaming arm (:1172→1799) IS correctly bounded.
htl==0 PUT comment justification imprecise Consider put/op_ctx_task.rs:1336-1345 Comment says the htl==0 forward is safe because the gateway "finalizes on its populated ring." Real termination guarantee in an all-empty-ring chain is skip_list exhaustion of configured gateways, not htl and not populated ring. Behavior is bounded & benign; the stated reason is incomplete.
UPDATE pin sound only by function ordering Consider update/op_ctx_task.rs:2123 Same fragile include_str!+\nasync fn pattern; passes today only because drive_client_update's body is bounded before the test module. Latent same flaw — harden with the GET-style production_source() truncation.

No-bug verifications (investigated, found safe):

  • connection_count() sums only connections_by_location (ring/connection_manager.rs:1363) — disjoint from the transient transport map, so the gate fires precisely in the bootstrap window as claimed.
  • Anti-bounce per protocol all confirmed: PUT carries monotonic new_skip_list (self+upstream), SUBSCRIBE marks gateway visited and propagates the bloom (subscribe.rs:322), UPDATE relay builds [self, sender] skip per hop. No path bounces back to originator. UPDATE bootstrap fallback is on the client driver only, not the HTL-less relay driver, so the admitted RequestUpdate-has-no-bloom risk is NOT reached by the empty-ring path.
  • select_bootstrap_gateway excludes self (own_addr) + caller's exclusion set; each call site passes a real predicate (bootstrap.rs:45-52).
  • OutboundMessageWithTarget resolves by socket addr and re-dials configured gateways before promotion, variant-agnostically (p2p_protoc.rs:1526-1642) — delivery to an unpromoted transient gateway works.
  • Step-1 local store precedes the forward decision in both relay PUT paths → gateway forward never produces a false success and the originator keeps its replica.

Big Picture Assessment

Goal Alignment: Yes — closes the maintainer-authored audit #4365 within originator scope; consolidates four ops onto one source of truth (reduces future drift).
Anti-Patterns Detected: None — no #[ignore], no weakened assertions, no swallowed errors. Adds explicit telemetry.
Removed Code Concerns: GET -150 is a verified clean extraction — helpers moved verbatim to bootstrap.rs, GET's 4 call sites byte-identical, GET source-pin survives. Net test coverage strictly increases (4 GET unit tests → 5 in bootstrap.rs + 3 new op pins). Nothing lost.
Scope Assessment: Focused, apart from the incidental apps/freenet-ping/Cargo.lock ed25519 sync (harmless, worth a one-line PR note).


Documentation

  • Code docs: complete (in-code comments thorough — one imprecise justification flagged above).
  • Architecture/rules: up-to-date; no .claude/rules/AGENTS.md reference rendered stale by the move. Optional: note operations/bootstrap.rs as the canonical empty-ring fallback in .claude/rules/operations.md.
  • User docs: n/a.

Recommendations

Must Fix (Blocking)

  1. Fix the broken relay_put_drivers_use_bootstrap_gateway_fallback pin (put/op_ctx_task.rs:3292). As written it does not guard the streaming PUT call site — empirically, deleting that fallback leaves the test green, silently reintroducing the exact false-success bug this PR fixes. Replace the "\nasync fn " EOF-scan with the GET-style approach: a production_source() helper that truncates at #[cfg(test)] before scanning (and/or brace-matched extract_fn_body). Reference pattern already in-repo: get/op_ctx_task.rs:3820/:3830.

Should Fix (Important)

  1. Add one behavioral test (not a source-pin) for the false-PUT-success regression: drive both drive_relay_put and drive_relay_put_streaming next-hop selection with connection_count()==0 + a configured gateway, assert the gateway is the chosen target — covering the htl==0 / htl>0 asymmetry the comments flag. This is the test that fails-without-fix; the pins are a proxy and one is defective.
  2. Harden the UPDATE pin (update/op_ctx_task.rs:2123) with the same production_source() truncation so it can't rot under a future function reorder.

Consider (Suggestions)

  1. Correct the htl==0 comment (put/op_ctx_task.rs:1336-1345) — termination is skip_list-bounded, not "populated ring"-bounded.
  2. PR-body note that apps/freenet-ping/Cargo.lock is an incidental ed25519 lockfile sync from b0a1b0683, so the crypto-dep churn in a bootstrap PR isn't surprising.
  3. PR-body note that SUBSCRIBE now also reorders ahead of the local-hosting LocallyComplete case (intended — a local holder must still join the subscription tree per subscribe.rs:207-214; the old local-complete was effectively a missed-updates bug). Adds gateway-timeout latency when the gateway is unreachable — confirm that tradeoff is intended.

Verdict

State: Needs Changes — Light Re-Check Sufficient
HEAD SHA reviewed: 9e0acf0

The production code is sound: no high/medium correctness bug, anti-bounce verified on every path, the GET extraction is clean, net coverage increases. The one blocking item is a test-integrity defect — the streaming-PUT source-pin guards nothing (verified empirically), so the most important regression can be reintroduced without CI catching it. Fixing the pin (Must Fix #1) is a small, isolated test-only change; a diff-of-the-diff re-check on that fix is sufficient rather than a full re-review. Should-Fix #2 (a real behavioral test) is the durable closure for the deferred coverage. Note: this PR is already maintainer-approved (@iduartgomez); the production behavior is fine — the asks are about test guards, not the fix itself.

[AI-assisted - Claude]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

audit: PUT and UPDATE bootstrap behavior on an empty ring (GET fixed in #4364)

2 participants