fix: implement gateway fallback for empty ring in put, update and subscribe operations#4635
fix: implement gateway fallback for empty ring in put, update and subscribe operations#4635netsirius wants to merge 1 commit into
Conversation
…scribe operations
Rule Review: No blocking issues — one commit-message style noteRules checked: git-workflow.md, code-style.md, testing.md, operations.md WarningsNone. Info
All other rule-applicable items check out:
Rule review against |
iduartgomez
left a comment
There was a problem hiding this comment.
Comprehensive PR Review: #4635
Summary
- PR Title: fix: implement gateway fallback for empty ring in put, update and subscribe operations
- Type: fix
- CI Status: passing (all checks green on HEAD
9e0acf013) - Linked Issues: Closes #4365 (maintainer-authored audit naming PUT/UPDATE/SUBSCRIBE)
- Review tier: Full (touches operations state machines + routing — high-risk surfaces)
- Reviewers run: code-first, skeptical, testing, big-picture, plus a routing-loop/wire-compat lens. External model (codex) was unavailable (
refresh_token_reused— needs re-login; gemini not installed), so per policy a diverse-Claude-lens fallback was substituted: 5 independent lenses, each blind to the others. Findings below are reconciled and each cited line was opened and verified.
Code-First Analysis
Independent Understanding: Generalizes GET's empty-ring gateway fallback (#4364) to PUT/UPDATE/SUBSCRIBE. New operations/bootstrap.rs holds select_bootstrap_gateway (pure) + bootstrap_gateway_target (OpManager wrapper), lifted verbatim from GET. When connection_count() == 0 each op forwards to a configured gateway instead of finalizing locally / failing.
Stated Intent: Matches the code faithfully — the description even pre-discloses the htl-gating asymmetry, the SUBSCRIBE ordering, the cosmetic current_target divergence, and the out-of-scope residuals.
Alignment: Matches, in scope. Closes #4365 within its stated originator scope.
Gaps: Two under-statements in the description (non-blocking): (1) SUBSCRIBE now routes to the gateway before the has_contract → LocallyComplete branch too, not just before NoHostingPeers (subscribe.rs:288 precedes :299/:308); (2) the apps/freenet-ping/Cargo.lock churn (+76/-2, ed25519/der/const-oid) is incidental fallout from the merged baked-key commits (b0a1b0683), unrelated to this fix.
Testing Assessment
Coverage Level: Selection core well covered; one call-site pin is defective.
| Test Type | Status | Notes |
|---|---|---|
| Unit | ✅ | 5 pure tests in bootstrap.rs:120-191 — real fail-without-fix coverage of the selector (empty-ring activation, self-skip, tried-exclusion, inactive-when-non-empty, no-gateways→None) |
| Source-pin (per op) | SUBSCRIBE + UPDATE + GET pins sound; PUT pin broken (see Must Fix #1) | |
| Integration | N/A | — |
| Simulation/E2E | ❌ deferred | end-to-end "empty-ring node PUTs, second node GETs" deferred — static sim can't reproduce (lone node self-promotes its gateway → connection_count() settles at 1, not 0) |
Regression Test: Present for the selector; the per-op call-site guard is defective for the streaming PUT path (the single highest-value regression — false PUT success on empty ring — has no working guard at that site).
Missing Tests: A behavioral test driving drive_relay_put / drive_relay_put_streaming next-hop selection below the sim with connection_count()==0 + a configured gateway, asserting the gateway is chosen (covering both htl arms).
Skeptical Findings
Risk Level: Low (no correctness bug introduced); one test-integrity defect.
| Concern | Severity | Location | Details |
|---|---|---|---|
| Streaming-PUT source-pin guards nothing | Must Fix | put/op_ctx_task.rs:3292 |
\nasync fn body-end scan finds no unindented async fn after drive_relay_put_streaming<CB> (:2388), so "body" runs to EOF and swallows mod tests incl. the pin's own assertion literal (:3321). Verified empirically: neutering the streaming call site (:2469) to None leaves the test passing. Non-streaming arm (:1172→1799) IS correctly bounded. |
| htl==0 PUT comment justification imprecise | Consider | put/op_ctx_task.rs:1336-1345 |
Comment says the htl==0 forward is safe because the gateway "finalizes on its populated ring." Real termination guarantee in an all-empty-ring chain is skip_list exhaustion of configured gateways, not htl and not populated ring. Behavior is bounded & benign; the stated reason is incomplete. |
| UPDATE pin sound only by function ordering | Consider | update/op_ctx_task.rs:2123 |
Same fragile include_str!+\nasync fn pattern; passes today only because drive_client_update's body is bounded before the test module. Latent same flaw — harden with the GET-style production_source() truncation. |
No-bug verifications (investigated, found safe):
connection_count()sums onlyconnections_by_location(ring/connection_manager.rs:1363) — disjoint from the transient transport map, so the gate fires precisely in the bootstrap window as claimed.- Anti-bounce per protocol all confirmed: PUT carries monotonic
new_skip_list(self+upstream), SUBSCRIBE marks gatewayvisitedand propagates the bloom (subscribe.rs:322), UPDATE relay builds[self, sender]skip per hop. No path bounces back to originator. UPDATE bootstrap fallback is on the client driver only, not the HTL-less relay driver, so the admittedRequestUpdate-has-no-bloom risk is NOT reached by the empty-ring path. select_bootstrap_gatewayexcludes self (own_addr) + caller's exclusion set; each call site passes a real predicate (bootstrap.rs:45-52).OutboundMessageWithTargetresolves by socket addr and re-dials configured gateways before promotion, variant-agnostically (p2p_protoc.rs:1526-1642) — delivery to an unpromoted transient gateway works.- Step-1 local store precedes the forward decision in both relay PUT paths → gateway forward never produces a false success and the originator keeps its replica.
Big Picture Assessment
Goal Alignment: Yes — closes the maintainer-authored audit #4365 within originator scope; consolidates four ops onto one source of truth (reduces future drift).
Anti-Patterns Detected: None — no #[ignore], no weakened assertions, no swallowed errors. Adds explicit telemetry.
Removed Code Concerns: GET -150 is a verified clean extraction — helpers moved verbatim to bootstrap.rs, GET's 4 call sites byte-identical, GET source-pin survives. Net test coverage strictly increases (4 GET unit tests → 5 in bootstrap.rs + 3 new op pins). Nothing lost.
Scope Assessment: Focused, apart from the incidental apps/freenet-ping/Cargo.lock ed25519 sync (harmless, worth a one-line PR note).
Documentation
- Code docs: complete (in-code comments thorough — one imprecise justification flagged above).
- Architecture/rules: up-to-date; no
.claude/rules/AGENTS.mdreference rendered stale by the move. Optional: noteoperations/bootstrap.rsas the canonical empty-ring fallback in.claude/rules/operations.md. - User docs: n/a.
Recommendations
Must Fix (Blocking)
- Fix the broken
relay_put_drivers_use_bootstrap_gateway_fallbackpin (put/op_ctx_task.rs:3292). As written it does not guard the streaming PUT call site — empirically, deleting that fallback leaves the test green, silently reintroducing the exact false-success bug this PR fixes. Replace the"\nasync fn "EOF-scan with the GET-style approach: aproduction_source()helper that truncates at#[cfg(test)]before scanning (and/or brace-matchedextract_fn_body). Reference pattern already in-repo:get/op_ctx_task.rs:3820/:3830.
Should Fix (Important)
- Add one behavioral test (not a source-pin) for the false-PUT-success regression: drive both
drive_relay_putanddrive_relay_put_streamingnext-hop selection withconnection_count()==0+ a configured gateway, assert the gateway is the chosen target — covering the htl==0 / htl>0 asymmetry the comments flag. This is the test that fails-without-fix; the pins are a proxy and one is defective. - Harden the UPDATE pin (
update/op_ctx_task.rs:2123) with the sameproduction_source()truncation so it can't rot under a future function reorder.
Consider (Suggestions)
- Correct the htl==0 comment (
put/op_ctx_task.rs:1336-1345) — termination is skip_list-bounded, not "populated ring"-bounded. - PR-body note that
apps/freenet-ping/Cargo.lockis an incidental ed25519 lockfile sync fromb0a1b0683, so the crypto-dep churn in a bootstrap PR isn't surprising. - PR-body note that SUBSCRIBE now also reorders ahead of the local-hosting
LocallyCompletecase (intended — a local holder must still join the subscription tree persubscribe.rs:207-214; the old local-complete was effectively a missed-updates bug). Adds gateway-timeout latency when the gateway is unreachable — confirm that tradeoff is intended.
Verdict
State: Needs Changes — Light Re-Check Sufficient
HEAD SHA reviewed: 9e0acf0
The production code is sound: no high/medium correctness bug, anti-bounce verified on every path, the GET extraction is clean, net coverage increases. The one blocking item is a test-integrity defect — the streaming-PUT source-pin guards nothing (verified empirically), so the most important regression can be reintroduced without CI catching it. Fixing the pin (Must Fix #1) is a small, isolated test-only change; a diff-of-the-diff re-check on that fix is sufficient rather than a full re-review. Should-Fix #2 (a real behavioral test) is the durable closure for the deferred coverage. Note: this PR is already maintainer-approved (@iduartgomez); the production behavior is fine — the asks are about test guards, not the fix itself.
[AI-assisted - Claude]
Problem
A freshly started node has an empty ring (
connections_by_locationis empty) for its first minutes but already holds a live transient connection to its gateway. PUT/UPDATE/SUBSCRIBE peer-selection consulted only the ring, so during that window they ignored the gateway. #4364 fixed this for GET; the #4365 audit asks the other three to meet the same bar — in the bootstrap window an op must either work or fail with an explicit, retryable error, never silently misbehave.Behavior on an empty ring before this change:
LocalCompletionwhile the contract never reached the network (no re-replication on this path; see fix: PUT finalizes far from contract location on sparse bootstrap topology, with no re-replication toward the contract neighborhood #4363). A false success.NoHostingPeers. The hosting case has PUT's no-propagation gap.NoHostingPeers, but its fallback scans only ring connections, so a node with a live transient gateway connection still gave up.Note: the fix is not in
drive_client_put_inner(itsown_location()is only retry bookkeeping); the real empty-ring decision point is the relay's next-hop selection.Solution
When
connection_count() == 0, each op's real decision point falls back to a configured gateway instead of finalizing locally / failing — the same bounded fallback #4364 added to GET.select_bootstrap_gateway(pure) andbootstrap_gateway_target(OpManagerwrapper) lifted out of GET intooperations/bootstrap.rs; all four ops now share one source of truth. GET behavior unchanged.drive_relay_put/drive_relay_put_streamingforward via the gateway on an empty ring. The local store already ran, so the originator keeps its replica and the gateway relays the PUT toward the contract location. Bypasses the fix: PUT finalizes far from contract location on sparse bootstrap topology, with no re-replication toward the contract neighborhood #4363 terminus guard (no overshoot to guard against on an empty ring).drive_client_updateuses the gateway as target. Hosting → applies locally and forwards (propagates); not-hosting → the existing missing-params auto-fetch + retry primes the store from the gateway.prepare_initial_requesttries the gateway beforeNoHostingPeers.Wire delivery needs no new plumbing:
OutboundMessageWithTargetresolves targets by socket addr against the transport map and re-dials configured gateways, independent of the message variant.Boundary: a node that hasn't yet learned its own address (phase 1) keeps its current behavior; the fallback targets the dominant phase-2 window (address known, ring still empty). PUT's
current_targetstaysown_location()there — unlike GET it is only driver-side telemetry, so the divergence is cosmetic (documented in code).Testing
operations/bootstrap.rsassert empty-ring activation, self-skip, tried/visited exclusion, inactive-when-ring-non-empty, and no-configured-gateways →None.new_skip_list.contains, UPDATEaddr == sender_addr, SUBSCRIBEvisited.probably_visited), so a no-op predicate trips the build.select_bootstrap_gatewayreturnsNone);isolated_node_regression.rscovers this on CI.OutboundMessageWithTarget).cargo fmt,clippy -D warnings, and touched-module suites clean.connection_count() == 0gate keys only on promoted ring connections (transient gateway connections are a disjoint map), and surfaced the residuals documented below.Deferred (follow-up): an end-to-end behavioral test where a non-gateway node in its empty-ring window PUTs and a second node GETs the contract. A probe confirmed a small static sim can't reproduce the state — a lone node promotes its gateway into the ring (
connection_count()settles at 1, not 0). The realistic shape is a multi-node simulation diagnostic with empty-ring windows occurring naturally, with a revert-to-verify gate — the way #4364's GET diagnostic worked.Out of scope / follow-ups
The fix targets the originator bootstrap window. An adversarial audit of every empty-ring decision point across the four ops confirmed the routing is correct and complete in scope — and that adding the fallback to any other site would be either harmful (cross-gateway bounce / duplicate store / placement regression) or a separate concern. The residuals (all already fail explicitly in scope — none silently misbehave):
drive_relay_request_update,drive_relay_subscribe) and SUBSCRIBE's retry (advance_to_next_peer_impl) are not gateway-aware. They already fail explicitly (NoHostingPeers/NotFound) and are out of scope (a relay holds a connection from its sender). Extending them is not free parity:UpdateMsg::RequestUpdatecarries no HTL/visited bloom, so a gateway fallback there would risk a bounce until loop-prevention is added.drive_relay_put's empty-ring arm can forward to a gateway athtl == 0while the streaming path gates onhtl > 0. Benign (single bounded hop, no false success, better placement than this empty-ring node) and documented in-code; harmonizing has no correctness benefit.Fixes
Closes #4365