fix(pubsub): bound consecutive unhealthy backend deaths#3863

Open
latifkasuli wants to merge 1 commit into alloy-rs:main from latifkasuli:fix/ws-infinite-retry-on-unhealthy-backend
Summary

Fixes #3821.

The retry budget (max_retries) currently counts only failed reconnect dial attempts inside reconnect_with_retries(). A backend that reconnects successfully at the WebSocket level but dies immediately (close frame, invalid text, dropped channel) resets the local retry_count on every cycle, so the service loops forever without producing any valid pubsub traffic. Callers such as provider.get_block_number().await hang indefinitely.

This PR adds a service-level counter for consecutive backend deaths where no valid PubSubItem was ever received. A backend that delivers at least one item before dying resets the streak. Once the streak reaches max_retries, the service shuts down instead of reconnecting, which lets pending callers resolve with a transport error instead of hanging.

Changes

crates/pubsub/src/service.rs:

  • Add consecutive_unhealthy_backend_deaths and backend_had_progress fields to PubSubService.
  • Mark progress (backend_had_progress = true) on every valid PubSubItem received in handle_item.
  • Add record_backend_death_and_check_budget() helper: if the backend had progress, reset the streak to 0; otherwise increment it. If the streak reaches max_retries, return a terminal error.
  • Call the helper before every reconnect_with_retries() invocation in the spawn loop (channel close, error oneshot, and backend-gone dispatch paths).
  • Add 2 new unit tests:
    • consecutive_unhealthy_deaths_exhaust_budget: service terminates after N deaths without progress.
    • healthy_backend_resets_death_counter: a backend that produces traffic resets the counter.
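
The bookkeeping described above can be sketched in isolation. This is a minimal illustration, not the code in crates/pubsub/src/service.rs: `DeathTracker` and `BudgetExhausted` are hypothetical stand-ins for the new `PubSubService` fields and its terminal error, and clearing `backend_had_progress` after each death is an assumption about how the flag is re-armed for the next backend.

```rust
/// Hypothetical stand-in for the two fields this PR adds to `PubSubService`.
#[derive(Debug)]
pub struct DeathTracker {
    max_retries: u32,
    consecutive_unhealthy_backend_deaths: u32,
    backend_had_progress: bool,
}

/// Hypothetical terminal error; the real service returns a transport error.
#[derive(Debug, PartialEq, Eq)]
pub struct BudgetExhausted {
    pub deaths: u32,
}

impl DeathTracker {
    pub fn new(max_retries: u32) -> Self {
        Self {
            max_retries,
            consecutive_unhealthy_backend_deaths: 0,
            backend_had_progress: false,
        }
    }

    /// Called for every valid item received from the current backend.
    pub fn mark_progress(&mut self) {
        self.backend_had_progress = true;
    }

    /// Called once per backend death, before attempting a reconnect.
    pub fn record_backend_death_and_check_budget(&mut self) -> Result<(), BudgetExhausted> {
        if self.backend_had_progress {
            // The backend delivered at least one item: reset the streak.
            self.consecutive_unhealthy_backend_deaths = 0;
        } else {
            self.consecutive_unhealthy_backend_deaths += 1;
        }
        // Assumption: the next backend starts without recorded progress.
        self.backend_had_progress = false;

        if self.consecutive_unhealthy_backend_deaths >= self.max_retries {
            return Err(BudgetExhausted {
                deaths: self.consecutive_unhealthy_backend_deaths,
            });
        }
        Ok(())
    }
}
```

In this sketch a caller would invoke `record_backend_death_and_check_budget()` on each of the three death paths and only proceed to reconnect on `Ok(())`.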

crates/provider/tests/it/ws.rs:

  • ws_close_frame_exhausts_retry_budget: local WS server accepts upgrade, reads one request, sends a close frame. Provider call returns Err within the timeout instead of looping.
  • ws_invalid_text_exhausts_retry_budget: local WS server accepts upgrade, reads one request, echoes non-JSON-RPC text. Same bounded termination.

Non-goals

Per-request replay accounting (tracking how many times a specific in-flight request has been replayed across reconnects) is intentionally out of scope. In a mixed-traffic scenario, unrelated valid responses can still mask a starving individual request. That is a good follow-up but orthogonal to this fix.
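
The masking scenario can be illustrated with a small simulation. This is a sketch under stated assumptions, not code from the PR: `deaths_until_shutdown` is a hypothetical helper that models each reconnect cycle as "backend optionally delivers one unrelated valid item, never answers the starving request, then dies".

```rust
/// Hypothetical simulation of the streak logic across reconnect cycles.
///
/// Returns `Some(n)` if the service would shut down after the n-th backend
/// death, or `None` if it is still reconnecting after `cycle_cap` cycles.
pub fn deaths_until_shutdown(
    max_retries: u32,
    unrelated_item_per_cycle: bool,
    cycle_cap: u32,
) -> Option<u32> {
    let mut streak = 0u32;
    for cycle in 1..=cycle_cap {
        // Any valid item counts as progress, even if it is unrelated to
        // the request that is waiting for a response.
        let had_progress = unrelated_item_per_cycle;
        if had_progress {
            streak = 0;
        } else {
            streak += 1;
        }
        if streak >= max_retries {
            return Some(cycle);
        }
    }
    None
}
```

With `max_retries = 3`, a silent backend is cut off at the third death, while a backend that emits one unrelated item per cycle is never cut off within the cap, which is the gap that per-request replay accounting would close.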

Test plan

  • cargo test -p alloy-pubsub: 3 tests pass (1 existing + 2 new)
  • cargo clippy -p alloy-pubsub -- -D warnings: clean
  • cargo test -p alloy-provider --test it --features ws -- ws::ws_close_frame_exhausts_retry_budget ws::ws_invalid_text_exhausts_retry_budget: 2 new integration tests pass
  • cargo test -p alloy-provider --test it --features ws -- ws::test_subscription_race_condition ws::ws_basic_auth_from_url: existing mock-server tests still pass

The retry budget (`max_retries`) currently only counts failed reconnect
dial attempts. A backend that successfully reconnects at the WS level
but immediately dies (close frame, invalid text, dropped channel) resets
the local `retry_count` on each cycle, causing the service to loop
forever without producing any valid pubsub traffic.

Track consecutive backend deaths where no valid `PubSubItem` was ever
received. A backend that delivers at least one item before dying resets
the streak. Once the streak reaches `max_retries`, the service shuts
down instead of reconnecting, which lets pending callers resolve with a
transport error instead of hanging indefinitely.

Closes alloy-rs#3821
@latifkasuli latifkasuli force-pushed the fix/ws-infinite-retry-on-unhealthy-backend branch from a77c509 to 0d55b19 Compare April 5, 2026 10:38

Development

Successfully merging this pull request may close these issues.

[Bug] WebSocket infinitely retries when server returns invalid protocol