fix(pubsub): bound consecutive unhealthy backend deaths #3863
Open
latifkasuli wants to merge 1 commit into alloy-rs:main from
The retry budget (`max_retries`) currently only counts failed reconnect dial attempts. A backend that successfully reconnects at the WS level but immediately dies (close frame, invalid text, dropped channel) resets the local `retry_count` on each cycle, causing the service to loop forever without producing any valid pubsub traffic.

Track consecutive backend deaths where no valid `PubSubItem` was ever received. A backend that delivers at least one item before dying resets the streak. Once the streak reaches `max_retries`, the service shuts down instead of reconnecting, which lets pending callers resolve with a transport error instead of hanging indefinitely.

Closes alloy-rs#3821
a77c509 to 0d55b19
Summary
Fixes #3821.
The retry budget (`max_retries`) currently only counts failed reconnect dial attempts inside `reconnect_with_retries()`. A backend that successfully reconnects at the WebSocket level but immediately dies -- close frame, invalid text, dropped channel -- resets the local `retry_count` on each cycle, causing the service to loop forever without producing any valid pubsub traffic. Callers like `provider.get_block_number().await` hang indefinitely.

This PR adds a service-level counter for consecutive backend deaths where no valid `PubSubItem` was ever received. A backend that delivers at least one item before dying resets the streak. Once the streak reaches `max_retries`, the service shuts down instead of reconnecting, which lets pending callers resolve with a transport error instead of hanging.

Changes
`crates/pubsub/src/service.rs`:
- Add `consecutive_unhealthy_backend_deaths` and `backend_had_progress` fields to `PubSubService`.
- Mark progress (`backend_had_progress = true`) on every valid `PubSubItem` received in `handle_item`.
- `record_backend_death_and_check_budget()` helper: if the backend had progress, reset the streak to 0; otherwise increment it. If the streak reaches `max_retries`, return a terminal error.
- Budget check wired into every `reconnect_with_retries()` invocation in the `spawn` loop (channel close, error oneshot, and backend-gone dispatch paths).
- `consecutive_unhealthy_deaths_exhaust_budget`: service terminates after N deaths without progress.
- `healthy_backend_resets_death_counter`: a backend that produces traffic resets the counter.

`crates/provider/tests/it/ws.rs`:
- `ws_close_frame_exhausts_retry_budget`: local WS server accepts upgrade, reads one request, sends a close frame. Provider call returns `Err` within the timeout instead of looping.
- `ws_invalid_text_exhausts_retry_budget`: local WS server accepts upgrade, reads one request, echoes non-JSON-RPC text. Same bounded termination.

Non-goals
Per-request replay accounting (tracking how many times a specific in-flight request has been replayed across reconnects) is intentionally out of scope. In a mixed-traffic scenario, unrelated valid responses can still mask a starving individual request. That is a good follow-up but orthogonal to this fix.
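For reviewers who want the accounting in one place, here is a minimal standalone sketch of the streak logic and of the masking limitation noted above. It is not the code in this PR: `DeathBudget`, `BudgetExhausted`, and `mark_progress` are hypothetical names, the `u32` counter type is an assumption, and clearing the progress flag after each death is an assumption about how a fresh backend starts out. Only the field names and the helper name are borrowed from the description.

```rust
/// Hypothetical stand-in for the state this PR describes adding to
/// `PubSubService`; the real struct has many more fields.
#[derive(Debug)]
struct DeathBudget {
    max_retries: u32,
    consecutive_unhealthy_backend_deaths: u32,
    backend_had_progress: bool,
}

/// Hypothetical terminal error; the PR surfaces a transport error instead.
#[derive(Debug, PartialEq, Eq)]
struct BudgetExhausted {
    deaths: u32,
}

impl DeathBudget {
    fn new(max_retries: u32) -> Self {
        Self {
            max_retries,
            consecutive_unhealthy_backend_deaths: 0,
            backend_had_progress: false,
        }
    }

    /// Called whenever the current backend delivers a valid item.
    fn mark_progress(&mut self) {
        self.backend_had_progress = true;
    }

    /// Called when the current backend dies, before any reconnect attempt.
    fn record_backend_death_and_check_budget(&mut self) -> Result<(), BudgetExhausted> {
        if self.backend_had_progress {
            // The backend produced traffic before dying: reset the streak.
            self.consecutive_unhealthy_backend_deaths = 0;
            // Assumption: the next backend starts out unproven.
            self.backend_had_progress = false;
            return Ok(());
        }
        self.consecutive_unhealthy_backend_deaths += 1;
        if self.consecutive_unhealthy_backend_deaths >= self.max_retries {
            return Err(BudgetExhausted {
                deaths: self.consecutive_unhealthy_backend_deaths,
            });
        }
        Ok(())
    }
}

fn main() {
    let mut budget = DeathBudget::new(3);

    // Two silent deaths stay within the budget.
    assert!(budget.record_backend_death_and_check_budget().is_ok());
    assert!(budget.record_backend_death_and_check_budget().is_ok());

    // One valid item before the next death resets the streak.
    budget.mark_progress();
    assert!(budget.record_backend_death_and_check_budget().is_ok());

    // Three silent deaths in a row now exhaust the budget.
    assert!(budget.record_backend_death_and_check_budget().is_ok());
    assert!(budget.record_backend_death_and_check_budget().is_ok());
    assert_eq!(
        budget.record_backend_death_and_check_budget(),
        Err(BudgetExhausted { deaths: 3 })
    );

    // Masking (the non-goal): any valid item counts as progress, so a
    // backend that answers unrelated traffic never trips the budget,
    // even if one particular request is never answered.
    let mut mixed = DeathBudget::new(3);
    for _ in 0..10 {
        mixed.mark_progress();
        assert!(mixed.record_backend_death_and_check_budget().is_ok());
    }
}
```

In this sketch the budget is only checked after an increment, so a backend that made progress never terminates the service on its own death; whether the PR behaves the same way for `max_retries = 0` is not stated in the description.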
Test plan
- `cargo test -p alloy-pubsub` -- 3 tests pass (1 existing + 2 new)
- `cargo clippy -p alloy-pubsub -- -D warnings` -- clean
- `cargo test -p alloy-provider --test it --features ws -- ws::ws_close_frame_exhausts_retry_budget ws::ws_invalid_text_exhausts_retry_budget` -- 2 new integration tests pass
- `cargo test -p alloy-provider --test it --features ws -- ws::test_subscription_race_condition ws::ws_basic_auth_from_url` -- existing mock-server tests still pass