Reduce flakiness of cluster_chaos_test (#4878) #4882

Open
tillrohrmann wants to merge 1 commit into restatedev:main from tillrohrmann:issues/4878

@tillrohrmann
Contributor

The CI failure is most plausibly explained by recovery after a single node restart taking longer than the previous 10s `expected_recovery_interval`: gossip's `Suspect -> Alive` transitions (5s default) plus cluster-controller leader thrashing can push it past that bound. This is a hypothesis based on the logs; we have not reproduced it locally. The changes here aim to make the test resilient to this and a couple of other timing footguns:

  1. Override `gossip_suspect_interval` to 1s in the test config and raise the budgets to 60s / 30s so a normal full-stack recovery has comfortable headroom.
  2. Pre-build one reqwest client per ingress with a 5s timeout. The client was previously rebuilt every iteration with no timeout, so a request hung against a node mid-shutdown could stall the loop.
  3. Race the request against `&mut chaos_handle` in a biased `tokio::select!`. Previously, if the chaos task errored out, the main loop kept spinning for the rest of `chaos_duration` and panicked on `successful writes: 0`, hiding the actual error message. Now the chaos error surfaces immediately.

No production behavior changes.

This fixes #4878.

@tillrohrmann tillrohrmann requested a review from muhamadazmy June 3, 2026 16:30

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

restate-server::cluster::cluster_chaos_test failed on CI
