Reduce flakiness of cluster_chaos_test (#4878) by tillrohrmann · Pull Request #4882 · restatedev/restate

tillrohrmann · 2026-06-03T16:30:42Z

The CI failure is most plausibly explained by recovery after a single node restart taking longer than the previous 10s `expected_recovery_interval`: gossip's `Suspect -> Alive` transitions (5s default) plus cluster-controller leader thrashing can push it past that bound. This is a hypothesis based on the logs; we have not reproduced it locally. The changes here aim to make the test resilient to this and a couple of other timing footguns:

1. Override `gossip_suspect_interval` to 1s in the test config and raise the budgets to 60s / 30s so a normal full-stack recovery has comfortable headroom.
2. Pre-build one reqwest client per ingress with a 5s timeout. The client was previously rebuilt every iteration with no timeout, so a request hung against a node mid-shutdown could stall the loop.
3. Race the request against `&mut chaos_handle` in a biased `tokio::select!`. Previously, if the chaos task errored out, the main loop kept spinning for the rest of `chaos_duration` and panicked on `successful writes: 0`, hiding the actual error message. Now the chaos error surfaces immediately.
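
For readers who want the control flow of the last change spelled out, here is a minimal sketch. It is not the test's code: tokio and reqwest are left out so that it runs with the standard library alone, a thread plus an `mpsc` channel stands in for the chaos task handle, and checking the channel before every request stands in for the `biased` ordering of `tokio::select!`. The names `send_request` and `run_loop` are hypothetical, and stopping the loop when the chaos task finishes cleanly is an assumption.

```rust
use std::sync::mpsc::{self, Receiver, TryRecvError};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for one ingress request; it always succeeds here.
fn send_request() -> Result<(), String> {
    thread::sleep(Duration::from_millis(5));
    Ok(())
}

/// Issues requests for at most `chaos_duration`, but looks at the chaos
/// task's result channel first on every iteration (the `biased` analogue).
fn run_loop(
    chaos_done: &Receiver<Result<(), String>>,
    chaos_duration: Duration,
) -> Result<u32, String> {
    let deadline = Instant::now() + chaos_duration;
    let mut successful_writes = 0;
    while Instant::now() < deadline {
        match chaos_done.try_recv() {
            // The chaos task failed: surface its error instead of looping on.
            Ok(Err(err)) => return Err(format!("chaos task failed: {err}")),
            // The chaos task finished cleanly: stop issuing requests (assumption).
            Ok(Ok(())) => break,
            // The chaos task is still running: go on to the request.
            Err(TryRecvError::Empty) => {}
            // The sender was dropped without reporting, e.g. after a panic.
            Err(TryRecvError::Disconnected) => {
                return Err("chaos task vanished".to_string());
            }
        }
        if send_request().is_ok() {
            successful_writes += 1;
        }
    }
    Ok(successful_writes)
}

fn main() {
    let (tx, rx) = mpsc::channel();
    // Hypothetical chaos task that errors out shortly after starting.
    let chaos = thread::spawn(move || {
        thread::sleep(Duration::from_millis(20));
        let _ = tx.send(Err("node restart failed".to_string()));
    });
    let started = Instant::now();
    let outcome = run_loop(&rx, Duration::from_secs(5));
    chaos.join().unwrap();
    println!("{outcome:?} after {:?}", started.elapsed());
}
```

Without the check at the top of the loop, the same program would keep issuing requests until the 5s deadline and only then report a write count, which is the failure mode described in the list above.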

No production behavior changes.

This fixes #4878.

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

tillrohrmann requested a review from muhamadazmy June 3, 2026 16:30

claude Bot reviewed Jun 3, 2026

tillrohrmann force-pushed the issues/4878 branch from 370ae4f to d52a76a Compare June 3, 2026 16:34

tillrohrmann wants to merge 1 commit into restatedev:main from tillrohrmann:issues/4878
