Add fail-fast timeout and diagnostics to cargo hakari disable by pcholakov · Pull Request #4822 · restatedev/restate

pcholakov · 2026-05-27T15:33:59Z

What this is (and isn't)

This PR is not a fix. cargo hakari disable in the docker image build has hung intermittently for 28-70 minutes (e.g. run 26514779485 attempt 1 sat on it for 28m45s before being cancelled). After pulling the cancelled step's log, the picture got more interesting:

13:40:45.575  ##[endgroup]   (after "Run cargo hakari disable" group header)
                              ... 28 minutes 45 seconds of complete silence ...
14:09:30.388  ##[error]The operation was canceled.
14:09:30.562  Terminate orphan process: pid (1994) (cargo-hakari)
14:09:30.572  Terminate orphan process: pid (2003) (cargo)
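
As a quick check of the "28m45s" figure, the gap between the `##[endgroup]` line and the cancellation line can be computed from the two timestamps. This is a minimal sketch in plain shell; the helper names `to_seconds` and `elapsed` are hypothetical, and milliseconds are dropped, so the result is rounded to whole seconds.

```shell
# Convert a two-digit HH:MM:SS timestamp to seconds since midnight.
to_seconds() {
  local h m s
  IFS=: read -r h m s <<EOF
$1
EOF
  # Strip one leading zero so fields like "09" are not parsed as octal.
  echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# Print the gap between two same-day timestamps as <minutes>m<seconds>s.
elapsed() {
  local d=$(( $(to_seconds "$2") - $(to_seconds "$1") ))
  echo "$(( d / 60 ))m$(( d % 60 ))s"
}

elapsed 13:40:45 14:09:30   # prints 28m45s
```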

cargo printed nothing for the entire 28 min. cargo's first default headers are "Updating crates.io index" and "Updating git repository <url>"; we see neither. So the hang is before the index download and before any git fetch starts. Possible silent-phase culprits: package-cache flock, libcurl/TLS init, DNS, TCP connection setup, or some pre-fetch resolver init. We do not yet have evidence to discriminate. Any "fix" right now would be guessing.

The original version of this PR added a bounded retry loop on the theory that the stall was in a git fetch. The evidence above doesn't support that theory, so the bound and retry have been dropped. The bound was speculation dressed up as a fix.

What this PR actually does

Two narrow changes, both about making the next hang useful rather than pretending to fix it:

- timeout-minutes: 5 so the next stall fails the step in 5 min instead of taking down the 70-min job slot until a human notices. Healthy cold-cache runs finish in ~2 min and the worst non-stall observed in the sample was ~3 min, so 5 min keeps a comfortable margin.
- Verbose cargo env + a pre-kill watchdog snapshot so the next stall actually leaves a paper trail:
  - CARGO_TERM_VERBOSE, CARGO_HTTP_DEBUG, and a targeted CARGO_LOG cover Rust-level resolver/source ops and the underlying HTTP/TLS transport. Adds ~1k log lines on a healthy 2-min run; acceptable trade.
  - At 4m30s a background watchdog dumps the live process tree (with wait-channel) and open TCP sockets, so even a stall that occurs before cargo logs anything still tells us which syscall it was wedged in and which connections it was holding.
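
The two changes above could be sketched in shell roughly as follows. This is not the PR's actual script, which is not shown in this thread: the helper names `snapshot` and `run_with_watchdog` are hypothetical, the `CARGO_LOG` filter value is an illustrative placeholder (the description only says it is "targeted"), and the `ps`/`ss` flags assume a Linux runner with procps and iproute2.

```shell
# Diagnostic environment named in the PR description.
export CARGO_TERM_VERBOSE=true
export CARGO_HTTP_DEBUG=true
# Assumption: this filter value is a placeholder, not the PR's actual setting.
export CARGO_LOG="${CARGO_LOG:-cargo::core::resolver=debug,network=debug}"

# Print a snapshot: process tree with wait channel, then open TCP sockets.
snapshot() {
  echo "=== watchdog snapshot ==="
  # wchan is the kernel function a sleeping process is blocked in (procps).
  ps -eo pid,ppid,stat,wchan:32,etime,args --forest 2>/dev/null || ps -ef || true
  # TCP sockets with owning processes, if ss is installed.
  if command -v ss >/dev/null 2>&1; then
    ss -tanp 2>/dev/null || true
  fi
}

# run_with_watchdog <delay-seconds> <command> [args...]
# Runs the command; if it is still running after the delay, prints a snapshot.
run_with_watchdog() {
  local delay="$1"
  shift
  (
    sleep_pid=
    # On TERM, stop the pending sleep so no orphan is left behind.
    trap '[ -n "$sleep_pid" ] && kill "$sleep_pid" 2>/dev/null; exit 0' TERM
    sleep "$delay" &
    sleep_pid=$!
    wait "$sleep_pid" && snapshot
  ) &
  local watchdog_pid=$!
  local rc=0
  "$@" || rc=$?
  # The command finished (or failed) on its own: cancel the watchdog.
  kill "$watchdog_pid" 2>/dev/null || true
  wait "$watchdog_pid" 2>/dev/null || true
  return "$rc"
}

# Intended use in the CI step (commented out so the sketch runs without cargo):
# run_with_watchdog 270 cargo hakari disable
```

With a delay of 270 s, the snapshot fires at the 4m30s mark, 30 s before the 5-minute step timeout kills the step, so the output lands in the step log while the stalled processes are still alive.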

Once a future hang reveals the actual stall point, the targeted fix follows.

Notes

- The same unbounded cargo hakari disable exists in .github/workflows/steps/release-build-setup.yml (also cold-cache, release-critical). Left for a follow-up once we know the cause.
- No release note: internal CI infrastructure only, no user-facing behaviour change.

AhmedSoliman

If the problem is fetching the crates.io index, wouldn't that happen in any of the subsequent steps that run cargo? I'm not sold on the solution path here.

pcholakov · 2026-06-01T09:08:46Z

> If the problem is fetching the crates.io index, wouldn't that happen in any of the subsequent steps that run cargo? I'm not sold on the solution path here.

I suspected the difference could come from running in the bare GH action environment vs. inside Docker build, but it's speculative for sure.

I've updated the PR to just:

- add a bit more diagnostic info
- cap the hakari disable step with a native GH timeout so the workflow doesn't hang forever

LMK if you can think of some other improvements!

cargo hakari disable in the docker image build has hung intermittently for 28-70 minutes (e.g. run 26514779485 attempt 1 sat on it for 28m45s before being cancelled). Crucially, cargo printed no output for the entire 28 min, so we do not yet know which phase actually stalled - it could be package-cache lock, libcurl/TLS init, DNS, TCP setup, sparse index download, or a git fetch. Without that, any "fix" is speculation.

This change is intentionally limited to two things, both about making the next hang useful:

* timeout-minutes: 5 - the step now fails fast at 5 min instead of occupying the 70-min job slot. Healthy cold-cache runs finish in ~2 min (worst non-stall observed ~3 min), so 5 min keeps a comfortable margin.
* verbose cargo logging plus a pre-kill watchdog snapshot - CARGO_TERM_VERBOSE, CARGO_HTTP_DEBUG, and a targeted CARGO_LOG give Rust-level and libcurl-level visibility; at 4m30s a background snapshot dumps the live process tree (with wait-channel) and open TCP sockets so a silent hang still leaves evidence of what cargo was actually doing.

Once a future hang reveals the stall point, the real fix can be targeted.

No release note: internal CI infrastructure only.

github-actions · 2026-06-01T09:39:22Z

Test Results

8 files ±0 8 suites ±0 5m 0s ⏱️ +13s
60 tests ±0 60 ✅ ±0 0 💤 ±0 0 ❌ ±0
267 runs ±0 267 ✅ ±0 0 💤 ±0 0 ❌ ±0

Results for commit fb370f8. ± Comparison against base commit c23f244.

♻️ This comment has been updated with latest results.

AhmedSoliman · 2026-06-01T13:17:23Z

I think we run clippy and tests outside docker as well.

pcholakov requested a review from tillrohrmann May 27, 2026 15:34

AhmedSoliman requested changes May 29, 2026


pcholakov force-pushed the pavel/bound-hakari-disable branch from 4e5e456 to 771a982 June 1, 2026 09:06

pcholakov changed the title ~~Bound and retry cargo hakari disable in the docker image build~~ Add fail-fast timeout and diagnostics to cargo hakari disable Jun 1, 2026

pcholakov removed the request for review from tillrohrmann June 1, 2026 09:08

pcholakov force-pushed the pavel/bound-hakari-disable branch from 771a982 to fb370f8 June 1, 2026 09:10

pcholakov requested a review from AhmedSoliman June 1, 2026 09:13

pcholakov wants to merge 1 commit into main from pavel/bound-hakari-disable
