Add fail-fast timeout and diagnostics to cargo hakari disable #4822

Open: pcholakov wants to merge 1 commit into main from pavel/bound-hakari-disable

@pcholakov pcholakov commented May 27, 2026

What this is (and isn't)

This PR is not a fix. cargo hakari disable in the docker image build has hung intermittently for 28-70 minutes (e.g. run 26514779485 attempt 1 sat on it for 28m45s before being cancelled). After pulling the cancelled step's log, the picture got more interesting:

13:40:45.575  ##[endgroup]   (after "Run cargo hakari disable" group header)
                              ... 28 minutes 45 seconds of complete silence ...
14:09:30.388  ##[error]The operation was canceled.
14:09:30.562  Terminate orphan process: pid (1994) (cargo-hakari)
14:09:30.572  Terminate orphan process: pid (2003) (cargo)

cargo printed nothing for the entire 28 min. cargo's first default headers are "Updating crates.io index" and "Updating git repository <url>"; we see neither. So the hang is before the index download and before any git fetch starts. Possible silent-phase culprits: package-cache flock, libcurl/TLS init, DNS, TCP connection setup, or some pre-fetch resolver init. We do not yet have evidence to discriminate. Any "fix" right now would be guessing.

The original version of this PR added a bounded retry loop on the theory that the stall was in a git fetch. The evidence above doesn't support that theory, so the bound and retry have been dropped. The bound was speculation dressed up as a fix.

What this PR actually does

Two narrow changes, both about making the next hang useful rather than pretending to fix it:

  1. timeout-minutes: 5 so the next stall fails the step in 5 min instead of taking down the 70-min job slot until a human notices. Healthy cold-cache runs finish in ~2 min and the worst non-stall observed in the sample was ~3 min, so 5 min keeps a comfortable margin.

  2. Verbose cargo env + a pre-kill watchdog snapshot so the next stall actually leaves a paper trail:

    • CARGO_TERM_VERBOSE, CARGO_HTTP_DEBUG, and a targeted CARGO_LOG cover Rust-level resolver/source ops and the underlying HTTP/TLS transport. Adds ~1k log lines on a healthy 2-min run; acceptable trade.
    • At 4m30s a background watchdog dumps the live process tree (with wait-channel) and open TCP sockets, so even a stall that occurs before cargo logs anything still tells us which syscall it was wedged in and which connections it was holding.
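
For readers who want a concrete picture of item 2, here is a minimal shell sketch of the two pieces. It is not the code from this PR's diff: the function names (`enable_cargo_diagnostics`, `dump_snapshot`, `run_with_watchdog`) are hypothetical, and the `CARGO_LOG` filter is an assumed value, since the description only says "targeted". `ps` and `ss` are the usual procps and iproute2 tools, and the sketch falls back to a placeholder line if they are missing.

```shell
# Sketch only: names and the CARGO_LOG filter are assumptions, not the PR's code.

# Turn on verbose cargo output plus HTTP transport logging.
enable_cargo_diagnostics() {
  export CARGO_TERM_VERBOSE=true
  export CARGO_HTTP_DEBUG=true
  # Assumed filter; the PR's actual "targeted" value is not shown in the description.
  export CARGO_LOG=network=debug
}

# Print the process tree (with kernel wait channel) and open TCP sockets.
dump_snapshot() {
  echo "=== watchdog snapshot: $(date -u +%H:%M:%S) ==="
  # wchan shows the kernel function a sleeping process is blocked in.
  ps -eo pid,ppid,stat,wchan:32,etime,args --forest 2>/dev/null \
    || ps -e 2>/dev/null \
    || echo "ps not available"
  # -t TCP, -a all states, -n numeric addresses, -p owning process.
  if command -v ss >/dev/null 2>&1; then
    ss -tanp 2>/dev/null || echo "ss failed"
  else
    echo "ss not available"
  fi
}

# Run a command; if it is still running after $1 seconds, dump a snapshot.
run_with_watchdog() {
  local delay="$1"
  shift
  (
    # Sleep in 1 s slices so a cancelled watchdog leaves no long-lived sleep behind.
    remaining="$delay"
    while [ "$remaining" -gt 0 ]; do
      sleep 1
      remaining=$((remaining - 1))
    done
    dump_snapshot
  ) &
  local watchdog_pid=$!
  local rc=0
  "$@" || rc=$?
  # The command finished: cancel the watchdog if it has not fired yet.
  kill "$watchdog_pid" 2>/dev/null || true
  wait "$watchdog_pid" 2>/dev/null || true
  return "$rc"
}

# Usage inside the step's run script (270 s = 4m30s, ahead of the 5 min timeout):
#   enable_cargo_diagnostics
#   run_with_watchdog 270 cargo hakari disable
```

The snapshot fires only if the command is still running when the delay expires, so healthy runs add nothing beyond the verbose cargo output, and the command's own exit status is passed through unchanged.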

Once a future hang reveals the actual stall point, the targeted fix follows.

Notes

  • The same unbounded cargo hakari disable exists in .github/workflows/steps/release-build-setup.yml (also cold-cache, release-critical). Left for a follow-up once we know the cause.
  • No release note: internal CI infrastructure only, no user-facing behaviour change.

@pcholakov pcholakov requested a review from tillrohrmann May 27, 2026 15:34
@AhmedSoliman AhmedSoliman left a comment


If the problem is fetching the crates.io index, wouldn't that happen in any of the subsequent steps that run cargo? I'm not sold on the solution path here.

@pcholakov pcholakov force-pushed the pavel/bound-hakari-disable branch from 4e5e456 to 771a982 June 1, 2026 09:06
@pcholakov pcholakov changed the title from "Bound and retry cargo hakari disable in the docker image build" to "Add fail-fast timeout and diagnostics to cargo hakari disable" Jun 1, 2026
@pcholakov pcholakov commented

> If the problem is fetching the crates.io index, wouldn't that happen in any of the subsequent steps that run cargo? I'm not sold on the solution path here.

I suspected the difference could come from running in the bare GH Actions environment vs. inside a Docker build, but that's speculative for sure.

I've updated the PR to just:

  • add a bit more diagnostic info
  • cap the hakari disable step with a native GH timeout so the workflow doesn't hang forever

LMK if you can think of some other improvements!

@pcholakov pcholakov removed the request for review from tillrohrmann June 1, 2026 09:08
cargo hakari disable in the docker image build has hung intermittently for 28-70 minutes (e.g. run 26514779485 attempt 1 sat on it for 28m45s before being cancelled). Crucially, cargo printed no output for the entire 28 min, so we do not yet know which phase actually stalled - it could be package-cache lock, libcurl/TLS init, DNS, TCP setup, sparse index download, or a git fetch. Without that, any "fix" is speculation.

This change is intentionally limited to two things, both about making the next hang useful:

* timeout-minutes: 5 - the step now fails fast at 5 min instead of occupying the 70-min job slot. Healthy cold-cache runs finish in ~2 min (worst non-stall observed ~3 min), so 5 min keeps a comfortable margin.

* verbose cargo logging plus a pre-kill watchdog snapshot - CARGO_TERM_VERBOSE, CARGO_HTTP_DEBUG, and a targeted CARGO_LOG give Rust-level and libcurl-level visibility; at 4m30s a background snapshot dumps the live process tree (with wait-channel) and open TCP sockets so a silent hang still leaves evidence of what cargo was actually doing.

Once a future hang reveals the stall point, the real fix can be targeted. No release note: internal CI infrastructure only.
@pcholakov pcholakov force-pushed the pavel/bound-hakari-disable branch from 771a982 to fb370f8 June 1, 2026 09:10
@pcholakov pcholakov requested a review from AhmedSoliman June 1, 2026 09:13
github-actions Bot commented Jun 1, 2026

Test Results

  8 files  ±0    8 suites  ±0   5m 0s ⏱️ +13s
 60 tests ±0   60 ✅ ±0  0 💤 ±0  0 ❌ ±0 
267 runs  ±0  267 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit fb370f8. ± Comparison against base commit c23f244.

♻️ This comment has been updated with latest results.

@AhmedSoliman AhmedSoliman commented

I think we run clippy and tests outside docker as well.
