Add fail-fast timeout and diagnostics to cargo hakari disable#4822
Open
pcholakov wants to merge 1 commit into
Open
Add fail-fast timeout and diagnostics to cargo hakari disable#4822pcholakov wants to merge 1 commit into
pcholakov wants to merge 1 commit into
Conversation
AhmedSoliman
requested changes
May 29, 2026
Member
AhmedSoliman
left a comment
There was a problem hiding this comment.
If the problem is fetching the crate.io index. Wouldn't that happen in any of the subsequent steps that will run cargo? I'm not sold on the solution path here.
4e5e456 to
771a982
Compare
Contributor
Author
I suspected the difference could come from running in the bare GH action environment vs. inside Docker build, but it's speculative for sure. I've updated the PR to just:
LMK if you can think of some other improvements! |
cargo hakari disable in the docker image build has hung intermittently for 28-70 minutes (e.g. run 26514779485 attempt 1 sat on it for 28m45s before being cancelled). Crucially, cargo printed no output for the entire 28 min, so we do not yet know which phase actually stalled - it could be package-cache lock, libcurl/TLS init, DNS, TCP setup, sparse index download, or a git fetch. Without that, any "fix" is speculation. This change is intentionally limited to two things, both about making the next hang useful: * timeout-minutes: 5 - the step now fails fast at 5 min instead of occupying the 70-min job slot. Healthy cold-cache runs finish in ~2 min (worst non-stall observed ~3 min), so 5 min keeps a comfortable margin. * verbose cargo logging plus a pre-kill watchdog snapshot - CARGO_TERM_VERBOSE, CARGO_HTTP_DEBUG, and a targeted CARGO_LOG give Rust-level and libcurl-level visibility; at 4m30s a background snapshot dumps the live process tree (with wait-channel) and open TCP sockets so a silent hang still leaves evidence of what cargo was actually doing. Once a future hang reveals the stall point, the real fix can be targeted. No release note: internal CI infrastructure only.
771a982 to
fb370f8
Compare
Member
|
I think we run clippy and tests outside docker as well. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this is (and isn't)
This PR is not a fix.
cargo hakari disablein the docker image build has hung intermittently for 28-70 minutes (e.g. run 26514779485 attempt 1 sat on it for 28m45s before being cancelled). After pulling the cancelled step's log, the picture got more interesting:cargo printed nothing for the entire 28 min. cargo's first default headers are "Updating crates.io index" and "Updating git repository
<url>"; we see neither. So the hang is before the index download and before any git fetch starts. Possible silent-phase culprits: package-cache flock, libcurl/TLS init, DNS, TCP connection setup, or some pre-fetch resolver init. We do not yet have evidence to discriminate. Any "fix" right now would be guessing.The original version of this PR added a bounded retry loop on the theory that the stall was in a git fetch. The evidence above doesn't support that theory, so the bound and retry have been dropped. The bound was speculation dressed up as a fix.
What this PR actually does
Two narrow changes, both about making the next hang useful rather than pretending to fix it:
timeout-minutes: 5so the next stall fails the step in 5 min instead of taking down the 70-min job slot until a human notices. Healthy cold-cache runs finish in ~2 min and the worst non-stall observed in the sample was ~3 min, so 5 min keeps a comfortable margin.Verbose cargo env + a pre-kill watchdog snapshot so the next stall actually leaves a paper trail:
CARGO_TERM_VERBOSE,CARGO_HTTP_DEBUG, and a targetedCARGO_LOGcover Rust-level resolver/source ops and the underlying HTTP/TLS transport. Adds ~1k log lines on a healthy 2-min run; acceptable trade.Once a future hang reveals the actual stall point, the targeted fix follows.
Notes
cargo hakari disableexists in.github/workflows/steps/release-build-setup.yml(also cold-cache, release-critical). Left for a follow-up once we know the cause.