Commit fb370f8

Add fail-fast timeout and diagnostics to cargo hakari disable
cargo hakari disable in the docker image build has hung intermittently for 28-70 minutes (e.g. run 26514779485 attempt 1 sat on it for 28m45s before being cancelled). Crucially, cargo printed no output for the entire 28 min, so we do not yet know which phase actually stalled - it could be package-cache lock, libcurl/TLS init, DNS, TCP setup, sparse index download, or a git fetch. Without that, any "fix" is speculation.

This change is intentionally limited to two things, both about making the next hang useful:

* timeout-minutes: 5 - the step now fails fast at 5 min instead of occupying the 70-min job slot. Healthy cold-cache runs finish in ~2 min (worst non-stall observed ~3 min), so 5 min keeps a comfortable margin.
* verbose cargo logging plus a pre-kill watchdog snapshot - CARGO_TERM_VERBOSE, CARGO_HTTP_DEBUG, and a targeted CARGO_LOG give Rust-level and libcurl-level visibility; at 4m30s a background snapshot dumps the live process tree (with wait-channel) and open TCP sockets so a silent hang still leaves evidence of what cargo was actually doing.

Once a future hang reveals the stall point, the real fix can be targeted.

No release note: internal CI infrastructure only.
1 parent 1e8e1f3 commit fb370f8
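
The watchdog idea from the commit message can be tried locally with a small bash sketch. This is not the committed code: `run_with_watchdog` is a hypothetical helper name, the delay is a parameter rather than the fixed 270 seconds, and cleanup is done explicitly after the command instead of through an `EXIT` trap, so it is meant to be called from a script rather than pasted into the workflow.

```shell
#!/usr/bin/env bash
# Sketch only: run_with_watchdog is a hypothetical helper, not part of the commit.
# It starts a background "watchdog" that prints a snapshot if the wrapped
# command is still running after <delay> seconds, then runs the command.
run_with_watchdog() {
  local delay=$1
  shift
  (
    sleep "$delay"
    echo "watchdog: still running after ${delay}s"
    # Abbreviated snapshot; the workflow also prints wait channels and sockets.
    ps -eo pid,ppid,etime,stat,args 2>/dev/null | head -n 5
  ) &
  local watchdog=$!

  # Run the real command and remember its exit status.
  local rc=0
  "$@" || rc=$?

  # Stop the watchdog so a healthy run prints no snapshot.
  kill "$watchdog" 2>/dev/null || true
  wait "$watchdog" 2>/dev/null || true
  return "$rc"
}
```

For example, `run_with_watchdog 270 cargo hakari disable` would roughly mirror the step's behaviour, while `run_with_watchdog 1 sleep 2` forces the snapshot to print. The workflow itself uses a `trap ... EXIT` for cleanup, which also fires when the shell exits early on a failing command.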

1 file changed

Lines changed: 24 additions & 1 deletion

.github/workflows/docker.yml
@@ -98,9 +98,32 @@ jobs:
         with:
           tool: cargo-hakari
 
+      # This step has hung intermittently in CI for 28-70 min with no cargo output,
+      # so the stalled phase is unknown. The timeout caps how long a future stall
+      # can hold the runner; the verbose env and pre-kill snapshot exist so the
+      # next hang leaves enough trace to target a real fix.
       - name: Disable hakari
         if: ${{ hashFiles('.config/hakari.toml') != '' }}
-        run: cargo hakari disable
+        timeout-minutes: 5
+        env:
+          CARGO_TERM_VERBOSE: "true"
+          CARGO_HTTP_DEBUG: "true"
+          CARGO_LOG: "cargo::core::package=info,cargo::ops::registry=info,cargo::sources::registry=info,cargo::sources::git=info,cargo::util::network=info"
+        run: |
+          (
+            sleep 270
+            echo "::group::diagnostic snapshot at 4m30s (cargo still running)"
+            echo "--- process tree (pid/ppid/elapsed/state/wait-channel/cmd) ---"
+            ps -eo pid,ppid,etime,stat,wchan:25,args 2>/dev/null \
+              | awk 'NR==1 || /cargo|hakari|[g]it/'
+            echo "--- open TCP sockets for cargo/git ---"
+            (ss -tnp 2>/dev/null || netstat -tnp 2>/dev/null) \
+              | awk 'NR<=2 || /cargo|hakari|[g]it/'
+            echo "::endgroup::"
+          ) &
+          watchdog=$!
+          trap 'kill "$watchdog" 2>/dev/null; wait "$watchdog" 2>/dev/null || true' EXIT
+          cargo hakari disable
 
       # this is needed to be able to load and push a multiplatform image in one step
       - name: Set up Docker containerd snapshotter
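
To see what the process-tree filter in the snapshot keeps, the same `awk` expression can be run over canned input. The sketch below is illustrative only: `filter_snapshot` is a hypothetical wrapper name, and the sample rows (PIDs, states, wait channels, commands) are invented for the demonstration, not taken from a real run.

```shell
#!/usr/bin/env bash
# Same awk expression as in the workflow step: keep the header row (NR==1)
# plus any row mentioning cargo, hakari, or git.
filter_snapshot() {
  awk 'NR==1 || /cargo|hakari|[g]it/'
}

# Invented sample in the shape of `ps -eo pid,ppid,etime,stat,wchan:25,args`.
sample='  PID  PPID ELAPSED STAT WCHAN      COMMAND
  101     1   04:30 Sl   futex_wait cargo hakari disable
  102   101   04:29 S    do_poll    git fetch origin
  200     1   10:00 Ssl  ep_poll    dockerd'

filtered=$(printf '%s\n' "$sample" | filter_snapshot)
printf '%s\n' "$filtered"
```

With this input the header and the two cargo/git rows are kept and the `dockerd` row is dropped. In a real snapshot the `awk` process's own command line contains the literal text `cargo`, so it may show up as an extra row.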
