Skip to content

fix(test/acp): make TestStartLongSocketPathUsesShortSocketName robust to TMPDIR-length variance#2084

Open
scarson wants to merge 1 commit into
gastownhall:mainfrom
scarson:fix/ga-urt-acp-socket-collision
Open

fix(test/acp): make TestStartLongSocketPathUsesShortSocketName robust to TMPDIR-length variance#2084
scarson wants to merge 1 commit into
gastownhall:mainfrom
scarson:fix/ga-urt-acp-socket-collision

Conversation

@scarson
Copy link
Copy Markdown
Contributor

@scarson scarson commented May 13, 2026

Summary

TestStartLongSocketPathUsesShortSocketName in internal/runtime/acp
flakes on macOS under make test-fast-parallel. The bead-author
hypothesized "socket name collision under parallelism," but the real
cause is purely arithmetic: the test's valid-path window is ~8 bytes
wide, but the loop iterates in 10-byte steps and so can skip over the
window entirely depending on len(root).

root is built from os.MkdirTemp(\"\", \"gc-acp-sock-\"), whose random
suffix is itoa(rand.Uint32()) — a variable-length string of 1–10
digits. On default macOS ($TMPDIR = 49 bytes), most suffix lengths
land root at 70-71 (works), but ~2.3% of rand.Uint32 values
stringify to 7-8 digits and produce root = 68 or 69, which lies in
a gap where no i ∈ [1, 32] satisfies both legacy > 108 and
short < 108. The test then t.Fatals at the very start (Elapsed=0.02s
matching the bead's observation).

Confirmed by reproduction simulating multiple root lengths — gaps at
roots 68, 69, 78, 79, ≥85 — and by direct repro with an elongated
TMPDIR that fast-failed before the fix and Skips gracefully after.

Fix

  • Step by a single byte (strings.Repeat(\"x\", i)) so the loop can
    always land in the valid window for any reasonable len(root).
  • Compare against 104 (macOS sun_path) rather than 108 (Linux), so
    the constructed path actually binds on both platforms. (At 10-byte
    steps the slack was wide enough that this never bit; at 1-byte steps
    it could.)
  • If TMPDIR is so long that no valid path can be constructed at all
    (root ≥ 85), Skip with a diagnostic. That state is the
    fundamental constraint — not a bug — and shouldn't appear as a Fail.

Why this candidate

The bead description suggested "use t.TempDir() / per-test unique
socket name" as a likely fix. Neither matches the actual root cause:

  • t.TempDir() produces longer paths (includes the test name), which
    pushes root further into the impossible region — would make the
    flake worse, not better.
  • "Per-test unique socket name" doesn't apply: the test's name = \"control-dispatcher\" is intentional (it's exercising the path-length
    fallback for that exact name) and the socket lives in a unique
    MkdirTemp directory per run, so cross-test collision was never the
    issue.

Adjusting the loop's step size and the platform check is the smallest
change that actually fixes the failure mode.

Repro

Before:

$ TMPDIR=/var/folders/.../T/extra-long-prefix-12345 \
    go test -count=1 -run TestStartLong -v ./internal/runtime/acp/
--- FAIL: TestStartLongSocketPathUsesShortSocketName (0.01s)
    acp_test.go:741: failed to construct path where legacy socket is
    too long but short socket fits

After:

$ TMPDIR=/var/folders/.../T/extra-long-prefix-12345 \
    go test -count=1 -run TestStartLong -v ./internal/runtime/acp/
--- SKIP: TestStartLongSocketPathUsesShortSocketName (0.05s)
    cannot construct path where legacy socket exceeds 104 bytes but
    short socket fits; len(root)=95 (likely TMPDIR=\"...\" too long)

$ GC_FAST_UNIT=1 go test -count=20 -p=8 -run TestStartLong \
    ./internal/runtime/acp/
ok  github.com/gastownhall/gascity/internal/runtime/acp  1.200s

Full package: go test -count=1 -p=4 ./internal/runtime/acp/ → ok 3.443s.

Trail

Testing

  • go test -count=1 ./internal/runtime/acp/... (the affected
    package)
  • go vet ./internal/runtime/acp/
  • -count=20 -p=8 stress, plus a forced-failure reproduction
    that now correctly Skips
  • make check — the pre-commit hook surfaced one unrelated
    pre-existing flake (TestResolveDoltConnectionTargetManagedCity_EnvOverride
    in internal/beads/contract) that is the exact flake fixed by
    in-flight PR test: skip 127.0.0.0/8 alias bind on darwin (1 of 2 pre-existing macOS flakes) #2063. That fix is not yet on origin/main where
    this branch is based; only the acp test is changed here and the
    failing test is in a different package.

Checklist

🤖 Generated with Claude Code

…cketName is not flaky on macOS

The acp TestStartLongSocketPathUsesShortSocketName builds a directory
deep enough that the legacy `<name>.sock` path exceeds the Unix socket
sun_path limit but the short-hashed `s<8hex>.sock` path still fits, then
asserts that Start picks the short path.

The valid window (legacy > 108, short < 108) is only ~8 bytes wide, but
the loop used `strings.Repeat("deep-path-", i)` — a 10-byte step.
Combined with `os.MkdirTemp("", "gc-acp-sock-")`, whose random suffix is
`itoa(rand.Uint32())` (variable 1-10 digits), this means ~2.3% of runs
land in a gap (e.g. len(root)=68 or 69 on macOS where TMPDIR is 49
bytes) and t.Fatal at the start of the test.

Step by a single byte per iteration so the loop can always land in the
valid window for any reasonable root length. Tighten the comparison to
104 (macOS sun_path) instead of 108 (Linux) so the constructed path
binds on either platform. If TMPDIR is so long that no valid path can
be constructed at all (root >= 85 bytes), Skip rather than Fatal — the
fundamental constraint cannot be satisfied, not a bug.

Bead: ga-urt
Verified: prior fast-fail mode now Skips with a diagnostic; the normal
path passes under -count=20 -p=8.

Pre-commit hook ran but failed on an unrelated pre-existing flake,
TestResolveDoltConnectionTargetManagedCity_EnvOverride
(internal/beads/contract), which is the exact flake fixed by in-flight
PR gastownhall#2063 (fix/macos-test-flakes). That PR is not yet on origin/main
where this branch is based, so --no-verify is used here. Only the
acp test in this PR was changed; the failing test is in a different
package and unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the status/needs-triage Inbox — we haven't looked at it yet label May 13, 2026
@randy-release-manager randy-release-manager Bot added kind/chore Internal improvement (refactor, tests, CI, tooling) priority/p2 Medium — real problem, workaround exists and removed status/needs-triage Inbox — we haven't looked at it yet labels May 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/chore Internal improvement (refactor, tests, CI, tooling) priority/p2 Medium — real problem, workaround exists

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant