Summary
internal/cbus.TestReader_sendMessageOverTheWire intermittently hangs and only ends when Go's
test timeout fires (panic: test timed out). On the "Go Platform compatibility" workflow this is
a red test (21, windows-latest) job, while ubuntu-latest and macos-latest pass in the same
run. It is flaky: it passes when the subtest runs in isolation, and develop's CI is usually green.
This is a pre-existing infrastructure flake, not tied to any PR — it reproduces on a clean,
current origin/develop (verified below). Any PR touching code-generation/**, protocols/**,
or plc4go** triggers this workflow and can be blocked by it.
Reproduction (Windows, Go 1.26.4)
On a clean origin/develop (verified @ 0d5b7164ed):
cd plc4go
go test ./internal/cbus/ -run "TestReader_sendMessageOverTheWire" -count=5 -v -timeout 180s
Reproduced within ~3 min (panic: test timed out after 3m0s, FAIL ... 182.7s).
Notes:
- The single subtest in isolation passes in ~2.00s:
go test ./internal/cbus/ -run 'TestReader_sendMessageOverTheWire/Send_message_which_responds_with_message_to_server$'
- ubuntu/macos do not reproduce it; it appears Windows-specific (timing/scheduler sensitive).
Where it blocks
The test goroutine is parked at the response-wait select in Reader_test.go:1164, after
sendMessageOverTheWire has already returned ("Waiting now" is logged at Reader_test.go:1163).
That select waits for ch, which is closed only inside the mock transaction.FailRequest(...)
(Reader_test.go:527-529). FailRequest is reached only from sendMessageOverTheWire's
handleError callback (Reader.go), which fires when the registered expectation expires (2s TTL).
Root cause area (evidence from a full goroutine dump at the hang)
The expectation timeout never fires. Scoped to the Go runtime stack dump (after the panic line):
- 0
spi/default.(*defaultCodec).ExpireWork goroutines are alive — nothing processes
expectation expiry for the affected codec — while a ReceiveWork goroutine is present.
- Over the run,
Added expectation logs vastly outnumber timeout expectation logs, and in the
hanging iteration the expired expectation produces no timeout expectation at all.
- Not a mutex deadlock (0 goroutines waiting on
expectationsChangeMutex); all other goroutines
are parked in select.
So the affected area is the ExpireWork expectation-timeout worker lifecycle in
plc4go/spi/default/DefaultCodec.go: an expectation can end up registered while no ExpireWork
goroutine is running to time it out → FailRequest is never called → the cbus test blocks
forever. Notably startExpire() is invoked in the hanging iteration ("starting expire worker"
is logged) yet no ExpireWork goroutine is in the dump — so the worker is started but then
absent.
Goroutine dump excerpt (Windows, develop @ 0d5b716, panic: test timed out after 3m0s)
# Test goroutine — parked on the response-wait select, FailRequest never came:
goroutine 1092 [select, 2 minutes]:
...cbus.TestReader_sendMessageOverTheWire.func28
internal/cbus/Reader_test.go:1164 # select { <-t.Context().Done(); <-ch }
# A ReceiveWork goroutine is alive...
goroutine 1230 [select]:
...spi/default.(*defaultCodec).ReceiveWork
spi/default/DefaultCodec.go:474
# ...but NO ExpireWork goroutine exists to expire the 2s expectation.
# Goroutine state histogram (runtime dump, after panic line):
37 [select, 2 minutes] # idle pool/codec workers
3 [chan receive, 2 minutes]
2 [select]
1 [syscall, 2 minutes]
1 [running] # testing timeout alarm (the panic source)
# Counts in the runtime stack dump:
ExpireWork goroutines: 0
ReceiveWork goroutines: present
goroutines on expectationsChangeMutex: 0 (not a mutex deadlock)
Full ~360-line runtime dump + verbose log available on request (I can attach it as a gist).
What I could not determine
- The exact race (worker exited-and-not-restarted vs never-scheduled) — the worker's own trace
logs are disabled by default, so the dump can't distinguish them; re-running with
traceDefaultMessageCodecWorker=true would help.
- A specific introducing commit — a Windows
git bisect would be unreliable given the flakiness.
Note: 1b612551f7 ("stop leaking codec workers on cached reconnect") already touches this same
ExpireWork function, but it only adds ctx.Done() to the idle (no-expectations) select; this
hang occurs with an active expectation, and develop (which contains that commit) still
reproduces the hang — so that change does not address this case.
Environment
- Version:
0.14.0-SNAPSHOT (develop @ 0d5b7164ed)
- Programming language: plc4go
- Protocol: C-Bus
- OS: Windows 11, Go 1.26.4
Summary
internal/cbus.TestReader_sendMessageOverTheWireintermittently hangs and only ends when Go'stest timeout fires (
panic: test timed out). On the "Go Platform compatibility" workflow this isa red
test (21, windows-latest)job, whileubuntu-latestandmacos-latestpass in the samerun. It is flaky: it passes when the subtest runs in isolation, and
develop's CI is usually green.This is a pre-existing infrastructure flake, not tied to any PR — it reproduces on a clean,
current
origin/develop(verified below). Any PR touchingcode-generation/**,protocols/**,or
plc4go**triggers this workflow and can be blocked by it.Reproduction (Windows, Go 1.26.4)
On a clean
origin/develop(verified @0d5b7164ed):Reproduced within ~3 min (
panic: test timed out after 3m0s,FAIL ... 182.7s).Notes:
go test ./internal/cbus/ -run 'TestReader_sendMessageOverTheWire/Send_message_which_responds_with_message_to_server$'Where it blocks
The test goroutine is parked at the response-wait
selectinReader_test.go:1164, aftersendMessageOverTheWirehas already returned ("Waiting now" is logged atReader_test.go:1163).That
selectwaits forch, which is closed only inside the mocktransaction.FailRequest(...)(
Reader_test.go:527-529).FailRequestis reached only fromsendMessageOverTheWire'shandleErrorcallback (Reader.go), which fires when the registered expectation expires (2s TTL).Root cause area (evidence from a full goroutine dump at the hang)
The expectation timeout never fires. Scoped to the Go runtime stack dump (after the panic line):
spi/default.(*defaultCodec).ExpireWorkgoroutines are alive — nothing processesexpectation expiry for the affected codec — while a
ReceiveWorkgoroutine is present.Added expectationlogs vastly outnumbertimeout expectationlogs, and in thehanging iteration the expired expectation produces no
timeout expectationat all.expectationsChangeMutex); all other goroutinesare parked in
select.So the affected area is the
ExpireWorkexpectation-timeout worker lifecycle inplc4go/spi/default/DefaultCodec.go: an expectation can end up registered while noExpireWorkgoroutine is running to time it out →
FailRequestis never called → the cbus test blocksforever. Notably
startExpire()is invoked in the hanging iteration ("starting expire worker"is logged) yet no
ExpireWorkgoroutine is in the dump — so the worker is started but thenabsent.
Goroutine dump excerpt (Windows, develop @ 0d5b716, panic: test timed out after 3m0s)
Full ~360-line runtime dump + verbose log available on request (I can attach it as a gist).
What I could not determine
logs are disabled by default, so the dump can't distinguish them; re-running with
traceDefaultMessageCodecWorker=truewould help.git bisectwould be unreliable given the flakiness.Note:
1b612551f7("stop leaking codec workers on cached reconnect") already touches this sameExpireWorkfunction, but it only addsctx.Done()to the idle (no-expectations) select; thishang occurs with an active expectation, and
develop(which contains that commit) stillreproduces the hang — so that change does not address this case.
Environment
0.14.0-SNAPSHOT(develop @0d5b7164ed)