Skip to content

[Bug]: plc4go internal/cbus TestReader_sendMessageOverTheWire intermittently hangs until test timeout (Windows) #2611

Description

@LivingLikeKrillin

Summary

internal/cbus.TestReader_sendMessageOverTheWire intermittently hangs and only ends when Go's
test timeout fires (panic: test timed out). On the "Go Platform compatibility" workflow this is
a red test (21, windows-latest) job, while ubuntu-latest and macos-latest pass in the same
run. It is flaky: it passes when the subtest runs in isolation, and develop's CI is usually green.

This is a pre-existing infrastructure flake, not tied to any PR — it reproduces on a clean,
current origin/develop (verified below). Any PR touching code-generation/**, protocols/**,
or plc4go** triggers this workflow and can be blocked by it.

Reproduction (Windows, Go 1.26.4)

On a clean origin/develop (verified @ 0d5b7164ed):

cd plc4go
go test ./internal/cbus/ -run "TestReader_sendMessageOverTheWire" -count=5 -v -timeout 180s

Reproduced within ~3 min (panic: test timed out after 3m0s, FAIL ... 182.7s).
Notes:

  • The single subtest in isolation passes in ~2.00s:
    go test ./internal/cbus/ -run 'TestReader_sendMessageOverTheWire/Send_message_which_responds_with_message_to_server$'
  • ubuntu/macos do not reproduce it; it appears Windows-specific (timing/scheduler sensitive).

Where it blocks

The test goroutine is parked at the response-wait select in Reader_test.go:1164, after
sendMessageOverTheWire has already returned ("Waiting now" is logged at Reader_test.go:1163).
That select waits for ch, which is closed only inside the mock transaction.FailRequest(...)
(Reader_test.go:527-529). FailRequest is reached only from sendMessageOverTheWire's
handleError callback (Reader.go), which fires when the registered expectation expires (2s TTL).

Root cause area (evidence from a full goroutine dump at the hang)

The expectation timeout never fires. Scoped to the Go runtime stack dump (after the panic line):

  • 0 spi/default.(*defaultCodec).ExpireWork goroutines are alive — nothing processes
    expectation expiry for the affected codec — while a ReceiveWork goroutine is present.
  • Over the run, Added expectation logs vastly outnumber timeout expectation logs, and in the
    hanging iteration the expired expectation produces no timeout expectation at all.
  • Not a mutex deadlock (0 goroutines waiting on expectationsChangeMutex); all other goroutines
    are parked in select.

So the affected area is the ExpireWork expectation-timeout worker lifecycle in
plc4go/spi/default/DefaultCodec.go
: an expectation can end up registered while no ExpireWork
goroutine is running to time it out → FailRequest is never called → the cbus test blocks
forever. Notably startExpire() is invoked in the hanging iteration ("starting expire worker"
is logged) yet no ExpireWork goroutine is in the dump — so the worker is started but then
absent.

Goroutine dump excerpt (Windows, develop @ 0d5b716, panic: test timed out after 3m0s)
# Test goroutine — parked on the response-wait select, FailRequest never came:
goroutine 1092 [select, 2 minutes]:
  ...cbus.TestReader_sendMessageOverTheWire.func28
    internal/cbus/Reader_test.go:1164      # select { <-t.Context().Done(); <-ch }

# A ReceiveWork goroutine is alive...
goroutine 1230 [select]:
  ...spi/default.(*defaultCodec).ReceiveWork
    spi/default/DefaultCodec.go:474

# ...but NO ExpireWork goroutine exists to expire the 2s expectation.

# Goroutine state histogram (runtime dump, after panic line):
   37 [select, 2 minutes]      # idle pool/codec workers
    3 [chan receive, 2 minutes]
    2 [select]
    1 [syscall, 2 minutes]
    1 [running]                 # testing timeout alarm (the panic source)

# Counts in the runtime stack dump:
ExpireWork goroutines:                 0
ReceiveWork goroutines:                present
goroutines on expectationsChangeMutex: 0   (not a mutex deadlock)

Full ~360-line runtime dump + verbose log available on request (I can attach it as a gist).

What I could not determine

  • The exact race (worker exited-and-not-restarted vs never-scheduled) — the worker's own trace
    logs are disabled by default, so the dump can't distinguish them; re-running with
    traceDefaultMessageCodecWorker=true would help.
  • A specific introducing commit — a Windows git bisect would be unreliable given the flakiness.

Note: 1b612551f7 ("stop leaking codec workers on cached reconnect") already touches this same
ExpireWork function, but it only adds ctx.Done() to the idle (no-expectations) select; this
hang occurs with an active expectation, and develop (which contains that commit) still
reproduces the hang — so that change does not address this case.

Environment

  • Version: 0.14.0-SNAPSHOT (develop @ 0d5b7164ed)
  • Programming language: plc4go
  • Protocol: C-Bus
  • OS: Windows 11, Go 1.26.4

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions