sql: ignore stale on_connect firing after #onClose in pooled Postgres/MySQL#30950
sql: ignore stale on_connect firing after #onClose in pooled Postgres/MySQL#30950robobun wants to merge 2 commits into
Conversation
|
Updated 4:50 PM PT - Jun 12th, 2026
❌ @robobun, your commit e6c829b has 1 failures in
🧪 To try this PR locally: bunx bun-pr 30950That installs a local version of the PR into your bun-30950 --bun |
WalkthroughAdds early-return guards in pooled connection onConnected handlers (MySQL and Postgres) to ignore stale callbacks when the connection state is no longer ChangesSQL pool connection state race condition
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
|
Found 2 issues this PR may fix:
🤖 Generated with Claude Code |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/js/sql/postgres-close-during-handshake.test.ts`:
- Around line 136-147: Read and assert stderr explicitly before checking payload
state and ensure the process exit code is asserted last: after you obtain
stdout, stderr, and exitCode, add expect(stderr).toBe(""); then parse the last
stdout line into parsed (as already done) and assert expect({ corrupted:
parsed.corrupted }).toEqual({ corrupted: false }); finally assert
expect(exitCode).toEqual(0). Use the existing variables (stdout, stderr,
exitCode, line, parsed) and keep the exit-code assertion as the final check.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: e1da5fe9-6dff-41eb-8f5e-2c3cd366a376
📒 Files selected for processing (1)
test/js/sql/postgres-close-during-handshake.test.ts
There was a problem hiding this comment.
Additional findings (outside current diff — PR may have been updated during review):
-
🟡
test/js/sql/postgres-close-during-handshake.test.ts:135-151— nit:stderris collected here but never asserted on or surfaced. If the fixture crashes before printing JSON, the test will fail with{corrupted: undefined, exitCode: <nonzero>}and the actual error will be silently discarded — consider foldingstderrinto thetoEqualobject (or logging it on failure) so CI failures are debuggable.Extended reasoning...
What the issue is
At lines 136-138 the test destructures
stderrfromPromise.all([proc.stdout.text(), proc.stderr.text(), proc.exited]), butstderris never referenced again. The final assertion at lines 148-151 only checks{ corrupted: parsed.corrupted, exitCode }, so whatever the subprocess wrote to stderr is read into a local and dropped.How it manifests
The fixture is expected to print exactly one JSON line on stdout and exit 0. If anything goes wrong before that
console.log(JSON.stringify(...))— a panic in the native PostgresSQLConnection, an uncaught exception in thenet.createServersetup, an assertion failure in a debug build, or a future refactor that introduces a syntax error — the subprocess will write its diagnostic to stderr and exit non-zero, and stdout will be empty.stdout.trim().split("\n").at(-1)then yields"",parsedbecomes{}, and the assertion fails with:expect({ corrupted: undefined, exitCode: 1 }).toEqual({ corrupted: false, exitCode: 0 })That diff tells you that the subprocess crashed, but not why. The actual error (stack trace, panic message, ASAN report) was sitting in
stderrand got thrown away.Why nothing else catches this
The test pipes both stdout and stderr (
stdout: "pipe", stderr: "pipe"), so the subprocess's stderr does not inherit to the test runner's terminal — it is only visible if the test code explicitly surfaces it. There is no other assertion orconsole.error(stderr)path. In CI the only artifact is the jest diff, which contains none of the diagnostic.Convention
Per
test/CLAUDE.md, spawned-process tests in this repo assert on stderr (e.g.expect(stderr).toBe("")) before the exit-code check, or fold it into the snapshot object, precisely so the failure diff carries the real error. The pattern here matches every other field of that convention except this one.Step-by-step proof
- Suppose a future change to
src/js/internal/sql/postgres.tsregresses such thatnew SQL({...})throws synchronously (or the debug build hits anassert). - The fixture process writes the uncaught-exception stack to stderr and exits with code 1, never reaching
console.log. stdoutis"", soline = "",parsed = {},parsed.corrupted = undefined.stderrcontains the full stack trace, but is never read after destructuring.- The test fails with
{ corrupted: undefined, exitCode: 1 } != { corrupted: false, exitCode: 0 }— no hint that the failure was (say)TypeError: Cannot read properties of undefinedrather than the pool-corruption regression this test guards against.
Suggested fix
Fold stderr into the assertion so it appears in the diff:
expect({ stderr, corrupted: parsed.corrupted, exitCode }).toEqual({ stderr: "", corrupted: false, exitCode: 0, });
If the fixture is expected to emit some stderr (e.g. connection-closed warnings), an alternative is to keep the existing assertion but include
stdout/stderrin the object so they show up in the failure diff without being matched against, or simplyconsole.error(stderr)before theexpect.This is purely a debuggability/test-quality concern — it does not affect whether the test passes or fails for the bug being fixed, so filing as a nit.
- Suppose a future change to
There was a problem hiding this comment.
The guard looks correct and the race analysis is convincing, but this is a state-machine change in the production SQL connection pool (Postgres + MySQL) with subtle microtask-vs-sync-callback ordering, so it's worth a maintainer's eye — also note Build #55608 is showing failures.
Extended reasoning...
Overview
This PR adds a single early-return guard at the top of #onConnected in both PooledPostgresConnection (src/js/internal/sql/postgres.ts) and PooledMySQLConnection (src/js/internal/sql/mysql.ts): if this.state !== PooledConnectionState.pending, the callback is dropped. It also adds a ~150-line regression test (test/js/sql/postgres-close-during-handshake.test.ts) that spins up a fake Postgres TCP server which sends the full trust-mode handshake plus an admin-shutdown error and immediately FINs, to deterministically reproduce the race where #onClose runs synchronously before the queued #onConnected microtask drains.
Security risks
None. The guard is purely defensive against an internal callback-ordering race; it doesn't touch auth, TLS, query escaping, or any user-input handling. The test fixture binds to 127.0.0.1:0 in a subprocess.
Level of scrutiny
Medium-high. The runtime change is only ~3 lines of logic per adapter, and tracing the state machine confirms the only legitimate transitions into #onConnected are from pending (constructor and #doRetry() both set state = pending before #startConnection()), so the guard cannot drop a valid callback. #onClose already handles error propagation, onFinish resolution, and adapter.release, so dropping a stale #onConnected after it doesn't leak or hang anything. That said, this is the connection-pool state machine for Bun.SQL — a production-critical path where pool corruption manifests as permanent wedging — and the fix hinges on subtle reasoning about native Rust's queue_microtask(on_connect) vs synchronous on_close dispatch ordering. A maintainer familiar with the native side (e.g. the PostgresSQLConnection lifecycle) should confirm there's no path where on_connect is the only signal and on_close never follows.
Other factors
- robobun flagged failures in Build #55608 on the latest commit; I can't tell from here whether they're related or pre-existing flakes.
- The new test hand-rolls Postgres wire-protocol packets and relies on uSockets delivering handshake+FIN in one poll dispatch; worth a quick check that it's not flaky across platforms.
- coderabbit's only comment (stderr assertion) was addressed in e4c0989 and resolved.
- No CODEOWNERS cover these files.
When a pooled connection's socket is closed in the same I/O tick as the handshake completing (e.g. Postgres ReadyForQuery followed by the server's FIN in one read buffer), handleClose runs synchronously before the queued on_connect microtask. handleConnected then blindly set state = connected and re-added the dead entry to readyConnections with connection === null. Subsequent queries dispatched null to the native query's run() and failed with 'connection must be a PostgresSQLConnection' forever; the pool never retried a slot it thought was live. Bail out of BasePooledConnection.handleConnected when state !== pending. Covers both the Postgres and MySQL adapters. Fixes #30947
554c448 to
7bb1831
Compare
There was a problem hiding this comment.
LGTM — the comment-pointer nit is addressed and the guard is a minimal, well-tested fix.
Extended reasoning...
Overview
This PR adds a 3-line early-return guard to BasePooledConnection.handleConnected in src/js/internal/sql/shared.ts, plus a new regression test (test/js/sql/postgres-close-during-handshake.test.ts). The guard drops stale on_connect microtasks that fire after handleClose has already transitioned the slot to closed, preventing a dead connection from being resurrected into readyConnections with this.connection === null. The test uses a fake TCP server (no Docker) that sends a full trust-mode handshake + admin-shutdown + FIN in one write, reproducing the race deterministically.
Security risks
None. This is a defensive state-machine guard in the JS connection-pool layer. It does not touch auth, TLS, credential handling, or any user-input parsing. The only behavioral change is refusing to transition closed → connected on a stale callback, which strictly narrows the set of accepted state transitions.
Level of scrutiny
Moderate-to-low. While connection-pool state machines are generally subtle, this specific change is purely additive and defensive: it adds one early-return that fires only when state !== pending. I traced all writers of this.state — the constructor, doRetry(), and the connect-failure backoff retry are the only paths that hold pending while awaiting this callback, so the guard cannot drop a legitimate transition. I also checked the MySQL subclass override at mysql.ts:123, which assigns this.connection before calling super; in the race scenario this leaves a stale handle on a closed slot, but that's benign since the slot is never re-added to readyConnections and doRetry() overwrites the handle on the next connect cycle.
Other factors
- No bugs found by the bug-hunting system.
- My prior nit (test header comment pointed at the pre-#32145
postgres.tslocation) was addressed in e6c829b — the header now correctly referencesBasePooledConnection.handleConnectedinshared.ts. - CodeRabbit's stderr-assertion feedback was addressed (folded into the single
toEqualfor better failure diffs). - Files are not covered by CODEOWNERS.
- The PR description includes a detailed root-cause analysis and verification against the related retry/force-close test suites (32 tests, 0 fail).
Fixes #30947.
Repro
The issue's script, against local Postgres:
30-40% failure rate on a
max: 10pool when all connections are closedserver-side while the event loop is blocked (the repro blocks via
Bun.spawnSyncrunningpg_terminate_backend; in production this isa pooler's idle reaper, a network blip, or Supabase's session pooler).
The pool never recovers;
sql.close()afterwards spins at 100% CPU.Cause
The pooled connection's
handleConnectedis wired to the nativeconnection's
on_connectcallback, which Rust schedules as a microtask(
queue_microtask) fromset_status(Connected)after the server sendsReadyForQuery.handleCloseis the paired callback, whichfail_with_js_valuefires synchronously.When the server's FIN lands in the same I/O tick as
ReadyForQuery(uSockets' poll returns
READABLE | EPOLLHUPin one wake, common onloopback or when a pooler closes many idle conns in a batch), uSockets
dispatches
us_dispatch_datathenus_dispatch_endin the sameus_internal_dispatch_ready_pollcall:on_dataparsesReadyForQueryand queueson_connectas a microtask.ErrorResponse, soon_datacallsfail, which runson_closesynchronously.handleClosenullsthis.connection, setsstate = closed,removes the slot from
readyConnections.on_connectmicrotask then fires.handleConnectedblindly overwrote
state = connectedand re-added the entry toreadyConnections, butthis.connectionis stillnull.The next query sees the ghost in
readyConnections,flushConcurrentQueriesdispatches it, andhandle.run(pooledConn.connection /* null */, query)trips the nativeguard:
Retry only runs for slots with
state === closed, so a ghost stuck inconnectedis never refreshed; the pool is permanently wedged.Fix
BasePooledConnection.handleConnected(src/js/internal/sql/shared.ts)refuses stale transitions:
The only legitimate paths into
handleConnectedare the initialconnect,
doRetry(), and the connect-failure backoff retry. All ofthem hold
state = pendingwhile waiting for the callback. Anythingelse is a racing microtask from a native connection whose
on_closealready ran, and must be dropped rather than promoted back into
readyConnections. One guard in the shared base class covers both thePostgres and MySQL adapters.
Rebase note: this PR originally patched
#onConnectedinpostgres.tsandmysql.tsseparately. #32145 consolidated that statemachine into
BasePooledConnectioninshared.ts, so after rebasingthe guard now lives there (one site instead of two). The bug still
reproduced on main after #32028/#32041/#32097/#32145: the regression
test fails with
corrupted: trueon unfixed main.Verification
test/js/sql/postgres-close-during-handshake.test.ts- fake TCP serverthat serves the full trust-mode handshake (AuthOk + ParameterStatus
stack + BackendKeyData + ReadyForQuery + admin-shutdown ErrorResponse)
and FINs every socket. Runs under plain
bun bd test, no Docker.corrupted: true, first iteration), passeswith the fix in ~3.5s
with the guard:
sql-close-pending-connection.test.ts,sql-onconnect-onclose-throw.test.ts,sql-connect-error-reporting.test.ts(32 tests, 0 fail)(release build, real Postgres)