sql: ignore stale on_connect firing after #onClose in pooled Postgres/MySQL #30950

Open
robobun wants to merge 2 commits into main from farm/f6fb4a56/sql-pool-stale-onconnected

@robobun robobun commented May 17, 2026

Fixes #30947.

Repro

The issue's script, against local Postgres:

$ DATABASE_URL=postgres://… bun bug.ts
[killed] {"n":"10"}
[q0] {"err":"connection must be a PostgresSQLConnection"}
[q1] {"err":"connection must be a PostgresSQLConnection"}
[q2] {"err":"connection must be a PostgresSQLConnection"}
[q3] {"err":"connection must be a PostgresSQLConnection"}
[q4] {"err":"connection must be a PostgresSQLConnection"}

The repro fails 30-40% of the time on a max: 10 pool when all connections
are closed server-side while the event loop is blocked (the repro blocks
via Bun.spawnSync running pg_terminate_backend; in production the trigger
is a pooler's idle reaper, a network blip, or Supabase's session pooler).
Once it fails, the pool never recovers, and a later sql.close() spins at
100% CPU.

Cause

The pooled connection's handleConnected is wired to the native
connection's on_connect callback, which Rust schedules as a microtask
(queue_microtask) from set_status(Connected) after the server sends
ReadyForQuery. handleClose is the paired callback, which
fail_with_js_value fires synchronously.

When the server's FIN lands in the same I/O tick as ReadyForQuery
(uSockets' poll returns READABLE | EPOLLHUP in one wake, common on
loopback or when a pooler closes many idle conns in a batch), uSockets
dispatches us_dispatch_data then us_dispatch_end in the same
us_internal_dispatch_ready_poll call:

  1. on_data parses ReadyForQuery and queues on_connect as a microtask.
  2. The same read carried the admin-shutdown ErrorResponse, so
    on_data calls fail, which runs on_close synchronously.
    handleClose nulls this.connection, sets state = closed,
    removes the slot from readyConnections.
  3. The queued on_connect microtask then fires. Before this fix,
    handleConnected blindly overwrote state = connected and re-added the
    entry to readyConnections, even though this.connection is still null.
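
The three steps above can be reproduced in isolation. The following is a minimal sketch, not Bun's code: Slot, readyConnections, onData, and handleConnectedUnguarded are hypothetical stand-ins, and the only thing it models is a deferred on_connect racing a synchronous on_close.

```typescript
// Hypothetical stand-ins for the pool internals; these are not Bun's real classes.
type SlotState = "pending" | "connected" | "closed";

interface Slot {
  state: SlotState;
  connection: object | null; // stands in for the native connection handle
}

const readyConnections = new Set<Slot>();

// Unguarded handler, modelled on the pre-fix behavior described above.
function handleConnectedUnguarded(slot: Slot): void {
  slot.state = "connected";
  readyConnections.add(slot);
}

function handleClose(slot: Slot): void {
  slot.connection = null;
  slot.state = "closed";
  readyConnections.delete(slot);
}

// One simulated read carrying both ReadyForQuery and the fatal ErrorResponse.
function onData(slot: Slot): void {
  queueMicrotask(() => handleConnectedUnguarded(slot)); // step 1: deferred
  handleClose(slot); // step 2: runs synchronously
}

async function simulateRace(): Promise<{
  state: SlotState;
  inReady: boolean;
  connectionIsNull: boolean;
}> {
  const slot: Slot = { state: "pending", connection: {} };
  onData(slot);
  await Promise.resolve(); // step 3: let the queued microtask run
  return {
    state: slot.state,
    inReady: readyConnections.has(slot),
    connectionIsNull: slot.connection === null,
  };
}
```

Awaiting simulateRace() resolves to state "connected", inReady true, and connectionIsNull true: the slot is advertised as ready while holding no connection, which is the ghost entry described in step 3.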

The next query sees the ghost in readyConnections,
flushConcurrentQueries dispatches it, and
handle.run(pooledConn.connection /* null */, query) trips the native
guard:

// src/sql_jsc/postgres/PostgresSQLQuery.rs
let Some(connection) = postgres_sql_connection::js::from_js_ref(arguments[0]) else {
    return Err(global_object.throw(format_args!("connection must be a PostgresSQLConnection")));
};

Retry only runs for slots with state === closed, so a ghost stuck in
connected is never refreshed; the pool is permanently wedged.

Fix

BasePooledConnection.handleConnected (src/js/internal/sql/shared.ts)
refuses stale transitions:

if (this.state !== PooledConnectionState.pending) {
  return;
}

The only legitimate paths into handleConnected are the initial
connect, doRetry(), and the connect-failure backoff retry. All of
them hold state = pending while waiting for the callback. Anything
else is a racing microtask from a native connection whose on_close
already ran, and must be dropped rather than promoted back into
readyConnections. One guard in the shared base class covers both the
Postgres and MySQL adapters.
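
To illustrate why the guard is safe for legitimate transitions, here is a small sketch of a slot state machine. Only the guard condition is taken from this PR; SketchPooledConnection, its fields, the string-valued states, and the shape of doRetry() are assumptions made for the example.

```typescript
// Hypothetical model of a pooled slot; only the guard itself comes from the PR.
const PooledConnectionState = {
  pending: "pending",
  connected: "connected",
  closed: "closed",
} as const;
type State = (typeof PooledConnectionState)[keyof typeof PooledConnectionState];

class SketchPooledConnection {
  state: State = PooledConnectionState.pending;
  connection: object | null = {}; // placeholder for the native handle
  ready: Set<SketchPooledConnection>;

  constructor(ready: Set<SketchPooledConnection>) {
    this.ready = ready;
  }

  handleConnected(): void {
    // Drop callbacks that arrive after on_close already ran.
    if (this.state !== PooledConnectionState.pending) {
      return;
    }
    this.state = PooledConnectionState.connected;
    this.ready.add(this);
  }

  handleClose(): void {
    this.connection = null;
    this.state = PooledConnectionState.closed;
    this.ready.delete(this);
  }

  // Assumed shape of a retry: fresh handle, back to pending.
  doRetry(): void {
    this.connection = {};
    this.state = PooledConnectionState.pending;
  }

  describe(): string {
    return `${this.state}/${this.ready.has(this) ? "ready" : "not-ready"}`;
  }
}

// Replays the race, then a retry, and records the slot after each phase.
function demo(): string[] {
  const slot = new SketchPooledConnection(new Set());
  const log: string[] = [];
  slot.handleClose(); // on_close ran first
  slot.handleConnected(); // stale on_connect: ignored by the guard
  log.push(slot.describe());
  slot.doRetry(); // retry puts the slot back into pending
  slot.handleConnected(); // legitimate callback: accepted
  log.push(slot.describe());
  return log;
}
```

In this model demo() returns ["closed/not-ready", "connected/ready"]: the stale callback leaves the slot closed so a retry can pick it up, while the callback that follows a retry is still accepted.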

Rebase note: this PR originally patched #onConnected in
postgres.ts and mysql.ts separately. #32145 consolidated that state
machine into BasePooledConnection in shared.ts, so after rebasing
the guard now lives there (one site instead of two). The bug still
reproduced on main after #32028/#32041/#32097/#32145: the regression
test fails with corrupted: true on unfixed main.

Verification

test/js/sql/postgres-close-during-handshake.test.ts - fake TCP server
that serves the full trust-mode handshake (AuthOk + ParameterStatus
stack + BackendKeyData + ReadyForQuery + admin-shutdown ErrorResponse)
and FINs every socket. Runs under plain bun bd test, no Docker.

  • fails on unfixed main (corrupted: true, first iteration), passes
    with the fix in ~3.5s
  • the retry/force-close suites from the recent pool refactor all pass
    with the guard: sql-close-pending-connection.test.ts,
    sql-onconnect-onclose-throw.test.ts,
    sql-connect-error-reporting.test.ts (32 tests, 0 fail)
  • original repro script: ~30% failure without fix, 0% with fix
    (release build, real Postgres)
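
For readers unfamiliar with the wire format the fake server speaks, the sketch below frames the backend messages named above using the PostgreSQL v3 protocol layout (one type byte, a big-endian int32 length that counts itself but not the type byte, then the payload). It is not the fixture from this PR: the helper names, the ParameterStatus values, and the process ID and secret key are all made up for illustration.

```typescript
const enc = new TextEncoder();

function concat(parts: Uint8Array[]): Uint8Array {
  const out = new Uint8Array(parts.reduce((n, p) => n + p.length, 0));
  let offset = 0;
  for (const p of parts) {
    out.set(p, offset);
    offset += p.length;
  }
  return out;
}

function int32(n: number): Uint8Array {
  const b = new Uint8Array(4);
  new DataView(b.buffer).setInt32(0, n); // DataView writes big-endian by default
  return b;
}

// NUL-terminated string, as used throughout the protocol.
function cstr(s: string): Uint8Array {
  return enc.encode(s + "\0");
}

// Frame one backend message: tag byte + int32 length (self-inclusive) + payload.
function frame(tag: string, payload: Uint8Array): Uint8Array {
  return concat([enc.encode(tag), int32(4 + payload.length), payload]);
}

const authOk = frame("R", int32(0)); // AuthenticationOk
const parameterStatus = (key: string, value: string): Uint8Array =>
  frame("S", concat([cstr(key), cstr(value)]));
const backendKeyData = frame("K", concat([int32(1234), int32(5678)])); // made-up pid/secret
const readyForQuery = frame("Z", enc.encode("I")); // 'I' = idle, not in a transaction

// ErrorResponse: (field-type byte + cstring)* followed by a single NUL terminator.
const adminShutdown = frame(
  "E",
  concat([
    cstr("SFATAL"),
    cstr("VFATAL"),
    cstr("C57P01"), // SQLSTATE admin_shutdown
    cstr("Mterminating connection due to administrator command"),
    new Uint8Array([0]),
  ]),
);

// Everything the server would send in a single write before closing the socket.
function handshakeThenShutdown(): Uint8Array {
  return concat([
    authOk,
    parameterStatus("server_version", "16.0"), // illustrative values
    parameterStatus("client_encoding", "UTF8"),
    backendKeyData,
    readyForQuery,
    adminShutdown,
  ]);
}
```

Sending the whole buffer in one write and then ending the socket is what gives the client a chance to see ReadyForQuery, the ErrorResponse, and the FIN in the same poll wake.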

robobun commented May 17, 2026

Updated 4:50 PM PT - Jun 12th, 2026

@robobun, your commit e6c829b has 1 failure in Build #62146 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 30950

That installs a local version of the PR into your bun-30950 executable, so you can run:

bun-30950 --bun

coderabbitai Bot commented May 17, 2026

Walkthrough

Adds early-return guards in pooled connection onConnected handlers (MySQL and Postgres) to ignore stale callbacks when the connection state is no longer pending, and adds a Postgres regression test that simulates a server close during handshake to validate the fix.

Changes

SQL pool connection state race condition

| Layer / File(s) | Summary |
| --- | --- |
| Connection state guards in handlers: src/js/internal/sql/mysql.ts, src/js/internal/sql/postgres.ts | PooledMySQLConnection.#onConnected and PooledPostgresConnection.#onConnected add early-return checks to ensure the pooled connection is still pending before processing the ReadyForQuery/on_connect callback; if not pending, the callback returns early. |
| Regression test for mid-handshake close: test/js/sql/postgres-close-during-handshake.test.ts | Adds a fixture script and test that spawn a fake Postgres server which completes the startup handshake then immediately sends an admin-shutdown and closes. The test runs multiple pooled SELECT 1 queries and asserts the pool is not corrupted and the fixture exits successfully. |

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: adding guards to ignore stale on_connect callbacks after #onClose in pooled Postgres/MySQL connections. |
| Linked Issues check | ✅ Passed | The changes directly address issue #30947 by adding state checks to prevent stale on_connect callbacks from corrupting the connection pool when connections close server-side during handshake. |
| Out of Scope Changes check | ✅ Passed | All changes are scoped to fixing the connection pool corruption issue: guards in Postgres/MySQL handlers and a regression test for the specific failure scenario. |
| Description check | ✅ Passed | PR description is comprehensive and well-structured, covering issue reference, root cause analysis, fix explanation, and verification details. |


@github-actions

Found 2 issues this PR may fix:

  1. Postgres connection leak #23215 - Unbounded connection growth beyond configured max is consistent with the race where #onConnected re-adds dead connections to readyConnections, causing the pool to create replacements without accounting for corrupted entries
  2. Crash in createInstance in Postgres client in Bun.SQL #24434 - Crash in PostgresSQLConnection__createInstance triggered from a microtask matches the race where a pending #onConnected microtask fires after #onClose has already nulled this.connection

If this is helpful, copy the block below into the PR description to auto-close these issues on merge.

Fixes #23215
Fixes #24434

🤖 Generated with Claude Code

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/js/sql/postgres-close-during-handshake.test.ts`:
- Around line 136-147: Read and assert stderr explicitly before checking payload
state and ensure the process exit code is asserted last: after you obtain
stdout, stderr, and exitCode, add expect(stderr).toBe(""); then parse the last
stdout line into parsed (as already done) and assert expect({ corrupted:
parsed.corrupted }).toEqual({ corrupted: false }); finally assert
expect(exitCode).toEqual(0). Use the existing variables (stdout, stderr,
exitCode, line, parsed) and keep the exit-code assertion as the final check.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: e1da5fe9-6dff-41eb-8f5e-2c3cd366a376

📥 Commits

Reviewing files that changed from the base of the PR and between 2edc9e4 and 4cb22a7.

📒 Files selected for processing (1)
  • test/js/sql/postgres-close-during-handshake.test.ts

Comment thread test/js/sql/postgres-close-during-handshake.test.ts

@claude claude Bot left a comment

Additional findings (outside current diff — PR may have been updated during review):

  • 🟡 test/js/sql/postgres-close-during-handshake.test.ts:135-151 — nit: stderr is collected here but never asserted on or surfaced. If the fixture crashes before printing JSON, the test will fail with {corrupted: undefined, exitCode: <nonzero>} and the actual error will be silently discarded — consider folding stderr into the toEqual object (or logging it on failure) so CI failures are debuggable.

    Extended reasoning...

    What the issue is

    At lines 136-138 the test destructures stderr from Promise.all([proc.stdout.text(), proc.stderr.text(), proc.exited]), but stderr is never referenced again. The final assertion at lines 148-151 only checks { corrupted: parsed.corrupted, exitCode }, so whatever the subprocess wrote to stderr is read into a local and dropped.

    How it manifests

    The fixture is expected to print exactly one JSON line on stdout and exit 0. If anything goes wrong before that console.log(JSON.stringify(...)) — a panic in the native PostgresSQLConnection, an uncaught exception in the net.createServer setup, an assertion failure in a debug build, or a future refactor that introduces a syntax error — the subprocess will write its diagnostic to stderr and exit non-zero, and stdout will be empty. stdout.trim().split("\n").at(-1) then yields "", parsed becomes {}, and the assertion fails with:

    expect({ corrupted: undefined, exitCode: 1 }).toEqual({ corrupted: false, exitCode: 0 })
    

    That diff tells you that the subprocess crashed, but not why. The actual error (stack trace, panic message, ASAN report) was sitting in stderr and got thrown away.

    Why nothing else catches this

    The test pipes both stdout and stderr (stdout: "pipe", stderr: "pipe"), so the subprocess's stderr does not inherit to the test runner's terminal — it is only visible if the test code explicitly surfaces it. There is no other assertion or console.error(stderr) path. In CI the only artifact is the jest diff, which contains none of the diagnostic.

    Convention

    Per test/CLAUDE.md, spawned-process tests in this repo assert on stderr (e.g. expect(stderr).toBe("")) before the exit-code check, or fold it into the snapshot object, precisely so the failure diff carries the real error. This test follows that convention in every respect except the stderr assertion.

    Step-by-step proof

    1. Suppose a future change to src/js/internal/sql/postgres.ts regresses such that new SQL({...}) throws synchronously (or the debug build hits an assert).
    2. The fixture process writes the uncaught-exception stack to stderr and exits with code 1, never reaching console.log.
    3. stdout is "", so line = "", parsed = {}, parsed.corrupted = undefined.
    4. stderr contains the full stack trace, but is never read after destructuring.
    5. The test fails with { corrupted: undefined, exitCode: 1 } != { corrupted: false, exitCode: 0 } — no hint that the failure was (say) TypeError: Cannot read properties of undefined rather than the pool-corruption regression this test guards against.

    Suggested fix

    Fold stderr into the assertion so it appears in the diff:

    expect({ stderr, corrupted: parsed.corrupted, exitCode }).toEqual({
      stderr: "",
      corrupted: false,
      exitCode: 0,
    });

    If the fixture is expected to emit some stderr (e.g. connection-closed warnings), an alternative is to keep the existing assertion but include stdout/stderr in the object so they show up in the failure diff without being matched against, or simply console.error(stderr) before the expect.

    This is purely a debuggability/test-quality concern — it does not affect whether the test passes or fails for the bug being fixed, so filing as a nit.

@claude claude Bot left a comment

The guard looks correct and the race analysis is convincing, but this is a state-machine change in the production SQL connection pool (Postgres + MySQL) with subtle microtask-vs-sync-callback ordering, so it's worth a maintainer's eye — also note Build #55608 is showing failures.

Extended reasoning...

Overview

This PR adds a single early-return guard at the top of #onConnected in both PooledPostgresConnection (src/js/internal/sql/postgres.ts) and PooledMySQLConnection (src/js/internal/sql/mysql.ts): if this.state !== PooledConnectionState.pending, the callback is dropped. It also adds a ~150-line regression test (test/js/sql/postgres-close-during-handshake.test.ts) that spins up a fake Postgres TCP server which sends the full trust-mode handshake plus an admin-shutdown error and immediately FINs, to deterministically reproduce the race where #onClose runs synchronously before the queued #onConnected microtask drains.

Security risks

None. The guard is purely defensive against an internal callback-ordering race; it doesn't touch auth, TLS, query escaping, or any user-input handling. The test fixture binds to 127.0.0.1:0 in a subprocess.

Level of scrutiny

Medium-high. The runtime change is only ~3 lines of logic per adapter, and tracing the state machine confirms the only legitimate transitions into #onConnected are from pending (constructor and #doRetry() both set state = pending before #startConnection()), so the guard cannot drop a valid callback. #onClose already handles error propagation, onFinish resolution, and adapter.release, so dropping a stale #onConnected after it doesn't leak or hang anything. That said, this is the connection-pool state machine for Bun.SQL — a production-critical path where pool corruption manifests as permanent wedging — and the fix hinges on subtle reasoning about native Rust's queue_microtask(on_connect) vs synchronous on_close dispatch ordering. A maintainer familiar with the native side (e.g. the PostgresSQLConnection lifecycle) should confirm there's no path where on_connect is the only signal and on_close never follows.

Other factors

  • robobun flagged failures in Build #55608 on the latest commit; I can't tell from here whether they're related or pre-existing flakes.
  • The new test hand-rolls Postgres wire-protocol packets and relies on uSockets delivering handshake+FIN in one poll dispatch; worth a quick check that it's not flaky across platforms.
  • coderabbit's only comment (stderr assertion) was addressed in e4c0989 and resolved.
  • No CODEOWNERS cover these files.

When a pooled connection's socket is closed in the same I/O tick as the
handshake completing (e.g. Postgres ReadyForQuery followed by the server's
FIN in one read buffer), handleClose runs synchronously before the queued
on_connect microtask. handleConnected then blindly set state = connected
and re-added the dead entry to readyConnections with connection === null.
Subsequent queries dispatched null to the native query's run() and failed
with 'connection must be a PostgresSQLConnection' forever; the pool never
retried a slot it thought was live.

Bail out of BasePooledConnection.handleConnected when state !== pending.
Covers both the Postgres and MySQL adapters.

Fixes #30947
@robobun robobun force-pushed the farm/f6fb4a56/sql-pool-stale-onconnected branch from 554c448 to 7bb1831 Compare June 12, 2026 18:37
Comment thread test/js/sql/postgres-close-during-handshake.test.ts Outdated

@claude claude Bot left a comment

LGTM — the comment-pointer nit is addressed and the guard is a minimal, well-tested fix.

Extended reasoning...

Overview

This PR adds a 3-line early-return guard to BasePooledConnection.handleConnected in src/js/internal/sql/shared.ts, plus a new regression test (test/js/sql/postgres-close-during-handshake.test.ts). The guard drops stale on_connect microtasks that fire after handleClose has already transitioned the slot to closed, preventing a dead connection from being resurrected into readyConnections with this.connection === null. The test uses a fake TCP server (no Docker) that sends a full trust-mode handshake + admin-shutdown + FIN in one write, reproducing the race deterministically.

Security risks

None. This is a defensive state-machine guard in the JS connection-pool layer. It does not touch auth, TLS, credential handling, or any user-input parsing. The only behavioral change is refusing to transition closed → connected on a stale callback, which strictly narrows the set of accepted state transitions.

Level of scrutiny

Moderate-to-low. While connection-pool state machines are generally subtle, this specific change is purely additive and defensive: it adds one early-return that fires only when state !== pending. I traced all writers of this.state — the constructor, doRetry(), and the connect-failure backoff retry are the only paths that hold pending while awaiting this callback, so the guard cannot drop a legitimate transition. I also checked the MySQL subclass override at mysql.ts:123, which assigns this.connection before calling super; in the race scenario this leaves a stale handle on a closed slot, but that's benign since the slot is never re-added to readyConnections and doRetry() overwrites the handle on the next connect cycle.

Other factors

  • No bugs found by the bug-hunting system.
  • My prior nit (test header comment pointed at the pre-#32145 postgres.ts location) was addressed in e6c829b — the header now correctly references BasePooledConnection.handleConnected in shared.ts.
  • CodeRabbit's stderr-assertion feedback was addressed (folded into the single toEqual for better failure diffs).
  • Files are not covered by CODEOWNERS.
  • The PR description includes a detailed root-cause analysis and verification against the related retry/force-close test suites (32 tests, 0 fail).

Development

Successfully merging this pull request may close these issues.

Bun.SQL pool permanently corrupted when all pool connections are closed server-side