Skip to content

sql: fix pool stall when sql.begin() runs concurrently with pooled queries#32006

Open
robobun wants to merge 3 commits into
mainfrom
farm/25526f47/sql-pool-stall
Open

sql: fix pool stall when sql.begin() runs concurrently with pooled queries#32006
robobun wants to merge 3 commits into
mainfrom
farm/25526f47/sql-pool-stall

Conversation

@robobun

@robobun robobun commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

Fixes #32004

Repro

From the issue: a pool (max: 4) running sql.begin() transactions concurrently with plain parameterized sql.unsafe() queries intermittently stalls forever. pg_stat_activity shows every connection idle at ClientRead (one left idle in transaction, its COMMIT never sent), no lock waits, while several sql.unsafe() calls wait for a connection that never comes back. Reproduced on the released build: 4 hangs in 15 runs of the issue's script; 0 in 30 runs with this fix.

Cause

Two cooperating bugs:

  1. Pool handoff leaves the connection visible to the distributor. connect(reserved=true)'s direct path removes the connection from readyConnections, but the release() path that hands a drained connection to a waiting sql.begin() does not, and flushConcurrentQueries() only filtered preReserved (not reserved) connections. Pooled queries therefore kept getting distributed onto a connection a transaction owned, and executed inside that transaction.

  2. Native queue could write out of order. PostgresSQLQuery::run()'s pipelining fast path (!has_query_running() || can_pipeline()) wrote a prepared query's Bind+Execute to the wire even when an earlier queued request was still unwritten, e.g. a simple-protocol COMMIT waiting for pipelined_requests to drain (can_pipeline() only counts written requests). Responses are matched to requests in FIFO queue order, so the unwritten COMMIT consumed the pipelined query's CommandComplete/ReadyForQuery: the transaction "committed" without COMMIT ever reaching the server (the idle in transaction connection), finish_request underflowed nonpipelinable_requests (wraps to u32::MAX in release), and from then on advance() always returned early. The stolen-from query never resolved, so its pool slot leaked and the pool stalled.

Fix

  • src/js/internal/sql/shared.ts (the pool consolidated from the postgres/mysql adapters by sql: consolidate JS pool/connection plumbing into BasePooledConnection in shared.ts #32145): release() removes the connection from readyConnections when handing it to a reserved waiter (mirroring connect()), and flushConcurrentQueries() skips reserved connections.
  • src/sql_jsc/postgres/PostgresSQLQuery.rs: run() no longer writes prepared queries directly; it leaves them Pending and pumps the queue through advance_and_flush(). advance() already enforces every ordering barrier (simple queries wait for the pipeline to drain, prepares wait for can_prepare_query), and still pipelines prepared statements behind in-flight pipelined requests, so wire order always matches queue order. This also deletes run()'s duplicate of advance()'s bind-and-execute block.
  • src/sql_jsc/postgres/PostgresSQLConnection.rs: advance_and_flush()'s gate changes from IS_READY_FOR_QUERY to !WAITING_TO_PREPARE so it can pipeline while requests are in flight (IS_READY_FOR_QUERY implies !WAITING_TO_PREPARE, so every previously allowed state is still allowed). It stays excluded during WAITING_TO_PREPARE because advance() skips requests whose statement is still Parsing. The flush is deferred through the existing auto-flusher so same-tick enqueues still batch into one socket write, and advance()'s in-flight arm gates on can_pipeline() so BUN_FEATURE_FLAG_DISABLE_SQL_AUTO_PIPELINING still guarantees at most one query in flight per connection (review follow-ups, each covered below).

MySQL's native queue has a narrower variant of the same ordering window; tracked in #32005. The pool fix above already applies to the MySQL adapter and removes the transaction-driven trigger there.

Rebase note

Rebased onto main after #32145 consolidated the JS pool/connection plumbing from postgres.ts/mysql.ts into BasePooledConnection/BaseSQLAdapter in shared.ts. The consolidated code carried the same two pool bugs unchanged, so the JS half of this fix now lands once in shared.ts instead of twice in the adapter files; the Rust changes applied cleanly. Re-verified on the new base: the regression test still fails (times out, pool wedged) with src/ reverted to main and passes with the fix.

Verification

test/js/sql/postgres-pool-transaction-stall.test.ts has two tests against a scripted mock postgres server. The second runs a fixture with BUN_FEATURE_FLAG_DISABLE_SQL_AUTO_PIPELINING=1 and asserts a second prepared query's Bind does not reach the server while the first is still in flight (fails with 3 Binds instead of 2 without the can_pipeline() gate in advance()). The first scripts a mock postgres server (no Docker/service needed) that holds one query's response until a control query arrives on a second connection, deterministically forcing the interleaving. On the unfixed build the fixture's pooled query lands on the transaction's connection between BEGIN and COMMIT, its Bind+Execute jumps the queued transaction query, the wedge follows, and the test fails (pool never completes). With the fix the test asserts every query completes, COMMIT actually reaches the server, and no pooled query is written to the transaction's connection between BEGIN and COMMIT.

Also ran the issue's original repro in a loop against real Postgres 17: 30/30 clean with the fix (debug build), 4/15 hangs without it (released build). Full test/js/sql/ suite passes except two pre-existing sqlite test timeouts that fail identically on main with a debug build.

@robobun

robobun commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator Author
Updated 1:08 PM PT - Jun 12th, 2026

@robobun, your commit cddacc5 has 2 failures in Build #62118 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 32006

That installs a local version of the PR into your bun-32006 executable, so you can run:

bun-32006 --bun

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Found 3 issues this PR may fix:

  1. [Bug] MySQL pool size > 1 causes Bun to hang during sql execution #26235 - MySQL pool size > 1 hangs during transactions; the pool handoff bug leaves the handed-off connection in readyConnections, so flushConcurrentQueries() dispatches pooled queries onto the transaction connection, deadlocking it
  2. MySQL transaction is hanging/freezing on Windows. #25552 - MySQL transactions hang/freeze on Windows; same root cause as [Bug] MySQL pool size > 1 causes Bun to hang during sql execution #26235 where sequential sql.begin() calls with pool size > 1 stall because the connection isn't removed from readyConnections on handoff
  3. Bun.SQL pool permanently corrupted when all pool connections are closed server-side #30947 - SQL pool permanently corrupted when all pool connections are closed server-side; the readyConnections inconsistency from the pool handoff bug contributes to the stale/wrong entries causing connection must be a PostgresSQLConnection errors

If this is helpful, copy the block below into the PR description to auto-close these issues on merge.

Fixes #26235
Fixes #25552
Fixes #30947

🤖 Generated with Claude Code

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: efdab3eb-76f4-46b5-bbc3-9b12954799a7

📥 Commits

Reviewing files that changed from the base of the PR and between 7e2903c and cddacc5.

📒 Files selected for processing (1)
  • test/js/sql/postgres-pool-transaction-stall.test.ts

Walkthrough

This PR prevents Postgres pool deadlocks by deferring prepared-statement bind/execute to FIFO advancement, gating advance/flush until no backpressure and no pending prepare, and excluding reserved connections from concurrent flush selection. It adds fixtures and deterministic mock-server integration tests validating the fix and pipelining behavior.

Changes

PostgreSQL Pool Concurrency Fix

Layer / File(s) Summary
Connection pool reserved connection isolation
src/js/internal/sql/shared.ts
Connection pool now excludes both preReserved and reserved connections from concurrent flush selection, and release() removes handed-off connections from the ready set so reserved connections are not visible to concurrent flushers.
Postgres advance_and_flush ordering gate
src/sql_jsc/postgres/PostgresSQLConnection.rs
advance_and_flush() now requires no backpressure and no pending prepare (!HAS_BACKPRESSURE && !WAITING_TO_PREPARE) instead of IS_READY_FOR_QUERY; after advance() it schedules a deferred flush via register_auto_flusher() rather than immediate flush_data(). The advance() loop consolidates pipelining eligibility via can_pipeline().
Postgres query execution deferral
src/sql_jsc/postgres/PostgresSQLQuery.rs
Reused prepared statements in Prepared/Parsing/Pending states no longer immediately bind_and_execute; execution is deferred to the advance()/advance_and_flush() path to preserve FIFO ordering and avoid bypassing queued requests.
Auto-pipelining feature-flag fixture
test/js/sql/postgres-pool-pipeline-flag-fixture.ts
Fixture validates that disabling auto-pipelining keeps one in-flight query per connection by holding prepared-statement responses and ensuring subsequent queries remain queued.
Transaction stall reproduction fixture
test/js/sql/postgres-pool-transaction-stall-fixture.ts
Fixture reproduces the transaction stall by creating a single-connection pool, pre-preparing statements, arming control to hold responses, queuing a victim query behind a held bind inside a transaction, and running a fast pooled query to detect desynchronization.
Integration test mock server and validation
test/js/sql/postgres-pool-transaction-stall.test.ts
Integration tests include a scripted mock Postgres server implementing the extended query protocol subset with gated bind responses; tests spawn fixtures, assert ordered lifecycle markers and wire-order invariants, and validate pipelining counters when the feature flag is toggled.

Possibly related issues

Possibly related PRs

  • oven-sh/bun#32145: Modifies pooled-connection concurrency scheduling around reserved vs non-reserved connections; changes here build directly on that pooling plumbing.

Suggested reviewers

  • cirospaciari
  • alii
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'sql: fix pool stall when sql.begin() runs concurrently with pooled queries' directly describes the main fix—resolving a pool stall issue when transactions and pooled queries run concurrently. It is clear, specific, and accurately summarizes the primary change.
Description check ✅ Passed The description fully addresses both template sections: 'What does this PR do?' is thoroughly covered with root cause analysis, fixes, and verification approach; 'How did you verify your code works?' details multiple verification strategies (repro testing, deterministic tests, real Postgres 17 runs). It exceeds basic requirements with detailed technical depth.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@robobun

robobun commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator Author

Checked each of the three suggested issues against this branch before adding any Fixes lines:

Leaving the PR scoped to #32004.

@robobun

robobun commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator Author

CI analysis across the builds on this PR:

  • Build 61454: every failed lane was the @angular/cli bunx breakage, since fixed on main by skipping the test (Skip bunx.test.ts in CI until the Node.js version bump #32042).
  • Build 61650: no test or build failures; 23 jobs expired waiting for CI agents. All 86 jobs that ran passed. Retriggered once.
  • Build 61700: 280 passed; 6 red, all unrelated (pre-existing snapshot.test.ts flake that fails identically on unmodified bun, the streams-leak flake below, a darwin PTY flake, a node:http backpressure flake, one CI VM killed by a cancellation signal, two agent-starvation expiries).
  • Build 62104: 264+ passed; only failure was the streams-leak flake (alpine 3.23 x64).
  • Build 62118 (final, current head): 284 passed, 2 red, both unrelated:
    • debian 13 x64: test/js/web/streams/streams-leak.test.ts ("native ReadableStream reuses the pull buffer across small reads"), the same flake as the two previous builds (three consecutive builds, all retries; passes 3/3 locally on both this branch and unmodified bun). Tracked in Flaky CI test: streams-leak.test.ts "native ReadableStream reuses the pull buffer across small reads" #32190.
    • darwin 26 aarch64: timed out because the CI runner VM received a cancellation signal and was terminated, the same infra kill this lane hit in build 61700 (preceded by darwin-only bun init CLI test flakes, also unrelated to SQL).

Nothing SQL-related has failed on any lane in any build; the postgres-pool-transaction-stall tests passed everywhere they ran, including after the rebase onto the consolidated pool (#32145) and the review follow-ups. The diff is green; the remaining red is tracked flake and CI infra.

…eries

Two cooperating bugs could permanently wedge a postgres pool under
concurrency, leaving connections idle (one "idle in transaction") while
pending queries waited forever:

1. Pool (src/js/internal/sql/shared.ts): release() handed an idle
   connection to a waiting sql.begin() without removing it from
   readyConnections (unlike connect()'s reserved path), and
   flushConcurrentQueries() only filtered preReserved connections.
   Pooled queries kept getting distributed onto a connection a
   transaction owned, executing inside the transaction. The shared
   pool is used by both the postgres and mysql adapters.

2. Native queue: run()'s pipelining fast path wrote a prepared query's
   Bind+Execute to the wire even when an earlier queued request was
   still unwritten (e.g. a simple-protocol COMMIT waiting for the
   pipeline to drain). Responses are matched to requests in FIFO queue
   order, so the unwritten request stole the pipelined query's
   CommandComplete: the transaction "committed" without COMMIT ever
   reaching the server, finish_request underflowed
   nonpipelinable_requests, and advance() returned early forever.
   run() now leaves prepared queries Pending and pumps the queue via
   advance_and_flush(), which writes them in FIFO order and still
   pipelines behind in-flight pipelined requests.

The regression test scripts a mock postgres server that holds one
response until a control query arrives, deterministically forcing the
interleaving that previously stalled the pool.

Fixes #32004

MySQL's native queue has a narrower variant of the ordering window,
tracked in #32005.
@robobun robobun force-pushed the farm/25526f47/sql-pool-stall branch from 08dce70 to 6ec27fb Compare June 12, 2026 16:14
Comment thread src/sql_jsc/postgres/PostgresSQLConnection.rs Outdated
Comment thread src/sql_jsc/postgres/PostgresSQLQuery.rs
Review follow-ups to the queue-ordering fix:

- advance_and_flush() defers the socket flush through the auto-flusher
  (like flush_data_and_reset_timeout) instead of writing immediately, so
  prepared queries enqueued in the same tick coalesce into one socket
  write again.
- advance()'s in-flight arm now gates skipping ahead on can_pipeline(),
  so BUN_FEATURE_FLAG_DISABLE_SQL_AUTO_PIPELINING keeps at most one
  query in flight per connection again (and the write buffer stops
  growing past MAX_PIPELINE_SIZE mid-pipeline). Covered by a new mock
  server test that fails without this gate.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/sql_jsc/postgres/PostgresSQLConnection.rs`:
- Around line 1389-1400: The in-flight close branch calls fail(...) (which via
fail_with_js_value() → ref_and_close() adds an extra poll_ref.ref_()), but only
unrefs once, leaving a dangling ref and potentially pinning the event loop;
update this branch (the block guarded by vm().is_shutting_down() == false &&
status matching Connecting | SentStartupMessage) to use the pre-connected close
path that does not call the ref-adding code or otherwise fully balance refs:
either replace the fail(...) call with a close variant that does not invoke
ref_and_close/ref_ (e.g., a ref_and_close_no_ref or a direct pre-connected close
helper), or, if retaining fail(...), ensure you call poll_ref.with_mut(|r|
r.unref(self.vm_ctx())) one additional time to match the extra ref added by
ref_and_close; reference fail, fail_with_js_value, ref_and_close, and
poll_ref.ref_()/poll_ref.with_mut(|r| r.unref(...)) when making the change.

In `@test/js/sql/postgres-pool-transaction-stall.test.ts`:
- Around line 199-205: The pump(conn: Conn) function can leave conn.busy true if
handleFrame rejects; change the implementation to set conn.busy = true, then run
the frame-processing loop inside a try/finally so conn.busy is always reset to
false, and ensure any errors from handleFrame are not swallowed (remove or
change the surrounding .catch(() => {}) so the rejection is propagated/rewrapped
and causes the test to fail); specifically update pump and the related
data-event handler to rethrow or forward handleFrame errors instead of silencing
them so failures surface.
- Around line 282-311: The test currently sorts the entire milestones list
before comparison (building expect({ steps:
stdout.split(...).filter(...).sort().join("\n"), ...
}).toEqual({...}.sort().join("\n"))), which removes ordering guarantees;
instead, preserve the original sequence from stdout and assert the exact
ordering for all milestones except the one scheduler-dependent pair ("STEP
released" vs "STEP victim resolved"). Change the assertion to use the unsorted
steps (from stdout -> steps) and verify the full sequence matches the expected
milestones in order, but allow the two ambiguous markers by either asserting
that their relative order is one of the two permitted permutations or by
asserting their indices relative to other milestones (e.g., check indexOf("STEP
released") and indexOf("STEP victim resolved") are adjacent and between "STEP
body gate" and "STEP fast resolved"); keep checks for stderr and exitCode
unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: baf3815b-3c26-4af3-b51c-b050f1d488b9

📥 Commits

Reviewing files that changed from the base of the PR and between 08dce70 and 7e2903c.

📒 Files selected for processing (6)
  • src/js/internal/sql/shared.ts
  • src/sql_jsc/postgres/PostgresSQLConnection.rs
  • src/sql_jsc/postgres/PostgresSQLQuery.rs
  • test/js/sql/postgres-pool-pipeline-flag-fixture.ts
  • test/js/sql/postgres-pool-transaction-stall-fixture.ts
  • test/js/sql/postgres-pool-transaction-stall.test.ts

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/sql_jsc/postgres/PostgresSQLConnection.rs`:
- Around line 1389-1400: The in-flight close branch calls fail(...) (which via
fail_with_js_value() → ref_and_close() adds an extra poll_ref.ref_()), but only
unrefs once, leaving a dangling ref and potentially pinning the event loop;
update this branch (the block guarded by vm().is_shutting_down() == false &&
status matching Connecting | SentStartupMessage) to use the pre-connected close
path that does not call the ref-adding code or otherwise fully balance refs:
either replace the fail(...) call with a close variant that does not invoke
ref_and_close/ref_ (e.g., a ref_and_close_no_ref or a direct pre-connected close
helper), or, if retaining fail(...), ensure you call poll_ref.with_mut(|r|
r.unref(self.vm_ctx())) one additional time to match the extra ref added by
ref_and_close; reference fail, fail_with_js_value, ref_and_close, and
poll_ref.ref_()/poll_ref.with_mut(|r| r.unref(...)) when making the change.

In `@test/js/sql/postgres-pool-transaction-stall.test.ts`:
- Around line 199-205: The pump(conn: Conn) function can leave conn.busy true if
handleFrame rejects; change the implementation to set conn.busy = true, then run
the frame-processing loop inside a try/finally so conn.busy is always reset to
false, and ensure any errors from handleFrame are not swallowed (remove or
change the surrounding .catch(() => {}) so the rejection is propagated/rewrapped
and causes the test to fail); specifically update pump and the related
data-event handler to rethrow or forward handleFrame errors instead of silencing
them so failures surface.
- Around line 282-311: The test currently sorts the entire milestones list
before comparison (building expect({ steps:
stdout.split(...).filter(...).sort().join("\n"), ...
}).toEqual({...}.sort().join("\n"))), which removes ordering guarantees;
instead, preserve the original sequence from stdout and assert the exact
ordering for all milestones except the one scheduler-dependent pair ("STEP
released" vs "STEP victim resolved"). Change the assertion to use the unsorted
steps (from stdout -> steps) and verify the full sequence matches the expected
milestones in order, but allow the two ambiguous markers by either asserting
that their relative order is one of the two permitted permutations or by
asserting their indices relative to other milestones (e.g., check indexOf("STEP
released") and indexOf("STEP victim resolved") are adjacent and between "STEP
body gate" and "STEP fast resolved"); keep checks for stderr and exitCode
unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: baf3815b-3c26-4af3-b51c-b050f1d488b9

📥 Commits

Reviewing files that changed from the base of the PR and between 08dce70 and 7e2903c.

📒 Files selected for processing (6)
  • src/js/internal/sql/shared.ts
  • src/sql_jsc/postgres/PostgresSQLConnection.rs
  • src/sql_jsc/postgres/PostgresSQLQuery.rs
  • test/js/sql/postgres-pool-pipeline-flag-fixture.ts
  • test/js/sql/postgres-pool-transaction-stall-fixture.ts
  • test/js/sql/postgres-pool-transaction-stall.test.ts
🛑 Comments failed to post (3)
src/sql_jsc/postgres/PostgresSQLConnection.rs (1)

1389-1400: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Release balancing is incomplete on the in-flight close() branch.

Line 1395 calls fail(...), which goes through fail_with_js_value()ref_and_close() and takes an extra poll_ref.ref_() (Lines 1527-1532). Line 1399 only unrefs once, and this branch explicitly assumes no socket close callback will run, so the extra keepalive ref is left behind. This can pin the event loop after close().

Please use a pre-connected close path that does not add the extra poll_ref ref, or otherwise fully balance both refs in this no-callback path.

As per coding guidelines: “Every error/abort/timeout path actively completes the operation.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/sql_jsc/postgres/PostgresSQLConnection.rs` around lines 1389 - 1400, The
in-flight close branch calls fail(...) (which via fail_with_js_value() →
ref_and_close() adds an extra poll_ref.ref_()), but only unrefs once, leaving a
dangling ref and potentially pinning the event loop; update this branch (the
block guarded by vm().is_shutting_down() == false && status matching Connecting
| SentStartupMessage) to use the pre-connected close path that does not call the
ref-adding code or otherwise fully balance refs: either replace the fail(...)
call with a close variant that does not invoke ref_and_close/ref_ (e.g., a
ref_and_close_no_ref or a direct pre-connected close helper), or, if retaining
fail(...), ensure you call poll_ref.with_mut(|r| r.unref(self.vm_ctx())) one
additional time to match the extra ref added by ref_and_close; reference fail,
fail_with_js_value, ref_and_close, and poll_ref.ref_()/poll_ref.with_mut(|r|
r.unref(...)) when making the change.

Source: Coding guidelines

test/js/sql/postgres-pool-transaction-stall.test.ts (2)

199-205: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t let mock-server frame-handler failures silently wedge the connection.

If handleFrame() rejects once, conn.busy never flips back to false, and the .catch(() => {}) on Line 243 turns that into a silent permanent stall on every later data event for that connection. This makes protocol/parser bugs look like generic test hangs instead of actionable failures.

Suggested hardening
 async function pump(conn: Conn) {
   if (conn.busy) return;
   conn.busy = true;
-  while (conn.frames.length > 0) {
-    await handleFrame(conn, conn.frames.shift()!);
-  }
-  conn.busy = false;
+  try {
+    while (conn.frames.length > 0) {
+      await handleFrame(conn, conn.frames.shift()!);
+    }
+  } finally {
+    conn.busy = false;
+  }
 }
 ...
-      pump(conn).catch(() => {});
+      pump(conn).catch(err => {
+        socket.destroy(err instanceof Error ? err : new Error(String(err)));
+      });

As per coding guidelines, tests should “wire EVERY failure event … to reject” and stay hermetic instead of degrading failures into hangs.

Also applies to: 219-243

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/js/sql/postgres-pool-transaction-stall.test.ts` around lines 199 - 205,
The pump(conn: Conn) function can leave conn.busy true if handleFrame rejects;
change the implementation to set conn.busy = true, then run the frame-processing
loop inside a try/finally so conn.busy is always reset to false, and ensure any
errors from handleFrame are not swallowed (remove or change the surrounding
.catch(() => {}) so the rejection is propagated/rewrapped and causes the test to
fail); specifically update pump and the related data-event handler to rethrow or
forward handleFrame errors instead of silencing them so failures surface.

Source: Coding guidelines


282-311: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t sort away the ordering guarantees this regression is supposed to prove.

The comment says only "STEP released" vs "STEP victim resolved" is scheduler-dependent, but Lines 285-311 sort the entire milestone list before comparing. That lets unrelated reorderings pass unnoticed, which weakens the regression substantially.

Suggested tightening
-    expect({
-      steps: stdout
-        .split(/\r?\n/)
-        .filter(line => line.startsWith("STEP ") || line === "DONE")
-        .sort()
-        .join("\n"),
+    const steps = stdout
+      .split(/\r?\n/)
+      .filter(line => line.startsWith("STEP ") || line === "DONE");
+
+    const expectedA = [
+      "STEP prepared",
+      "STEP armed",
+      "STEP p0 done",
+      "STEP body gate",
+      "STEP released",
+      "STEP victim resolved",
+      "STEP fast resolved",
+      "STEP slow resolved",
+      "STEP tx resolved",
+      "STEP pool alive",
+      "DONE",
+    ];
+    const expectedB = [
+      "STEP prepared",
+      "STEP armed",
+      "STEP p0 done",
+      "STEP body gate",
+      "STEP victim resolved",
+      "STEP released",
+      "STEP fast resolved",
+      "STEP slow resolved",
+      "STEP tx resolved",
+      "STEP pool alive",
+      "DONE",
+    ];
+
+    expect({
+      steps,
       stderr: stderr.includes("WATCHDOG") ? "WATCHDOG" : "",
       exitCode,
     }).toEqual({
-      steps: [
-        "STEP prepared",
-        "STEP armed",
-        "STEP p0 done",
-        "STEP body gate",
-        "STEP released",
-        "STEP victim resolved",
-        "STEP fast resolved",
-        "STEP slow resolved",
-        "STEP tx resolved",
-        "STEP pool alive",
-        "DONE",
-      ]
-        .sort()
-        .join("\n"),
+      steps: expect.toSatisfy(
+        actual => JSON.stringify(actual) === JSON.stringify(expectedA) || JSON.stringify(actual) === JSON.stringify(expectedB),
+      ),
       stderr: "",
       exitCode: 0,
     });

As per coding guidelines, “Every assertion must be able to fail, and must assert the strongest invariant.”

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    const steps = stdout
      .split(/\r?\n/)
      .filter(line => line.startsWith("STEP ") || line === "DONE");

    const expectedA = [
      "STEP prepared",
      "STEP armed",
      "STEP p0 done",
      "STEP body gate",
      "STEP released",
      "STEP victim resolved",
      "STEP fast resolved",
      "STEP slow resolved",
      "STEP tx resolved",
      "STEP pool alive",
      "DONE",
    ];
    const expectedB = [
      "STEP prepared",
      "STEP armed",
      "STEP p0 done",
      "STEP body gate",
      "STEP victim resolved",
      "STEP released",
      "STEP fast resolved",
      "STEP slow resolved",
      "STEP tx resolved",
      "STEP pool alive",
      "DONE",
    ];

    expect({
      steps: steps.join("\n"),
      stderr: stderr.includes("WATCHDOG") ? "WATCHDOG" : "",
      exitCode,
    }).toEqual(
      expect.objectContaining({
        steps: expect.stringMatching(
          new RegExp(
            `^(${expectedA.join("|").replace(/\./g, "\\.")})$|^(${expectedB.join("|").replace(/\./g, "\\.")})$`,
          ),
        ),
        stderr: "",
        exitCode: 0,
      }),
    );
    const steps = stdout
      .split(/\r?\n/)
      .filter(line => line.startsWith("STEP ") || line === "DONE");
    const stepsStr = steps.join("\n");

    const expectedA = [
      "STEP prepared",
      "STEP armed",
      "STEP p0 done",
      "STEP body gate",
      "STEP released",
      "STEP victim resolved",
      "STEP fast resolved",
      "STEP slow resolved",
      "STEP tx resolved",
      "STEP pool alive",
      "DONE",
    ].join("\n");

    const expectedB = [
      "STEP prepared",
      "STEP armed",
      "STEP p0 done",
      "STEP body gate",
      "STEP victim resolved",
      "STEP released",
      "STEP fast resolved",
      "STEP slow resolved",
      "STEP tx resolved",
      "STEP pool alive",
      "DONE",
    ].join("\n");

    expect(stepsStr === expectedA || stepsStr === expectedB).toBe(true);
    expect(stderr.includes("WATCHDOG") ? "WATCHDOG" : "").toBe("");
    expect(exitCode).toBe(0);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/js/sql/postgres-pool-transaction-stall.test.ts` around lines 282 - 311,
The test currently sorts the entire milestones list before comparison (building
expect({ steps: stdout.split(...).filter(...).sort().join("\n"), ...
}).toEqual({...}.sort().join("\n"))), which removes ordering guarantees;
instead, preserve the original sequence from stdout and assert the exact
ordering for all milestones except the one scheduler-dependent pair ("STEP
released" vs "STEP victim resolved"). Change the assertion to use the unsorted
steps (from stdout -> steps) and verify the full sequence matches the expected
milestones in order, but allow the two ambiguous markers by either asserting
that their relative order is one of the two permitted permutations or by
asserting their indices relative to other milestones (e.g., check indexOf("STEP
released") and indexOf("STEP victim resolved") are adjacent and between "STEP
body gate" and "STEP fast resolved"); keep checks for stderr and exitCode
unchanged.

Source: Coding guidelines

Review follow-ups: pump() resets the busy flag in a finally block and
frame-handler failures destroy the socket and surface through a
mockErrors assertion instead of turning into silent fixture hangs, and
the step-marker comparison asserts the exact sequence, normalizing only
the scheduler-dependent released/victim-resolved pair instead of
sorting the whole list.
@robobun

robobun commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator Author

Re the three CodeRabbit findings (the inline threads failed to post, so answering here):

  1. PostgresSQLConnection.rs close-branch ref balancing: that block is from sql: make forced close() resolve while a pool connection is mid-handshake #32097 (5c02cc1040) and is not touched by this PR, but I probed the claim anyway. A script that drives a pool connection mid-handshake against a never-answering server and calls sql.close({ timeout: "0" }) prints the rejected query's ERR_POSTGRES_CONNECTION_CLOSED and then the process exits cleanly on its own. If ref_and_close()'s poll_ref.ref_() were left dangling on this path, the event loop would stay pinned and the probe would hang; it does not, so the refs balance in practice and there is nothing to fix here.

  2. Mock server pump() resilience: applied in cddacc5. pump() now resets busy in a finally, frame-handler failures destroy the socket (so the fixture fails fast instead of hanging), and both tests assert a mockErrors: [] field so any handler error surfaces with its message.

  3. Step-ordering assertion: applied in cddacc5. The tests now compare the exact step sequence; only the adjacent scheduler-dependent pair ("STEP released" / "STEP victim resolved") is normalized before the comparison instead of sorting the whole list.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both follow-ups from my earlier pass (the can_pipeline() gate in the Running/Binding arm and the auto-flusher batching) are applied in 7e2903c and covered by the new pipeline-flag test; nothing further from me, but this reworks the native queue's write-ordering gates enough that a maintainer should sign off.

Extended reasoning...

Overview

The PR fixes a pool stall in Bun.SQL postgres when sql.begin() runs concurrently with pooled prepared queries. JS side (shared.ts): two small, clearly-correct lines — release() removes the connection from readyConnections before handing it to a reserved waiter, and flushConcurrentQueries() filters reserved as well as preReserved. Rust side: PostgresSQLQuery::do_run() no longer writes Bind+Execute inline for already-prepared statements and instead defers to advance(); advance_and_flush()'s gate widens from IS_READY_FOR_QUERY to !WAITING_TO_PREPARE and now schedules the deferred auto-flusher; the Running/Binding arm in advance() now early-returns on !can_pipeline(). Three new test files add a scripted mock-postgres server and two deterministic regression tests.

Security risks

None identified. No auth, crypto, permissions, or untrusted-input parsing is touched. The change affects client-side write ordering and pool bookkeeping; the worst failure mode is a hang or query running inside the wrong transaction (the bug being fixed), not an exposure.

Level of scrutiny

High. This is concurrency-sensitive wire-ordering logic in a production database driver. The advance_and_flush() gate change is a semantic widening (it now runs while pipelined requests are in flight), and removing the do_run() fast-path shifts every reused-prepared-statement execution onto the advance() state machine. The reasoning in the PR description and code comments is sound and I traced the new ordering invariants (FIFO write order, can_pipeline() subsuming the old WAITING_TO_PREPARE/nonpipelinable checks, auto-flusher restoring same-tick batching), but the interaction surface — simple queries, named vs unnamed prepared statements, the disable-pipelining flag, backpressure — is large enough that a maintainer familiar with this queue should review it rather than auto-approve.

Other factors

Since my earlier inline comments, 7e2903c applied both suggested fixes exactly as proposed and added a dedicated test asserting the feature flag keeps one query in flight (verified can_pipeline() at line 1595 does cover WAITING_TO_PREPARE + nonpipelinable_requests, so the consolidated gate at line 2296 is equivalent-or-stricter). cddacc5 hardened the mock-server assertions per CodeRabbit. No CODEOWNERS entry covers these paths. CI failures on the latest build are infra (musl LTO link, agent provisioning), not test failures. The bug-hunting pass on the current head found nothing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SQL: connection pool stalls under concurrent begin() + parameterized queries (idle connections never handed to pending queries)

1 participant