Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 14 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@ members = [
"pageserver/pagebench",
"pageserver/page_api",
"proxy",
"proxy-cplane-api",
"safekeeper",
"safekeeper/client",
"storage_broker",
Expand Down
87 changes: 87 additions & 0 deletions docs/review/00-overview.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
# Review pack — `mel/pool-fixes-and-benchmarks`

You're reviewing five commits sitting on top of Charles's session-pool
WIP commit (`6e871ba05`). Three pieces of work, in dependency order:

1. **Fix the session-pool half-close bug** so reused connections actually
work (`6a361a996`).
2. **Multi-endpoint routing** so the mock control plane can return
different compute addresses for different endpoint IDs
(`fa654b133`).
3. **Benchmark sweep** comparing direct / proxy+pool / proxy without pool,
plus FINDINGS.md (`9ce39d144`).
4. **Transaction-mode pool multiplexing** — release on every
`ReadyForQuery 'I'`, re-acquire from the pool for the next
transaction (`29f0bc6d2`).
5. **Benchmark update** adding a `proxy_txn` configuration and section 5
in FINDINGS.md (`ef3db262b`).

## Reading order

1. **`01-session-pool-fix.md`** — start here. Smallest commit, cleanest
bug story. The fix is ~30 lines but the chain of reasoning that got
there is half the value.
2. **`02-multi-endpoint.md`** — quick read; mostly CLI plumbing and a
lookup-priority decision in `mock.rs`.
3. **`03-benchmarks.md`** — methodology, findings, and the things the
numbers actually say (vs. what they superficially look like). Read
`benchmarks/FINDINGS.md` alongside.
4. **`04-transaction-mode.md`** — the bigger architectural change. Read
the `ReadyForQueryWatcher` and `proxy_pass_transaction_mode`
sections; skim the rest. Compare against `pgcat/src/client.rs`'s
two-loop structure if you want the canonical reference.
5. **`05-open-questions.md`** — the weak points and decisions I'd
challenge if I were reviewing me. Read this last — it'll change
how you read everything else.

## High-level commit map

```
ef3db262b benchmarks: add proxy_txn config and update FINDINGS
29f0bc6d2 transaction-mode pool multiplexing
9ce39d144 benchmarks: pool/no-pool/direct sweep with FINDINGS
fa654b133 multi-endpoint routing via --compute-endpoint-map
6a361a996 session pooling works: don't poll_shutdown(compute) on client Terminate
6e871ba05 basic bb8 pooling <- Charles's last commit
```

`git log --stat --reverse 6e871ba05..HEAD` gives the per-commit diff
sizes. Five commits, ~1200 lines added, ~50 removed.

## What's in scope vs out of scope

| In scope | Out of scope |
| ---------------------------------------------- | ---------------------------------- |
| Session-pool correctness (the half-close bug) | Pool eviction / health checks |
| Per-endpoint routing in the mock control plane | Real cplane integration |
| Transaction-mode multiplexing | `SET` / `LISTEN` detection |
| pgbench sweep + FINDINGS | Cancellation across multiplexed conns |
| Memory measurement at peak | Auth caching |

The five "out of scope" items are all real follow-ups; some are noted in
`05-open-questions.md`.

## How to run things locally

Already documented in `benchmarks/README.md`. The shortest reproducer:

```bash
# Build
cargo build --features testing --bin proxy

# Smoke session pool
RUST_LOG="proxy::tcp_pool=info" PGPASSWORD=testpw target/debug/proxy \
--auth-backend=postgres \
--auth-endpoint='postgresql://proxytest@localhost:5432/proxytest_db' \
--compute-endpoint='postgresql://localhost:5433/proxytest_db?sslmode=disable' \
--proxy=127.0.0.1:4432 --mgmt=127.0.0.1:7000 --http=127.0.0.1:7001 --wss=127.0.0.1:7002 \
--tcp-pool-enabled=true --tcp-pool-max-conns-per-key=5 --tcp-pool-fallback-direct-connect=true

# Run 5 sessions and check the pid is stable (session pooling works)
for i in 1 2 3 4 5; do
PGPASSWORD=testpw psql "postgresql://proxytest@127.0.0.1:4432/proxytest_db?sslmode=disable&options=endpoint%3Dep-test-123" \
-c "SELECT $i, pg_backend_pid()"
done

# Add --tcp-pool-mode=transaction to see multiplex behaviour
```
124 changes: 124 additions & 0 deletions docs/review/01-session-pool-fix.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,124 @@
# 01 — Session-pool half-close bug

Commit `6a361a996` "session pooling works: don't poll_shutdown(compute)
on client Terminate". Three files, ~190 lines net.

## TL;DR

When a client cleanly disconnected (Postgres `'X'` Terminate), the
existing `TerminateFilter` correctly **didn't forward** the Terminate
to compute, but the surrounding bidirectional-copy state machine
**still ran `poll_shutdown` on the compute writer** as part of its
"reader hit EOF, shut down the writer" cleanup path. That's
`shutdown(SHUT_WR)` on the live compute socket. Postgres saw the FIN,
closed its side, and the next pool reuse hit `ENOTCONN`.

Fix: a parallel `transfer_one_direction_no_shutdown` that goes
`Running → Done` directly, used for the client→compute direction in
the pooled bidir copy. Compute's writer is never half-closed.

## The chain of reasoning that got here

This took several iterations to localize. The diagnostic timeline
(captured by adding probes that I later removed) was:

```
20.247745 tcp pool: opened new connection
20.249371 TerminateFilter: intercepted Postgres Terminate
20.249403 transfer_one_direction: entering ShuttingDown ← BUG
20.249433 poll_shutdown(writer) returned Ready(Ok) ← FIN sent
20.249507 poll_write(empty) error: Broken pipe ← write half dead
20.249559 tcp pool: returning connection ← broken conn pooled
[iter 2, 243ms later]
20.493396 peer_addr error: Invalid argument ← peer gone
20.493689 per-client task IO error: Socket is not connected
```

The bug is that `TerminateFilter::poll_read` returns `Ready(Ok(()))`
with **zero bytes filled** when it intercepts `'X'`. `CopyBuffer`
treats zero-fill as EOF (`me.read_done = me.cap == filled_len`), and
`poll_copy` returns `Ready(Ok(amt))`. `transfer_one_direction` then
transitions:

```
Running → ShuttingDown → poll_shutdown(writer) → Done
```

That `poll_shutdown(writer)` is on the compute side. It's the bug.

The early `saw_terminate()` check the existing pooled bidir copy
**did** have was checked *after* `transfer_one_direction` returned —
too late, the shutdown had already happened inside the call.

## The fix

`transfer_one_direction_no_shutdown` is the same state machine minus
the `ShuttingDown` step:

```rust
TransferState::Running(buf) => {
let count = ready!(buf.poll_copy(cx, r.as_mut(), w.as_mut()))?;
*state = TransferState::Done(count); // skip ShuttingDown
}
TransferState::ShuttingDown(count) => {
*state = TransferState::Done(*count); // dead branch, but keep
}
TransferState::Done(count) => return Poll::Ready(Ok(*count)),
```

In `copy_bidirectional_client_compute_pooled`, swap the
client→compute call to the no-shutdown variant:

```rust
let client_to_compute_result = transfer_one_direction_no_shutdown(
cx, &mut client_to_compute, &mut filtered_client, compute,
).map_err(ErrorSource::from_client)?;
```

The compute→client direction keeps using the regular variant — if
compute really closes, shutting down the client side is correct.

## Files

- `proxy/src/pglb/copy_bidirectional.rs` — the fix
- `proxy/src/pglb/passthrough.rs` — unchanged (probes removed)
- `proxy/src/proxy/mod.rs`, `proxy/src/tcp_pool.rs` — small unrelated
prep work in this commit (was-reused signal, pool key on checkout
via `Option<...>` tuple)

## Read in this order

1. `proxy/src/pglb/copy_bidirectional.rs:43-95` — read both
`transfer_one_direction` and `transfer_one_direction_no_shutdown`
side by side. The diff is exactly the deletion of the
`poll_shutdown` call.
2. `proxy/src/pglb/copy_bidirectional.rs:120-235` — read
`copy_bidirectional_client_compute_pooled` and find the call site
that uses the no-shutdown variant. The CRITICAL comment is a
load-bearing invariant.
3. `proxy/src/pglb/copy_bidirectional.rs::TerminateFilter` (around
line 460) — to understand what the filter actually emits when it
intercepts `'X'`. That's the input the bidir copy is responding
to.

## Things I'd challenge in review

- **The `ShuttingDown` arm in `_no_shutdown` is dead code.** It exists
to keep the state machine total. Pragmatic but a bit ugly. An
alternative is a separate state enum.
- **Why not just have `TerminateFilter` return `Pending` instead of
EOF on intercept?** I considered this in passing — it would avoid
the EOF-cascade entirely. The downside is the filter would need to
self-wake or we'd hang. Strictly cleaner but adds wake-up plumbing.
Worth challenging.
- **`compute→client` still uses the shutdown-variant.** That's fine
*today* because compute closing means the conn is dead anyway. But
it does mean a slow-loris compute (sends a few bytes, then EOFs)
half-closes the client. In our case the client is being torn down
anyway. Edge-case unlikely to matter in practice.
- **The probes I used to find this bug were temporary** (zero-byte
`poll_write` and `peer_addr` checks at release/reuse, plus a
`entering ShuttingDown` log line). I stripped them before the
commit. If you want to see them, look at the diff history for the
fix on `mel/pool-fixes-and-benchmarks` — it's in the conversational
history but was never committed.
101 changes: 101 additions & 0 deletions docs/review/02-multi-endpoint.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,101 @@
# 02 — Multi-endpoint routing

Commit `fa654b133` "multi-endpoint routing via --compute-endpoint-map".
Two files, ~57 lines added.

## TL;DR

Charles's earlier `--compute-endpoint` flag was a global override:
every `wake_compute` call used the same compute address regardless of
the endpoint ID in the connection string. This commit adds
`--compute-endpoint-map` to the mock control plane, threads the
`EndpointId` from `ComputeUserInfo` into `do_wake_compute`, and uses
a three-tier lookup so the existing single-tenant test keeps working.

## Lookup priority (in `do_wake_compute`)

```
self.compute_endpoint_map.get(endpoint) // per-endpoint map (new)
.or(self.compute_endpoint.as_ref()) // global override (existing)
.unwrap_or(&self.auth_endpoint) // legacy fallback (existing)
```

This means:
- Endpoint IDs in the map → routed to their mapped address
- Endpoint IDs not in the map → fall through to `--compute-endpoint`
if set
- Otherwise → `--auth-endpoint` (the original behaviour)

So adding a map doesn't change behaviour for IDs that aren't in it.
That kept Charles's single-tenant flow untouched.

## Flag format

```
--compute-endpoint-map='ep-A=postgresql://localhost:5433/db?sslmode=disable,ep-B=postgresql://localhost:5434/db?sslmode=disable'
```

Comma-separated entries; first `=` separates endpoint id from URL.
URLs reuse the same `parse_compute_endpoint` validator (must include
hostname, sslmode if present must be `disable` or `require`).

## Files

- `proxy/src/control_plane/client/mock.rs` — `compute_endpoint_map:
HashMap<EndpointId, ApiUrl>` field on `MockControlPlane`,
`do_wake_compute(&EndpointId)` lookup, `wake_compute` plumb-through.
- `proxy/src/binary/proxy.rs` — `--compute-endpoint-map` clap arg,
`parse_compute_endpoint_map` parser.

## Read in this order

1. `proxy/src/control_plane/client/mock.rs::do_wake_compute` —
the lookup logic (the chained `or` / `unwrap_or` is the whole
feature).
2. `proxy/src/control_plane/client/mock.rs::wake_compute` (in the
`ControlPlaneApi` impl) — confirms `user_info.endpoint` is what
gets passed in.
3. `proxy/src/binary/proxy.rs::parse_compute_endpoint_map` — the
parser. ~20 lines, validates duplicates and missing separators.

## How it was verified

Two docker Postgres instances (`compute-A` on 5433, `compute-B` on
5434), each with a `marker` table containing `'compute-A'` /
`'compute-B'`. Same SCRAM verifier copied from auth Postgres so
SCRAM-passthrough works on both. Then:

```bash
# Through the proxy with the map:
psql ".../?options=endpoint%3Dep-A" -c "SELECT val FROM marker"
# returns: compute-A

psql ".../?options=endpoint%3Dep-B" -c "SELECT val FROM marker"
# returns: compute-B
```

Plus 5x `pg_backend_pid()` per endpoint to confirm pool isolation:
each endpoint's pool is keyed independently by `(endpoint, db, role)`,
so within `ep-A` the pid is stable (pool reuse) and across A/B the pids
differ (different computes).

## Things I'd challenge in review

- **Why not a database-backed mapping?** The brief noted that real
Neon uses `neon_control_plane.endpoints`. I picked the CLI-flag
approach because it's faster to iterate on and doesn't require
schema changes; the brief explicitly suggested this trade-off.
A real cplane would query the db.
- **`HashMap` lookup is sync inside `do_wake_compute`'s async fn.**
Fine — the map is read-only after startup. If it ever became
mutable, would need a `RwLock`.
- **No support for SSL-mode beyond `disable`/`require`.** Inherited
from Charles's `parse_compute_endpoint`. Anything unrecognised
errors at parse time.
- **Endpoint id in the auth context is actually `EndpointId` (a
smol_str wrapper),** so the HashMap key works without allocations.
Look at `proxy/src/types.rs::smol_str_wrapper!` to confirm.
- **Cancellation correctness**: I haven't audited whether
cross-endpoint cancellation routes correctly. Probably fine because
`BackendKeyData` is per-session and the cancel path doesn't go
through `wake_compute`. But worth confirming if you're paranoid.
Loading