fix(port): plug mp-pool retain and fd/buffer leaks in IPC reply paths by andypost · Pull Request #8 · andypost/unit

andypost · 2026-05-07T22:44:36Z

Summary

Fixes latent leaks in the cert/script/socket/access-log IPC reply paths, all reachable when nxt_port_msg_alloc() (or the RPC stream-id pool) hits malloc failure inside the port machinery. Pre-existing in upstream Unit; identified during review of #6.

The fix is intentionally narrow — no new helper, no protocol changes — so it's a clean candidate to forward to freeunitorg/freeunit after review here.

Leak A — sender side: unbalanced `nxt_mp_retain`

nxt_cert_store_get / nxt_script_store_get issued nxt_mp_retain(mp) before nxt_port_socket_write. Failure paths between the retain and a successful send (stream alloc failure, socket_write failure) left mp->retain permanently incremented. The buffer's completion handler — the only thing that calls nxt_mp_release — never runs in those paths. Since nxt_mp_destroy gates on if (mp->retain == 0) (src/nxt_mp.c:302), the entire pool stays resident.

Concrete impact: mp is the router_temp_conf mem pool — owns the entire pending config (sockets, bundles, routes, app refs). One unbalanced retain pins the whole config until process exit.

Fix: move nxt_mp_retain to after successful socket_write, matching the existing correct pattern in nxt_router_access_log_reopen (src/nxt_router_access_log.c:579). Buffer completion is dispatched via nxt_work_queue_add and runs only after the current handler returns, so retaining synchronously after the send is race-free.
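
The ordering change can be illustrated with a minimal sketch. It does not use Unit's real types: toy_pool_t, toy_send, request_retain_first and request_retain_after_send are hypothetical stand-ins, and the sketch assumes only what the text above states, namely that a failed send never runs the completion handler that would release the pool.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted memory pool. */
typedef struct {
    int  retain;    /* outstanding references; the pool is freed only at 0 */
} toy_pool_t;

/* Hypothetical stand-in for the port send call: fails on demand. */
static bool
toy_send(bool simulate_failure)
{
    return !simulate_failure;
}

/* Old ordering: retain first, so a failed send leaves retain stuck at 1. */
static bool
request_retain_first(toy_pool_t *mp, bool simulate_failure)
{
    assert(mp != NULL);

    mp->retain++;

    /* On failure no completion handler runs, so nobody releases. */
    return toy_send(simulate_failure);
}

/* New ordering: retain only once the send has succeeded. */
static bool
request_retain_after_send(toy_pool_t *mp, bool simulate_failure)
{
    assert(mp != NULL);

    if (!toy_send(simulate_failure)) {
        return false;       /* nothing was retained, nothing to undo */
    }

    mp->retain++;           /* the completion handler releases this later */

    return true;
}
```

With the second ordering a failed send leaves the counter untouched, so a later destroy of the pool is not blocked.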

Leak B — receiver side: fd not closed when send fails

nxt_cert_store_get_handler, nxt_script_store_get_handler, nxt_main_port_socket_handler, and nxt_main_port_access_log_handler all reply to RPC requests with NXT_PORT_MSG_CLOSE_FD. Inside the port layer the fd is closed by nxt_port_msg_close_fd() at three sites (nxt_port_socket.c:455, :519, :1361). The hole: when nxt_port_msg_alloc inside nxt_port_msg_chk_insert returns NXT_ERROR, the message never enters port->messages and nxt_port_error_handler never sees it. The (void) cast on the call meant the handler couldn't react even if the return signaled failure.

Concrete impact: a leaked file descriptor in the privileged main process, one per failure. Cert PEM, script blob, listening socket, or access-log file — visible in /proc/$PID/fd, accumulates over reload churn, eventually hits RLIMIT_NOFILE. The socket-listen reply additionally leaked a small diagnostic buffer from the engine mem_pool.

Fix: capture the nxt_port_socket_write return value; on != NXT_OK close the fd explicitly using the appropriate closer (nxt_socket_close for the listening socket, nxt_file_close for the access-log file, nxt_fd_close for the cert/script handlers — see comment-thread on nxt_file_close's %FN UAF risk after nxt_free(file.name)). Run out->completion_handler(...) on the diagnostic buffer in the socket-listen path.
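
A minimal POSIX sketch of the receiver-side rule follows. The helpers toy_send_fd, fd_is_open and reply_with_fd are hypothetical and only model the ownership idea (on failure the caller still owns the descriptor and has to close it); they are not Unit's port API.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Hypothetical transport: on success it takes ownership of fd and closes
 * it after "delivery"; on failure ownership stays with the caller.
 */
static bool
toy_send_fd(int fd, bool simulate_failure)
{
    if (simulate_failure) {
        return false;
    }

    close(fd);

    return true;
}

/* Returns true while fd is still an open descriptor in this process. */
static bool
fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1 || errno != EBADF;
}

/* Reply with an owned descriptor and clean up if the send fails. */
static bool
reply_with_fd(int fd, bool simulate_failure)
{
    assert(fd >= 0);

    if (!toy_send_fd(fd, simulate_failure)) {
        close(fd);          /* caller-side cleanup on the failure path */
        return false;
    }

    return true;
}
```

Checking the descriptor with fcntl(fd, F_GETFD) after the call is one way a test could confirm that neither path leaves it open.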

Files changed

File	Change
`src/nxt_cert.c`	move retain after `socket_write`; close fd on receiver send failure
`src/nxt_script.c`	move retain after `socket_write`; close fd on receiver send failure
`src/nxt_main_process.c`	close listening socket + reclaim diagnostic buffer in `nxt_main_port_socket_handler`; close access-log file in `nxt_main_port_access_log_handler`
`CHANGES`	bugfix note

+84 / −10 across 4 files. No protocol or config-surface changes.

Test plan

Configure clean (./configure --openssl && ./configure python)
Full build clean
pytest test/test_tls.py test/test_tls_sni.py — 29 passed (TLS reload paths exercise cert_store_get sender side)
pytest test/test_configuration.py test/test_access_log.py test/test_tls.py — 63 passed; 2 pre-existing IPv6 environment failures ([::1]:8082 — Address family not supported by protocol), confirmed unrelated by re-running on stock master via git stash.

A deterministic test for the leak paths themselves would need malloc-failure injection (LD_PRELOAD shim or AddressSanitizer with allocator hooks) — deferred. The triggers are rare in practice; the fix is small enough to review on inspection.
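
As a rough idea of what such injection could look like, here is a minimal "fail the Nth allocation" wrapper. It is only a sketch: inject_reset and inject_malloc are hypothetical names, and a real harness for Unit would have to hook the allocator the port code actually uses (for example through an LD_PRELOAD shim), which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical test hook: the allocation with this 1-based index fails. */
static unsigned long  fail_at;        /* 0 disables injection */
static unsigned long  alloc_count;

/* Arm the injector so that the nth call to inject_malloc() fails. */
static void
inject_reset(unsigned long nth)
{
    fail_at = nth;
    alloc_count = 0;
}

/* Wrapper that code under test would call instead of malloc(). */
static void *
inject_malloc(size_t size)
{
    assert(size > 0);

    alloc_count++;

    if (fail_at != 0 && alloc_count == fail_at) {
        return NULL;                  /* simulated out-of-memory */
    }

    return malloc(size);
}
```

Sweeping nth from 1 upward until a run completes without a simulated failure would visit each allocation site on the path once.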

Upstream

Both leak shapes live in upstream code (freeunitorg/freeunit); after merging here the same diff should be forwarded upstream. PR #6 (OCSP stapling) inherits the same shapes for its OCSP twin functions and has been rebased onto this pattern in commit 52c9b54.

Generated by Claude Code

SSL_ERROR_SYSCALL(errno=0) and SSL_ERROR_ZERO_RETURN both indicate the peer closed the connection. On the read path this is a clean EOF; on the write path it means no further data can be sent and the loop must terminate. Before this fix the write path fell through to the read-side handler, setting socket.closed=1 and returning success, causing the router to retry SSL_write indefinitely until the event engine timed out.

- SSL_ERROR_SYSCALL + WRITE: use sys_err if non-zero, else ECONNRESET
- SSL_ERROR_ZERO_RETURN + WRITE: always ECONNRESET

Also bumps OpenSSL in CI from 3.6.0 to 3.6.2.

Closes nginx#28

test: fix process filter for single-digit PIDs in containers

Substring match `main_pid in l` false-positives when unit gets a low PID (e.g. 9) in Docker: "9" matches "2189" in zombie entries. Word-boundary regex \b<pid>\b fixes the check.

- Remove TIPC (domain 40) deny rule: the profile replaces Docker's default entirely when --security-opt is used, so only the targeted AF_ALG block belongs here (single-purpose profile)
- README: clarify mitigation ≠ fix, add explicit "not applied automatically" warning, fix verify examples to use latest-python3.13-slim (latest-minimal has no python3)
- CI: run seccomp tests against both python:3.13-slim-trixie and ghcr.io/freeunitorg/freeunit:latest-python3.13-slim
- test script: drop TIPC test (rule removed), update comment

Address review feedback on the issue nginx#28 fix:

* test_tls.py: replace the broad `SSL_write.+failed` skip with the specific syscall/zero-return signatures this test produces, so unrelated SSL_write regressions are not silently masked.
* test_tls.py: bump the response body from 1 MB to 16 MB so the server is reliably mid-write when the client tears the connection down on hosts with large autotuned SO_SNDBUF.
* conftest.py: pre-compile the main-PID match and use re.escape() for safety; minor cleanup of the per-line search.
* nxt_openssl.c: drop the redundant `!= 0` in the ternary to match surrounding style.

No functional change to the TLS fix itself.

fix(tls): stop SSL_write busy-loop on peer-initiated close

gemini-code-assist

Code Review

This pull request addresses memory pool and file descriptor leaks in the IPC and reply paths by deferring memory retention and ensuring resources are closed if port writing fails. The review feedback identifies several critical issues: missing initialization of b->data which would lead to crashes in completion handlers, potential buffer leaks when socket writes fail, and inconsistent use of file closing APIs where nxt_file_close should be preferred over nxt_fd_close.

andypost · 2026-05-07T22:58:06Z

During the pedantic audit of PR #7 (P1 graceful-shutdown plumbing) I caught two leak findings — one is the same pattern your gemini-review thread on src/nxt_main_process.c:1184 already flagged, and one is unrelated upstream baggage. Recording them here so neither is lost between PRs.

1. Buffer-leak pattern extends to three additional sites (same shape as gemini's review comment)

The gemini review thread on nxt_main_process.c:1184 correctly flags that when nxt_port_socket_write(... b) returns non-OK, the engine-mem_pool-backed buffer is never released by anyone — it leaks until process exit because no completion handler ever runs.

The same pattern is now in three new sites introduced by PR #7:

src/nxt_runtime.c:511 — nxt_runtime_stop_app_processes() cascade
src/nxt_runtime.c:533 — nxt_runtime_stop_all_processes() cascade
src/nxt_application.c:716 — nxt_proto_quit_children() cascade

All three call (void) nxt_port_socket_write(... NXT_PORT_MSG_QUIT, ..., b) where b is a 1-byte buffer allocated from task->thread->engine->mem_pool via nxt_runtime_quit_buf(). The (void) cast discards the failure indicator, so the same leak shape applies — though the blast radius is small (≤1 byte per failed send, only on the QUIT cascade during shutdown, mem_pool is destroyed at process exit).

When PR #8 generalises the "free-on-send-failure" pattern (or once the gemini suggestion lands as committed code), it would be cleanest to do the same audit across git grep "(void) nxt_port_socket_write" so all in-tree sites get the same treatment, including these three. Worth a follow-up rather than scope creep into PR #8.

If you want to handle it directly in PR #8, the simplest shape that matches your existing fix:

b = nxt_runtime_quit_buf(task, rt->quit_mode);

if (nxt_port_socket_write(task, port, NXT_PORT_MSG_QUIT, -1, 0, 0, b) != NXT_OK
    && b != NULL)
{
    b->completion_handler(task, b, b->parent);
}

(Mirrors the gemini suggestion on nxt_main_process.c:1184. The b != NULL guard is required because nxt_runtime_quit_buf returns NULL on the NORMAL fast-exit path by design — no allocation, nothing to release.)

Not blocking PR #7; happy to do a follow-up either here or on a separate port-quit-cascade-leaks branch once PR #8 lands and establishes the pattern.

2. Pre-existing 136-byte LeakSanitizer finding in `nxt_var_index_init`

While running PR #7's tests under ASan I caught this leak, which reproduces on plain origin/master with no PR #7 or PR #8 changes — different subsystem, different cause:

==13218==ERROR: LeakSanitizer: detected memory leaks
Direct leak of 136 byte(s) in 1 object(s) allocated from:
    #1 nxt_memalign           src/nxt_malloc.c:134
    #2 nxt_var_index_init     src/nxt_var.c:296
    #3 nxt_runtime_create     src/nxt_runtime.c:134
    #4 main                   src/nxt_main.c:27

nxt_var_index_init allocates a hash structure during nxt_runtime_create and never frees it. Out of scope for this PR (different file, different subsystem) but worth filing as its own issue / branch — useful when the lifecycle work in PR #7 + later phases enables ASan-clean shutdown. I haven't checked whether this is a true leak (lifetime-of-process allocation) or a missing teardown, but it does fail LeakSanitizer: detect_leaks=1 on a clean run.

Let me know if you'd prefer the three QUIT-cascade sites folded into PR #8 or kept as a follow-up.

Generated by Claude Code

andypost · 2026-05-07T23:08:56Z

from #6 and #7 fix.md

andypost · 2026-05-07T23:16:16Z

/gemini review

gemini-code-assist

Code Review

This pull request addresses memory pool retention leaks and file descriptor leaks within the certificate, script, and configuration store IPC paths. The fixes involve delaying memory pool retention until after successful hand-off to the port machinery and implementing explicit cleanup for file descriptors and buffers when port write operations fail. Review feedback suggests simplifying a conditional check in nxt_main_port_socket_handler by removing a redundant flag check, as the file descriptor's validity already implies the necessary state in that context.

andypost · 2026-05-07T23:34:17Z

/gemini review

Phase 1 of roadmap/plan-graceful-shutdown.md (on the `roadmap` branch). Splits the previously-identical SIGTERM and SIGQUIT handlers in src/nxt_main_process.c and plumbs the choice as a quit_param byte through NXT_PORT_MSG_QUIT so libunit's already-implemented nxt_unit_quit(NXT_PORT_QUIT_GRACEFUL) actually gets invoked. The plumbing also propagates through the prototype -> child cascade so SIGQUIT to the unitd master delivers GRACEFUL semantics to every libunit context, not just the children main contacts directly.

Wire format
-----------

A new public enum nxt_port_quit_mode_t in nxt_port.h sits next to NXT_PORT_MSG_QUIT itself:

    NXT_PORT_QUIT_NORMAL   = 0   /* fast exit, drop in-flight */
    NXT_PORT_QUIT_GRACEFUL = 1   /* drain in-flight before exit */

libunit at nxt_unit.c:1056-1070 already parses this byte; it falls back to NXT_PORT_QUIT_NORMAL when the message arrives without a payload. We therefore send NO payload on the fast-exit path and exactly one byte on the graceful path -- both directions of the asymmetry are the safe ones:

* Pre-P1 senders (and the seven NXT_PORT_MSG_QUIT call sites deliberately left NULL in the audit below) keep producing the safe NORMAL behaviour with no wire change.
* Allocation failure under GRACEFUL silently degrades to NORMAL with a NXT_LOG_WARN entry so the operator sees that the cascade leg fell back to fast exit under memory pressure.
* SIGTERM, the fast-exit path by definition, performs zero additional allocations on its way to nxt_runtime_quit().

Skipping the allocation on the NORMAL path (and thus on the seven deliberately-NULL call sites) is gemini-code-assist's review suggestion on the initial PR; this commit lands the squashed result.

Prototype cascade
-----------------

nxt_runtime_stop_all_processes() (called from main on SIGQUIT) walks rt->processes and sends NXT_PORT_MSG_QUIT to every port, including each app worker AND the prototype. The prototype then runs nxt_proto_quit_handler() which previously cascaded a *second* QUIT message to its children with a NULL payload -- libunit defaulted that to NORMAL, creating a race: whichever message reached a child first decided GRACEFUL vs NORMAL behaviour. On a busy or slow box the cascade could win and silently downgrade SIGQUIT to a fast exit.

The prototype handler now reads the quit_param byte from main's QUIT message and forwards it through nxt_proto_quit_children() unchanged. Both messages now agree, the race is benign, and GRACEFUL reaches every child regardless of arrival order. Unknown payload bytes (anything other than 0 or 1) are normalised to NORMAL on read so a malformed sender cannot propagate a bogus byte through the worker pool.

Buffer ownership
----------------

nxt_runtime_quit_buf() returns NULL for NORMAL (no allocation) or a one-byte mem_pool buffer for GRACEFUL. All three send sites (nxt_runtime_stop_app_processes, nxt_runtime_stop_all_processes, nxt_proto_quit_children) capture the nxt_port_socket_write return value and explicitly run b->completion_handler when send fails and b is non-NULL -- otherwise the GRACEFUL payload would leak from the engine memory pool, the same shape PR #8 fixes for the cert/script/conf-store IPC paths.

src/ changes
------------

* src/nxt_main_process.c -- SIGTERM now sets rt->quit_mode = NXT_PORT_QUIT_NORMAL; SIGQUIT sets NXT_PORT_QUIT_GRACEFUL. The /* TODO: fast exit */ and /* TODO: graceful exit */ comments are gone.
* src/nxt_runtime.h -- new uint8_t quit_mode next to other small flags (no struct bloat). Documented as a nxt_port_quit_mode_t. Exports nxt_runtime_quit_buf() so nxt_application.c can use the same allocator.
* src/nxt_runtime.c -- nxt_runtime_quit_buf(task, quit_param) returns NULL for NORMAL (no allocation), one-byte buffer for GRACEFUL, and a NXT_LOG_WARN + NULL on alloc failure. Both nxt_runtime_stop_app_processes() and nxt_runtime_stop_all_processes() call it with rt->quit_mode and release the buffer on send failure.
* src/nxt_application.c -- nxt_proto_quit_handler() reads the quit_param byte from msg, normalises unknown values to NORMAL, and forwards it via the new nxt_proto_quit_children(task, quit_param) signature with the same buffer-release-on-failure handling. Direct signal handler nxt_proto_sigterm_handler() passes NXT_PORT_QUIT_NORMAL explicitly: signals to the prototype are not the user-initiated lifecycle path (that is main -> NXT_PORT_MSG_QUIT) and the historical fast-exit semantics are preserved.
* src/nxt_port.h -- promotes nxt_port_quit_mode_t to a public enum alongside NXT_PORT_MSG_QUIT.
* src/nxt_unit.c -- existing local NXT_QUIT_NORMAL / NXT_QUIT_GRACEFUL identifiers (used at 10+ libunit call sites) become #define aliases of the public names so a compile-time mismatch between the daemon-side and libunit-side values is structurally impossible -- the preprocessor substitutes the same enum value into every reference. No churn at the call sites.

NXT_PORT_MSG_QUIT call-site audit
---------------------------------

    src/nxt_runtime.c:511         stop_app_processes    plumbed (rt->quit_mode) + buffer release on send failure
    src/nxt_runtime.c:533         stop_all_processes    plumbed (rt->quit_mode) + buffer release on send failure
    src/nxt_application.c:716     proto_quit_children   plumbed (cascaded byte) + buffer release on send failure
    src/nxt_main_process.c:1038   orphan reaping        NULL (defensive cleanup)
    src/nxt_router.c:932          prototype replaced    NULL (P6 territory)
    src/nxt_router.c:4536-4600    port-ready handlers   NULL (P6 territory)
    src/nxt_router.c:5043         idle-pool shrink      NULL (NORMAL is right)
    src/nxt_router.c:5142         app-free cleanup      NULL (out of P1 scope)

Phases P5/P6 will revisit the router sites once the listener drain and reload endpoint exist.

Tests
-----

test/test_graceful_reload.py is new. Three functional tests plus one skipped placeholder:

* test_sigquit_completes_inflight_request: SIGQUIT to main must take libunit's GRACEFUL branch. Asserts on the *absence* of "active request on ctx quit" at nxt_unit.c:5816 -- that marker fires only in the NORMAL branch's force-close loop, so its absence is positive evidence GRACEFUL was taken. Uses the ASGI delayed app so libunit's add_reader can dispatch the QUIT mid-request (a synchronous WSGI worker blocked in time.sleep would not pump libunit's message loop and the test would pass for the wrong reason). We do *not* assert on the response body because P1 plumbs GRACEFUL through libunit only; the router still tears down on QUIT (router-side drain is P5), so the client TCP connection RSTs the moment the router exits regardless of whether the worker drains gracefully.
* test_sigterm_drops_inflight_request: asserts the *presence* of the same marker -- positive evidence the NORMAL fast-exit branch ran. Inverse of the SIGQUIT test.
* test_sigint_takes_normal_path: regression guard against signal table edits that would re-route SIGINT through the SIGQUIT handler. Same shape as the SIGTERM test.
* test_quit_message_carries_quit_param is a skipped placeholder documenting the wire-format intent; verifying the byte directly would require C-level instrumentation.

A module-scoped autouse fixture skips the file with an actionable message when --restart is missing -- the autouse run fixture in conftest.py crashes teardown otherwise (PUT /config to a dead daemon).

Verified
--------

    ./configure --tests --modules=python && ./configure python \
        --config=python3-config && make -j$(nproc)                # clean
    python3 -m pytest test/test_graceful_reload.py --restart     # 3 pass, 1 skip
    python3 -m pytest test/test_idle_close_wait.py --restart     # 2 pass
    python3 -m pytest test/test_procman.py --restart             # 11 pass, 2 skip,
                                                                 # 1 pre-existing
                                                                 # flake unrelated
                                                                 # to P1

ASAN build (-fsanitize=address): 0 leaks attributable to nxt_runtime_quit_buf or any P1 code path. Sole LeakSanitizer report is a 136-byte pre-existing leak in nxt_var_index_init (src/nxt_var.c:296), reproducible on plain origin/master with no P1 changes -- separate issue.
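
The payload rule from the commit message above (no payload means NORMAL, unknown bytes are normalised to NORMAL) can be sketched as a small decoder. The enum values are the ones quoted in the message; the helper quit_mode_from_payload is a hypothetical name used for illustration and is not claimed to match the code in nxt_application.c or nxt_unit.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values as quoted in the commit message. */
typedef enum {
    NXT_PORT_QUIT_NORMAL   = 0,   /* fast exit, drop in-flight */
    NXT_PORT_QUIT_GRACEFUL = 1    /* drain in-flight before exit */
} nxt_port_quit_mode_t;

/* Hypothetical helper: decode the optional one-byte QUIT payload. */
static nxt_port_quit_mode_t
quit_mode_from_payload(const uint8_t *payload, size_t len)
{
    assert(payload != NULL || len == 0);

    if (len == 0) {
        return NXT_PORT_QUIT_NORMAL;      /* pre-P1 senders: no payload */
    }

    if (payload[0] == NXT_PORT_QUIT_GRACEFUL) {
        return NXT_PORT_QUIT_GRACEFUL;
    }

    /* 0 and any unknown byte both collapse to the safe default. */
    return NXT_PORT_QUIT_NORMAL;
}
```

Defaulting to NORMAL on anything unexpected keeps a malformed sender from pushing an arbitrary byte further down the cascade.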

gemini-code-assist

Code Review

This pull request fixes memory pool leaks and file descriptor/buffer leaks in the certificate, script, and access-log IPC paths. The changes ensure that memory pools are only retained after successful handoff to the port machinery and that file descriptors and buffers are explicitly closed or reclaimed if nxt_port_socket_write fails. I have no feedback to provide as the review comments were purely explanatory and did not identify any additional issues.

andypost · 2026-05-07T23:56:59Z

promoted upstream freeunitorg#56

Static review of master @ 7b12696 across 14 vectors (HTTP parsing, routing, TLS, control API, port IPC, isolation, language SAPIs, libunit ABI, static files, WebSocket, memory pool, FD lifetime). Findings: 1 Critical, 11 High, 24 Medium, 7 Low, 2 Informational. Each finding includes file:line reference, class, trigger, impact, suggested fix, and a `PR:` slot tying it to one of nine planned remediation PRs (PR-A through PR-I). Two findings excluded by maintainer DoS policy are flagged as such. Top 5 picks (auditor): V6 cgroup TOCTOU (Critical), V3 missing SSL_CTX_check_private_key, V5 untrusted shmem chunk_id, V9 Java InputStream.readLine bounds, V12 WebSocket frame_size loop bug. The Remediation tracker section gives an at-a-glance view; per-finding `PR:` bullets flip to merged-PR references as fixes land, so the file doubles as a remediation log. PR #8 (port-IPC retain/fd leaks) is acknowledged in the Appendix as the precedent. No source changes; documentation only.

andypost · 2026-05-08T01:23:56Z

Caught during the pedantic audit of PR #12 (P3 write-path Pattern D′ — sibling fix family to PR #8). Same subsystem (src/nxt_port_socket.c), same "port IPC accounting" theme, so flagging here rather than opening a separate issue.

Finding: `nxt_port_queue_read_handler` leaks `queue->nitems` on two suspend-message error paths

nxt_port_queue_read_handler (src/nxt_port_socket.c:812) maintains the queue->nitems counter as a reader-semaphore: it does nxt_atomic_fetch_add(&queue->nitems, 1) at function entry (:830), and the function relies on exactly one matching -1 on every return path.

The matched paths are:

:885 — if (n < 0 && !port->socket.read_ready) early return ✅
:922 — b == NULL OOM teardown ✅ (P3 PR #12, fix(io): generalize PR #54 write-path contract to non-TLS sites)

The two unbalanced exits are inside the if (n > 0) suspend-message block:

// :962 — suspend-message smsg alloc failure
smsg = nxt_mp_alloc(port->mem_pool, sizeof(nxt_port_recv_msg_t));
if (nxt_slow_path(smsg == NULL)) {
    nxt_alert(task, "port{%d,%d} %d: suspend message failed", ...);
    return;          // <-- nitems leak: never decremented
}

// :974 — "too many suspend messages"
} else {
    if (nxt_slow_path(smsg->size != 0)) {
        nxt_alert(task, "port{%d,%d} %d: too many suspend messages", ...);
        return;      // <-- nitems leak: never decremented
    }
}

Impact

queue->nitems is consulted by nxt_port_queue_send to decide whether the receiver is awake (it backs the wake-up notify; see the notify parameter at :194 and the port->queue != NULL && type != _NXT_PORT_MSG_READ_QUEUE branch in nxt_port_socket_write2). A leaked +1 means future senders see the receiver as "always still busy" and skip the wake-up notify for as long as the leaked count persists.

Triggers are rare in practice — nxt_mp_alloc failure on the port mem_pool, or a port->socket_msg already populated with a nonzero size when we try to suspend a new one. But when triggered, a single occurrence permanently degrades the wake-up signaling for that port, manifesting as occasional dropped/delayed messages that look like upstream timing flakes.

Suggested fix shape

Same shape as :885 and :922:

if (nxt_slow_path(smsg == NULL)) {
    nxt_alert(...);
    nxt_atomic_fetch_add(&queue->nitems, -1);
    return;
}
...
if (nxt_slow_path(smsg->size != 0)) {
    nxt_alert(...);
    nxt_atomic_fetch_add(&queue->nitems, -1);
    return;
}
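
The invariant behind this fix shape (one decrement for the entry increment on every return path) can be checked in isolation with C11 atomics. This is a toy model: toy_queue_t and toy_read_handler are hypothetical, and the two boolean flags merely stand in for the smsg allocation failure and the "too many suspend messages" condition.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical queue carrying only the reader counter. */
typedef struct {
    atomic_int  nitems;
} toy_queue_t;

/* Increments on entry and decrements exactly once on every return path. */
static bool
toy_read_handler(toy_queue_t *q, bool alloc_fails, bool slot_busy)
{
    assert(q != NULL);

    atomic_fetch_add(&q->nitems, 1);

    if (alloc_fails) {
        /* models the smsg allocation failure path */
        atomic_fetch_sub(&q->nitems, 1);
        return false;
    }

    if (slot_busy) {
        /* models the "too many suspend messages" path */
        atomic_fetch_sub(&q->nitems, 1);
        return false;
    }

    atomic_fetch_sub(&q->nitems, 1);

    return true;
}
```

After any sequence of calls the counter should be back at its starting value, which is the property a regression test would assert.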

Scope question

Pre-existing in upstream — predates PR #8 and PR #12 by years. Two options:

1. Fold into PR #8 (same subsystem, same accounting class, same fix shape). Adds ~4 lines.
2. Separate follow-up PR, which keeps PR #8 narrowly focused on the cert/script/conf-store IPC and lets this land independently with its own deterministic regression discussion.

I lean (1) since it's literally the same accounting pattern PR #8 establishes and your branch already owns this file. Happy to push the diff if you prefer.

Generated by Claude Code

Some upstreams (Gitea, strict HTTP/1.1 backends) reject Transfer-Encoding: chunked and require Content-Length on forwarded requests. Buffer the chunked body, compute its length, and emit Content-Length while skipping the original TE field via field->skip.

Also fix a buffer-stall in nxt_h1p_conn_request_body_read: nxt_http_chunk_parse leaves b->mem.pos unchanged on the CHUNK_MIDDLE path, so the next nxt_conn_read had zero space and the connection hung on bodies larger than body_buffer_size. Reset/compact the buffer between continuations.

Test infrastructure:

- fake_upstream: Rust HTTP mock with requires-cl, no-te, strict, echo modes for deterministic backend-behavior tests
- run-local-temp.sh: fast dev runner (direct mount, no rsync)
- run-local.sh: auto-enable clang-ast on C changes, build fake_upstream into image

Closes nginx#58
Refs nginx#445, nginx#1088, nginx#1278

Idiomatic upper bound from header buffer end instead of pre-padded p + NXT_OFF_T_LEN. Same safety, less coupling.

Refs nginx#58

test(fake_upstream): handle RFC 9112 chunk extensions

Strip chunk-ext (5;ext=val) before hex parse so a future client emitting extensions does not hit unwrap_or(0).

build(test): split clang-ast into run-local-full.sh

run-local.sh --clang-ast was broken: Xclang plugin loaded during ./configure feature-detection trips "no atomic operations found" before make even starts. Move clang-ast to a dedicated runner with its own image (freeunit-test-full:local, debian:testing for clang 21). run-local.sh now focuses on the test fast-path only and auto-prefixes bare pytest node-ids with test/. run-local-temp.sh dropped; same fast-path is in run-local.sh. TODO.md split clang-ast status: OpenSSL 1.1 PASSED, 3.6 TBD.

The 4 fake_upstream-dependent tests in test_proxy_chunked.py failed in CI with FileNotFoundError: '/usr/local/bin/fake_upstream' because the Rust mock binary was never built. Install rust + cargo build for the unit and python-3.x matrix jobs that actually run these tests. Also add a module-level skipif marker so local runs without a Rust toolchain skip the 4 affected tests instead of failing hard.

Large chunked requests (>16KB) can span multiple buffers. Previous code only counted first buffer in r->body chain → upstream received truncated body with wrong Content-Length. Iterate over full chain matching pattern in nxt_router.c:5709. Refs nginx#58
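
Summing over the whole chain, rather than reading only the first buffer, can be sketched as follows. toy_buf_t is a hypothetical simplification with just a data window and a next pointer; it is not Unit's nxt_buf_t, and chain_length is an illustrative name.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical simplified buffer: [pos, free) holds the data. */
typedef struct toy_buf_s {
    const char        *pos;
    const char        *free;
    struct toy_buf_s  *next;
} toy_buf_t;

/* Total number of body bytes across every buffer in the chain. */
static size_t
chain_length(const toy_buf_t *b)
{
    size_t  total = 0;

    for ( ; b != NULL; b = b->next) {
        assert(b->free >= b->pos);
        total += (size_t) (b->free - b->pos);
    }

    return total;
}
```

A Content-Length derived from this sum covers bodies that span several buffers, whereas counting only the head buffer would under-report them.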

The cert/script-store IPC pattern and several main-process reply paths have latent leaks reachable when nxt_port_msg_alloc() (or the RPC stream-id pool) hits malloc failure inside the port machinery.

Sender side (nxt_cert_store_get, nxt_script_store_get): nxt_mp_retain(mp) was issued before nxt_port_socket_write(), so any failure path between the retain and a successful send left the pool with a reference that was never released, because the buffer's completion handler (which is what invokes nxt_mp_release) never ran. Moved the retain to after the buffer is handed off to the port machinery so failure paths above no longer pin the pool.

Reply / main-process side: cert_store_get_handler, script_store_get_handler, main_port_socket_handler, main_port_access_log_handler, and nxt_controller_conf_store all called nxt_port_socket_write() with an owned fd (or a buffer in the engine memory pool) and either ignored the return value with a (void) cast or skipped the failure branch entirely. On non-OK return the port layer never takes ownership, so the fd was leaked and the buffer's completion handler never ran, in the privileged main process. Each site now closes the fd explicitly and (where applicable) queues the buffer completion onto the engine fast work queue so the engine pool reclaims memory.

Use nxt_fd_close() rather than nxt_file_close() on the cert/script error paths: file.name has already been freed and the latter would dereference it through "%FN" on a close-failure log path.

Also documents the ownership contract over nxt_port_socket_write2() in src/nxt_port.h: on NXT_OK, ownership of fd, fd2, and b transfers to the port layer; on any other return, the caller retains ownership and is responsible for closing fd/fd2 and dispatching b's completion handler.

Addresses phpclub's review ask on freeunitorg#56.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Alexandr Smirnov and others added 4 commits May 7, 2026 22:00

Merge pull request nginx#54 from freeunitorg/issues/28

dc37532

fix(tls): stop SSL_write busy-loop on peer-initiated close

gemini-code-assist Bot reviewed May 7, 2026

Comment thread src/nxt_cert.c

Comment thread src/nxt_script.c

Comment thread src/nxt_cert.c

Comment thread src/nxt_main_process.c

Comment thread src/nxt_script.c

andypost force-pushed the claude/port-ipc-completion-leaks branch from ce07e90 to 2372fdb on May 7, 2026 23:09

gemini-code-assist Bot reviewed May 7, 2026

Comment thread src/nxt_main_process.c Outdated

andypost force-pushed the claude/port-ipc-completion-leaks branch 2 times, most recently from 38ba4df to 5a9f37d on May 7, 2026 23:23

andypost changed the title ~~fix(port): plug mp-pool retain and fd leaks in cert/script/conf-store IPC~~ fix(port): plug mp-pool retain and fd/buffer leaks in IPC reply paths May 7, 2026

andypost mentioned this pull request May 7, 2026

feat(lifecycle): plumb NXT_QUIT_GRACEFUL through SIGQUIT (P1) #7

Draft

gemini-code-assist Bot reviewed May 7, 2026

andypost force-pushed the claude/port-ipc-completion-leaks branch from 5a9f37d to c88ede1 on May 7, 2026 23:51

andypost added the bug label (Something isn't working) May 8, 2026

andypost marked this pull request as draft May 8, 2026 00:26

This was referenced May 8, 2026

docs(roadmap): plans for IPC hardening pass + malloc-injection harness #9

Open

docs(security): add 14-vector audit + remediation tracker #10

Open

andypost mentioned this pull request May 8, 2026

fix(io): generalize PR #54 write-path contract to non-TLS sites (P3) #12

Open

andypost mentioned this pull request May 8, 2026

fix(security): tighten privilege boundaries on cgroup/mount/IPC/control-socket #14

Open

Update version

af6ee08

a.smirnov and others added 6 commits May 18, 2026 00:39

fix commit

57f87ed

andypost force-pushed the claude/port-ipc-completion-leaks branch from df6b00e to 71fcd89 on May 18, 2026 16:40

fix(port): plug mp-pool retain and fd/buffer leaks in IPC reply paths #8
andypost wants to merge 11 commits into master from claude/port-ipc-completion-leaks

3 participants
