docs(roadmap): plans for IPC hardening pass + malloc-injection harness #9

Open
andypost wants to merge 2 commits into master from claude/plan-ipc-hardening-tier2
Owner

@andypost andypost commented May 8, 2026

Summary

Two self-contained implementation plans for the natural Tier 2 follow-ups to PRs #6 (TLS OCSP stapling) and #8 (cert/script IPC leak fix). Docs only — no source changes. Designed so a fresh session can pick either one up and execute without prior context from the PR #6 / #8 review threads.

Files

roadmap/plan-ipc-hardening.md — IPC layer cleanup pass

Consolidates the items declined as out-of-scope on PR #6 and PR #8:

  • Sender-side nxt_mp_retain audit — verify no remaining sites repeat the pre-fix shape.
  • Receiver-side fd-close-on-send-failure audit — every (void) nxt_port_socket_write(..., file.fd, ...) site that ships NXT_PORT_MSG_CLOSE_FD.
  • Buffer-completion-on-send-failure audit — every (void) nxt_port_socket_write(..., b) site whose b->completion_handler releases a refcount-bearing resource.
  • Path-join helper for cert/script/OCSP store handlers (Gemini PR #6 finding 3, declined). Defends against the implicit assumption that rt->{certs,scripts}.start ends with "/".
  • Leaf-name validation at the controller→main-process trust boundary — defense in depth.

Includes the exact site lists from the PR #6 / PR #8 audits, scope-out callouts, test plan, and a quick-reference command bag for the next session. ~2 days estimated.
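
The path-join and leaf-name items above can be illustrated with a minimal sketch. It is written in Python purely to model the logic; the real helper would be C inside Unit, and the names `is_valid_leaf` and `join_store_path` are hypothetical, not taken from the plan.

```python
def is_valid_leaf(name: str) -> bool:
    """Accept only a single path component (hypothetical rule set)."""
    if not name or name in (".", ".."):
        return False
    # Reject separators and embedded NULs so the name cannot escape the store.
    return "/" not in name and "\0" not in name


def join_store_path(store_dir: str, leaf: str) -> str:
    """Join a store directory and a leaf name without assuming a trailing '/'."""
    if not store_dir:
        raise ValueError("store directory must not be empty")
    if not is_valid_leaf(leaf):
        raise ValueError(f"invalid leaf name: {leaf!r}")
    # Insert the separator explicitly instead of relying on the caller.
    return store_dir.rstrip("/") + "/" + leaf


print(join_store_path("/var/lib/unit/certs", "bundle.pem"))
print(join_store_path("/var/lib/unit/certs/", "bundle.pem"))
```

Both calls yield the same path, which is the property the helper is meant to guarantee regardless of how the store directory was configured.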

roadmap/plan-malloc-injection.md — fault-injection harness

LD_PRELOAD shim + pytest fixture so the leaks fixed by PR #6 and PR #8 are regression-fenced (currently they're review-verified only because the trigger requires malloc() to fail, which CI can't drive).

Design notes cover:

  • Why LD_PRELOAD and not __malloc_hook (long deprecated, and removed in glibc 2.34) or build-time wrappers.
  • Per-symbol + per-call-site targeting via stack-walk filter (MALLOC_INJECT_TARGETS=malloc@nxt_port_msg_alloc:1).
  • File layout (tools/malloc_inject/), pytest fixture wiring, and a separate CI workflow.
  • Three first-consumer tests covering both PRs' leak paths exactly.
  • Risk callouts (setuid, thread safety, dlsym bootstrap, glibc-vs-jemalloc).
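
As a rough illustration of the targeting syntax, the sketch below parses a spec such as `malloc@nxt_port_msg_alloc:1`. Only that one example appears in this PR; the comma-separated list form, the `Target` type, and the reading of the trailing number as "fail the Nth matching call" are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    symbol: str   # wrapped libc symbol, e.g. "malloc"
    caller: str   # call-site function matched by the stack-walk filter
    nth: int      # ASSUMED meaning: which matching call should fail


def parse_targets(spec: str) -> list:
    """Parse 'symbol@caller:N' items; comma separation is an assumption."""
    targets = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        symbol, at, rest = item.partition("@")
        caller, colon, count = rest.partition(":")
        if not (symbol and at and caller and colon and count.isdecimal()):
            raise ValueError(f"malformed target: {item!r}")
        targets.append(Target(symbol, caller, int(count)))
    return targets


print(parse_targets("malloc@nxt_port_msg_alloc:1"))
```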

~5–6 days estimated. Designed to be incremental — once shipped, future PRs can add new fault-injection consumers without re-litigating the harness design.

roadmap/README.md

Index so the folder is discoverable in isolation.

Suggested execution order

  1. PR #8 (fix(port): plug mp-pool retain and fd/buffer leaks in IPC reply paths) merges → forwards upstream to freeunitorg/freeunit.
  2. PR #6 (Add TLS OCSP stapling support with certificate store integration) rebases → merges → forwards upstream.
  3. plan-ipc-hardening PR opens against post-merge master (so the audit operates on canonical line numbers).
  4. plan-malloc-injection PR opens — backfills regression coverage for the leak fixes plus the new IPC-hardening fixes.

Each plan PR can be picked up by a fresh Claude session via /loop or invoked directly; both prompts are self-contained.

Test plan

  • Both files render correctly as markdown.
  • No source code touched (git diff origin/master --stat shows only roadmap/ additions).

Out of scope

Actually implementing either plan. That's the next session's job.


Generated by Claude Code



…rness

Two self-contained implementation plans for the natural follow-ups to
PRs #6 (TLS OCSP stapling) and #8 (cert/script IPC retain/fd leak fix):

* roadmap/plan-ipc-hardening.md — consolidate the path-join, send-failure,
  and mp-pool retain audits that were declined as out-of-scope on PR #6
  and PR #8.  Concrete site list, command bag, ~2 days estimate.

* roadmap/plan-malloc-injection.md — LD_PRELOAD fault-injection harness
  so the leaks fixed by PR #6 / PR #8 stop being review-only verified.
  Three first-consumer tests cover both PRs' leak paths.  Design notes
  on shim layout, pytest fixture, CI workflow, ~5–6 day estimate.

* roadmap/README.md — index so the folder is discoverable in isolation.

These docs only; no source changes.  Each plan is self-contained so a
fresh session can pick one up and execute without prior context.

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a roadmap directory with detailed implementation plans for IPC layer hardening and a malloc failure injection test harness. The IPC plan aims to prevent resource leaks during communication failures and improve path resolution safety. The test harness plan outlines an LD_PRELOAD-based shim for simulating allocation failures. Review feedback identifies a need to align the IPC error handling order with existing logic, resolve contradictions in the injection shim's symbol coverage, and address technical concerns regarding stack-walking fragility and socket path consistency in the test fixture.

Comment thread roadmap/plan-ipc-hardening.md Outdated
Comment on lines +101 to +102
4. Both → do both, completion first (matches `nxt_port_error_handler`
ordering; see PR #8 commit `5a9f37d`).

medium

The plan suggests running the completion handler before closing the file descriptor to match nxt_port_error_handler. However, in the current implementation of nxt_port_error_handler in src/nxt_port_socket.c (lines 1361-1371), nxt_port_msg_close_fd(msg) is called before the completion handlers are added to the work queue. The plan should be updated to maintain consistency with the existing error handling logic.

Suggested change
-4. Both → do both, completion first (matches `nxt_port_error_handler`
-ordering; see PR #8 commit `5a9f37d`).
+4. Both → do both, FD first (matches `nxt_port_error_handler`
+ordering; see `src/nxt_port_socket.c:1361`).
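
The ordering at issue can be modelled in a few lines. This is only a Python sketch of the sequence the review describes (close the fd, then queue the completion handlers); `fail_send` and `fd_is_open` are hypothetical names, and nothing here is Unit's actual code.

```python
import os


def fd_is_open(fd: int) -> bool:
    """Report whether fd currently refers to an open descriptor."""
    try:
        os.fstat(fd)
    except OSError:
        return False
    return True


def fail_send(fd, completion_handlers, work_queue):
    """Model of send-failure cleanup with FD-first ordering."""
    # Step 1: release the descriptor that would have been shipped.
    if fd is not None:
        os.close(fd)
    # Step 2: only then queue the completion handlers for later execution.
    work_queue.extend(completion_handlers)


read_end, write_end = os.pipe()
observed = []
queue = []

# The handler records whether the fd was still open when it ran.
fail_send(write_end, [lambda: observed.append(fd_is_open(write_end))], queue)
for handler in queue:
    handler()
os.close(read_end)

print(observed)
```

Because the handlers are queued rather than run inline, any handler that inspects the descriptor sees it already closed, which is the invariant an FD-first classify step would preserve.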

Comment thread roadmap/plan-malloc-injection.md Outdated
Comment on lines +88 to +89
For PR scope: ship only `malloc` + `posix_memalign`. The rest can come as
needed.

medium

There is a contradiction between the "Symbol coverage" section and the examples provided earlier in the document. The coverage section states that only malloc and posix_memalign will be shipped, but the examples in the "Goal" and "Counter semantics" sections use nxt_port_msg_alloc as a target symbol. To support direct targeting of nxt_port_msg_alloc without complex stack walking on every malloc call, it should be included in the wrapped symbols.

Suggested change
-For PR scope: ship only `malloc` + `posix_memalign`. The rest can come as
-needed.
+For PR scope: ship `malloc`, `posix_memalign`, and `nxt_port_msg_alloc` wrappers.
+The rest can come as needed.

Comment thread roadmap/plan-malloc-injection.md Outdated
Comment on lines +102 to +103
stack with `__builtin_return_address(1)` and resolve via `dladdr()`; if
the symbol matches, count and possibly fail. This costs ~1 µs per call

medium

Using a fixed frame offset of 1 with __builtin_return_address is likely too fragile for Unit's architecture. Most allocations in the codebase are wrapped through multiple layers (e.g., nxt_mp_alloc -> nxt_malloc -> malloc). A frame index of 1 will often point to an internal Unit wrapper rather than the logical call site intended for targeting. The shim should ideally walk the stack to find the target symbol or support a configurable search depth.
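
To make the fragility concrete, here is a small Python analogue of a bounded stack walk. It uses interpreter frames in place of `__builtin_return_address` and `dladdr()`, so it demonstrates only the search strategy, not the shim; all function names are hypothetical stand-ins for the `nxt_port_msg_alloc -> nxt_malloc -> malloc` chain.

```python
import sys


def matching_caller(targets, max_depth=8):
    """Return the first caller name found in targets within max_depth frames."""
    frame = sys._getframe(1)  # frame of whoever called matching_caller
    for _ in range(max_depth):
        if frame is None:
            break
        if frame.f_code.co_name in targets:
            return frame.f_code.co_name
        frame = frame.f_back
    return None


def malloc_stub(max_depth):
    # Stand-in for the wrapped malloc deciding whether to inject a failure.
    return matching_caller({"port_msg_alloc_stub"}, max_depth)


def nxt_malloc_stub(max_depth):
    return malloc_stub(max_depth)


def port_msg_alloc_stub(max_depth=8):
    return nxt_malloc_stub(max_depth)


def unrelated_caller(max_depth=8):
    return nxt_malloc_stub(max_depth)


print(port_msg_alloc_stub(max_depth=8))  # target is within reach
print(port_msg_alloc_stub(max_depth=2))  # too shallow: stops at the wrapper
print(unrelated_caller())                # target never on the stack
```

A shallow fixed depth stops at the intermediate wrapper and never sees the logical call site, while a bounded walk finds it without scanning the whole stack.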

Comment thread roadmap/plan-malloc-injection.md Outdated
been started under LD_PRELOAD=build/malloc_inject.so (handled by the
`unit` fixture when the test is decorated with @pytest.mark.malloc_inject).
"""
sock = tmp_path / 'malloc-inject.sock'

medium

The sock variable is defined in the fixture snippet but is not used in the subsequent _activate function or _send_to_shim call. Additionally, if the shim expects a control socket at /tmp/malloc-inject-<pid>.sock as stated in the design, using tmp_path (which is a per-test directory) might cause a mismatch unless the MALLOC_INJECT_CONTROL environment variable is also dynamically updated in the unit fixture.
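
One way to keep both sides agreeing on the socket location is to compute the path once and export it into the daemon's environment. The sketch below assumes that `MALLOC_INJECT_CONTROL` carries the control-socket path, which this PR does not spell out; `control_socket_path` and `shim_env` are hypothetical helpers.

```python
def control_socket_path(pid: int) -> str:
    """Default location per the /tmp/malloc-inject-<pid>.sock convention."""
    return f"/tmp/malloc-inject-{pid}.sock"


def shim_env(base_env: dict, shim_path: str, control_path: str = None) -> dict:
    """Build the environment for launching the daemon under the shim."""
    env = dict(base_env)  # copy so the caller's environment is untouched
    env["LD_PRELOAD"] = shim_path
    if control_path is not None:
        # ASSUMPTION: the shim reads its socket path from this variable.
        env["MALLOC_INJECT_CONTROL"] = control_path
    return env


env = shim_env({"PATH": "/usr/bin"}, "build/malloc_inject.so",
               "/tmp/per-test/malloc-inject.sock")
print(env["LD_PRELOAD"], env["MALLOC_INJECT_CONTROL"])
print(control_socket_path(1234))
```

With the override unset, the fixture would fall back to the pid-based path after the daemon has started and its pid is known.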

Owner Author

andypost commented May 8, 2026

Gemini code-assist (PR #9) flagged four issues; all fixed:

- plan-ipc-hardening.md classify-step #4 had completion-first ordering;
  nxt_port_error_handler at src/nxt_port_socket.c:1361 actually closes
  the fd before queueing completion handlers. Reordered to FD-first
  with the canonical line citation.
- plan-malloc-injection.md symbol-coverage list contradicted the design
  intent. Clarified that nxt_port_msg_alloc and other Unit-level helpers
  are reached via the stack-walk filter, not via direct wrappers; and
  bumped mmap from "future" to v1 ship so audit V11 can use it.
- plan-malloc-injection.md stack-walk used a fixed __builtin_return_address(1).
  Replaced with a configurable-depth walk (default 8, per-target /N
  override), with rationale that nxt_port_msg_alloc -> nxt_malloc ->
  malloc puts the logical caller two frames up, not one. Added
  pre-resolution of target symbols at shim init to dodge dladdr() locks.
- plan-malloc-injection.md fixture snippet had an unused tmp_path/sock
  variable and a control-socket-path mismatch. Dropped the unused arg
  and documented the /tmp/malloc-inject-<pid>.sock convention plus the
  unit-fixture's responsibility to export MALLOC_INJECT_CONTROL.

Cross-link to the wider security audit
(gist andypost/e04a4a642e168de2b8435a593f03b84b):

- README.md gets a "See also" pointing at the audit and explaining
  these plans sit outside the audit's PR-A..PR-I tracker (they're
  follow-ups to the audit's "Known/Already-Fixed" precedent, PR nginx#56).
- plan-ipc-hardening.md "Out of scope" now calls out audit slot PR-E
  (general FD-lifetime hygiene) so reviewers don't ask why it isn't
  rolled in.
- plan-malloc-injection.md "Suggested follow-on uses" lists audit V11
  (compression mmap FD leak) as the natural second consumer once the
  mmap wrapper ships.

No source code touched.