docs(roadmap): plans for IPC hardening pass + malloc-injection harness#9
docs(roadmap): plans for IPC hardening pass + malloc-injection harness#9andypost wants to merge 2 commits into
Conversation
…rness Two self-contained implementation plans for the natural follow-ups to PRs #6 (TLS OCSP stapling) and #8 (cert/script IPC retain/fd leak fix): * roadmap/plan-ipc-hardening.md — consolidate the path-join, send-failure, and mp-pool retain audits that were declined as out-of-scope on PR #6 and PR #8. Concrete site list, command bag, ~2 days estimate. * roadmap/plan-malloc-injection.md — LD_PRELOAD fault-injection harness so the leaks fixed by PR #6 / PR #8 stop being review-only verified. Three first-consumer tests cover both PRs' leak paths. Design notes on shim layout, pytest fixture, CI workflow, ~5–6 day estimate. * roadmap/README.md — index so the folder is discoverable in isolation. These docs only; no source changes. Each plan is self-contained so a fresh session can pick one up and execute without prior context.
There was a problem hiding this comment.
Code Review
This pull request introduces a roadmap directory with detailed implementation plans for IPC layer hardening and a malloc failure injection test harness. The IPC plan aims to prevent resource leaks during communication failures and improve path resolution safety. The test harness plan outlines an LD_PRELOAD-based shim for simulating allocation failures. Review feedback identifies a need to align the IPC error handling order with existing logic, resolve contradictions in the injection shim's symbol coverage, and address technical concerns regarding stack-walking fragility and socket path consistency in the test fixture.
| 4. Both → do both, completion first (matches `nxt_port_error_handler` | ||
| ordering; see PR #8 commit `5a9f37d`). |
There was a problem hiding this comment.
The plan suggests running the completion handler before closing the file descriptor to match nxt_port_error_handler. However, in the current implementation of nxt_port_error_handler in src/nxt_port_socket.c (lines 1361-1371), nxt_port_msg_close_fd(msg) is called before the completion handlers are added to the work queue. The plan should be updated to maintain consistency with the existing error handling logic.
| 4. Both → do both, completion first (matches `nxt_port_error_handler` | |
| ordering; see PR #8 commit `5a9f37d`). | |
| 4. Both → do both, FD first (matches `nxt_port_error_handler` | |
| ordering; see `src/nxt_port_socket.c:1361`). |
| For PR scope: ship only `malloc` + `posix_memalign`. The rest can come as | ||
| needed. |
There was a problem hiding this comment.
There is a contradiction between the "Symbol coverage" section and the examples provided earlier in the document. The coverage section states that only malloc and posix_memalign will be shipped, but the examples in the "Goal" and "Counter semantics" sections use nxt_port_msg_alloc as a target symbol. To support direct targeting of nxt_port_msg_alloc without complex stack walking on every malloc call, it should be included in the wrapped symbols.
| For PR scope: ship only `malloc` + `posix_memalign`. The rest can come as | |
| needed. | |
| For PR scope: ship `malloc`, `posix_memalign`, and `nxt_port_msg_alloc` wrappers. | |
| The rest can come as needed. |
| stack with `__builtin_return_address(1)` and resolve via `dladdr()`; if | ||
| the symbol matches, count and possibly fail. This costs ~1 µs per call |
There was a problem hiding this comment.
Using a fixed frame offset of 1 with __builtin_return_address is likely too fragile for Unit's architecture. Most allocations in the codebase are wrapped through multiple layers (e.g., nxt_mp_alloc -> nxt_malloc -> malloc). A frame index of 1 will often point to an internal Unit wrapper rather than the logical call site intended for targeting. The shim should ideally walk the stack to find the target symbol or support a configurable search depth.
| been started under LD_PRELOAD=build/malloc_inject.so (handled by the | ||
| `unit` fixture when the test is decorated with @pytest.mark.malloc_inject). | ||
| """ | ||
| sock = tmp_path / 'malloc-inject.sock' |
There was a problem hiding this comment.
The sock variable is defined in the fixture snippet but is not used in the subsequent _activate function or _send_to_shim call. Additionally, if the shim expects a control socket at /tmp/malloc-inject-<pid>.sock as stated in the design, using tmp_path (which is a per-test directory) might cause a mismatch unless the MALLOC_INJECT_CONTROL environment variable is also dynamically updated in the unit fixture.
Gemini code-assist (PR #9) flagged four issues; all fixed: - plan-ipc-hardening.md classify-step #4 had completion-first ordering; nxt_port_error_handler at src/nxt_port_socket.c:1361 actually closes the fd before queueing completion handlers. Reordered to FD-first with the canonical line citation. - plan-malloc-injection.md symbol-coverage list contradicted the design intent. Clarified that nxt_port_msg_alloc and other Unit-level helpers are reached via the stack-walk filter, not via direct wrappers; and bumped mmap from "future" to v1 ship so audit V11 can use it. - plan-malloc-injection.md stack-walk used a fixed __builtin_return_address(1). Replaced with a configurable-depth walk (default 8, per-target /N override), with rationale that nxt_port_msg_alloc -> nxt_malloc -> malloc puts the logical caller two frames up, not one. Added pre-resolution of target symbols at shim init to dodge dladdr() locks. - plan-malloc-injection.md fixture snippet had an unused tmp_path/sock variable and a control-socket-path mismatch. Dropped the unused arg and documented the /tmp/malloc-inject-<pid>.sock convention plus the unit-fixture's responsibility to export MALLOC_INJECT_CONTROL. Cross-link to the wider security audit (gist andypost/e04a4a642e168de2b8435a593f03b84b): - README.md gets a "See also" pointing at the audit and explaining these plans sit outside the audit's PR-A..PR-I tracker (they're follow-ups to the audit's "Known/Already-Fixed" precedent, PR nginx#56). - plan-ipc-hardening.md "Out of scope" now calls out audit slot PR-E (general FD-lifetime hygiene) so reviewers don't ask why it isn't rolled in. - plan-malloc-injection.md "Suggested follow-on uses" lists audit V11 (compression mmap FD leak) as the natural second consumer once the mmap wrapper ships. No source code touched.
Summary
Two self-contained implementation plans for the natural Tier 2 follow-ups to PRs #6 (TLS OCSP stapling) and #8 (cert/script IPC leak fix). Docs only — no source changes. Designed so a fresh session can pick either one up and execute without prior context from the PR #6 / #8 review threads.
Files
roadmap/plan-ipc-hardening.md— IPC layer cleanup passConsolidates the items declined as out-of-scope on PR #6 and PR #8:
nxt_mp_retainaudit — verify no remaining sites repeat the pre-fix shape.(void) nxt_port_socket_write(..., file.fd, ...)site that shipsNXT_PORT_MSG_CLOSE_FD.(void) nxt_port_socket_write(..., b)site whoseb->completion_handlerreleases a refcount-bearing resource./" assumption onrt->{certs,scripts}.start.Includes the exact site lists from the PR #6 / PR #8 audits, scope-out callouts, test plan, and a quick-reference command bag for the next session. ~2 days estimated.
roadmap/plan-malloc-injection.md— fault-injection harnessLD_PRELOADshim + pytest fixture so the leaks fixed by PR #6 and PR #8 are regression-fenced (currently they're review-verified only because the trigger requiresmalloc()to fail, which CI can't drive).Design notes cover:
LD_PRELOADand not__malloc_hook(deprecated in glibc 2.34+) or build-time wrappers.MALLOC_INJECT_TARGETS=malloc@nxt_port_msg_alloc:1).tools/malloc_inject/), pytest fixture wiring, and a separate CI workflow.~5–6 days estimated. Designed to be incremental — once shipped, future PRs can add new fault-injection consumers without re-litigating the harness design.
roadmap/README.mdIndex so the folder is discoverable in isolation.
Suggested execution order
freeunitorg/freeunit.plan-ipc-hardeningPR opens against post-merge master (so the audit operates on canonical line numbers).plan-malloc-injectionPR opens — backfills regression coverage for the leak fixes plus the new IPC-hardening fixes.Each plan PR can be picked up by a fresh Claude session via
/loopor invoked directly; both prompts are self-contained.Test plan
git diff origin/master --statshows onlyroadmap/additions).Out of scope
Actually implementing either plan. That's the next session's job.
Generated by Claude Code
Generated by Claude Code