Skip to content

feat(lifecycle): plumb NXT_QUIT_GRACEFUL through SIGQUIT (P1)#7

Draft
andypost wants to merge 1 commit into
masterfrom
p1-graceful-signal-split
Draft

feat(lifecycle): plumb NXT_QUIT_GRACEFUL through SIGQUIT (P1)#7
andypost wants to merge 1 commit into
masterfrom
p1-graceful-signal-split

Conversation

@andypost
Copy link
Copy Markdown
Owner

@andypost andypost commented May 7, 2026

Summary

Phase 1 of the graceful-shutdown / graceful-reload plan documented in roadmap/plan-graceful-shutdown.md on the roadmap branch. Splits the previously-identical SIGTERM and SIGQUIT handlers in src/nxt_main_process.c and plumbs the choice as a quit_param byte through NXT_PORT_MSG_QUIT so libunit's already-implemented nxt_unit_quit(NXT_QUIT_GRACEFUL) actually gets invoked.

What changes

  • src/nxt_main_process.c — SIGTERM and SIGQUIT handlers now diverge: SIGTERM stores rt->quit_mode = NXT_PORT_QUIT_NORMAL; SIGQUIT stores NXT_PORT_QUIT_GRACEFUL. The /* TODO: fast exit */ and /* TODO: graceful exit */ comments are gone.
  • src/nxt_runtime.h — adds uint8_t quit_mode next to other small flags (no struct bloat).
  • src/nxt_runtime.c — new nxt_runtime_quit_buf() helper allocates a one-byte buffer carrying rt->quit_mode. Both nxt_runtime_stop_app_processes() and nxt_runtime_stop_all_processes() attach it to the QUIT message.
  • src/nxt_port.h — promotes the QUIT mode selector to a public enum nxt_port_quit_mode_t next to NXT_PORT_MSG_QUIT itself.
  • src/nxt_unit.c — local NXT_QUIT_NORMAL/NXT_QUIT_GRACEFUL are now #define aliases of the public enum so the 10+ in-tree call sites don't churn and a compile-time mismatch is structurally impossible.
  • test/test_graceful_reload.py — new test file with three tests (one is a skipped placeholder for C-level instrumentation).

Wire-format compatibility

libunit at nxt_unit.c:1056-1070 already parses this exact wire format and falls back to NXT_PORT_QUIT_NORMAL when the message arrives without a payload. Pre-P1 senders (and OOM-degraded sends where the buf alloc fails) therefore produce the safe fast-exit behaviour. Audit of all 10 in-tree NXT_PORT_MSG_QUIT call sites: 2 plumbed (the main signal cascade), 7 left NULL with rationale (idle-shrink, app-free, restart paths that belong to later phases — see commit message of 064fa68).

Tests

./configure --tests --modules=python && ./configure python --config=python3-config && make -j$(nproc)
python3 -m pytest test/test_graceful_reload.py --restart -v
  => 2 passed, 1 skipped (C-instrumentation placeholder)
python3 -m pytest test/test_idle_close_wait.py --restart -v
  => 2 passed
python3 -m pytest test/test_procman.py --restart
  => 11 passed, 2 skipped, 1 pre-existing flake (test_python_restart_longstart, unrelated to P1)

The test_sigterm_drops_inflight_request test asserts on positive log evidence — the "active request on ctx quit" warning at nxt_unit.c:5816 fires only on the NORMAL fast-exit branch when active_req is non-empty. A regression to GRACEFUL on SIGTERM would never emit that marker and the assertion would fail (instead of silently passing).

--restart is required because the tests signal the unitd master directly; an autouse fixture in the test file skips with an actionable message when the flag is missing.

Commits

  1. feat(lifecycle): plumb NXT_QUIT_GRACEFUL through SIGQUIT — initial implementation.
  2. fixup(lifecycle): tighten P1 review nits — promote magic 0/1 to a named enum in nxt_port.h; close a regression-detection hole in test 2 by replacing pytest.skip with positive log-evidence assertion; bump INFLIGHT_DELAY 3 → 5 s for slack on fast machines.

Out of scope

This PR is P1 only. P2 (listener drain), P3 (write-path D′), P4 (engine teardown), P5 (connection drain), P6 (X3 POST /reload endpoint), and P7 (per-language hooks) are queued behind it per the plan. Each will land as its own PR.

The 7 untouched NXT_PORT_MSG_QUIT call sites in nxt_router.c and nxt_application.c are deferred — most belong to P6 (graceful reload via the existing app_restart machinery) where they should propagate NXT_PORT_QUIT_GRACEFUL once the lifecycle skeleton supports it.


Generated by Claude Code

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements the first phase of a graceful shutdown plan by differentiating between SIGTERM and SIGQUIT signals. SIGQUIT now triggers a graceful exit by setting the runtime's quit mode to NXT_PORT_QUIT_GRACEFUL, which is communicated to workers via a new payload in the QUIT message, while SIGTERM remains the fast-exit path. The changes include updates to signal handlers, runtime structures, and the wire protocol, supported by new integration tests. Feedback was provided to optimize the quit buffer allocation by skipping it when the mode is NXT_PORT_QUIT_NORMAL, as the receiver already defaults to that value.

Comment thread src/nxt_runtime.c
andypost pushed a commit that referenced this pull request May 7, 2026
Phase 1 of roadmap/plan-graceful-shutdown.md (on the `roadmap`
branch).  Splits the previously-identical SIGTERM and SIGQUIT
handlers in src/nxt_main_process.c and plumbs the choice as a
quit_param byte through NXT_PORT_MSG_QUIT so libunit's
already-implemented nxt_unit_quit(NXT_QUIT_GRACEFUL) actually
gets invoked.

Wire format
-----------
A new public enum nxt_port_quit_mode_t in nxt_port.h sits next to
NXT_PORT_MSG_QUIT itself:

    NXT_PORT_QUIT_NORMAL   = 0   /* fast exit, drop in-flight */
    NXT_PORT_QUIT_GRACEFUL = 1   /* drain in-flight before exit */

libunit at nxt_unit.c:1056-1070 already parses this byte; it falls
back to NXT_PORT_QUIT_NORMAL when the message arrives without a
payload.  We therefore send NO payload on the fast-exit path and
exactly one byte on the graceful path -- both directions of the
asymmetry are the safe ones:

  * Pre-P1 senders (and the seven NXT_PORT_MSG_QUIT call sites
    deliberately left NULL below) keep producing the safe NORMAL
    behaviour with no wire change.
  * Allocation failure under GRACEFUL silently degrades to NORMAL,
    which is what one wants under memory pressure anyway.
  * SIGTERM, the fast-exit path by definition, performs zero
    additional allocations on its way to nxt_runtime_quit().

Avoiding the allocation on the NORMAL path is gemini-code-assist's
review suggestion on PR #7.

src/ changes
------------
* src/nxt_main_process.c -- SIGTERM now sets
  rt->quit_mode = NXT_PORT_QUIT_NORMAL; SIGQUIT sets
  NXT_PORT_QUIT_GRACEFUL.  The /* TODO: fast exit */ and
  /* TODO: graceful exit */ comments are gone.

* src/nxt_runtime.h -- new uint8_t quit_mode next to other small
  flags (no struct bloat).  Documented as a nxt_port_quit_mode_t.

* src/nxt_runtime.c -- new nxt_runtime_quit_buf() helper returns
  NULL for NORMAL (no allocation) and a one-byte mem_pool buffer
  for GRACEFUL.  Both nxt_runtime_stop_app_processes() and
  nxt_runtime_stop_all_processes() attach it to the QUIT message.

* src/nxt_port.h -- promotes nxt_port_quit_mode_t to a public enum
  alongside NXT_PORT_MSG_QUIT.

* src/nxt_unit.c -- existing local NXT_QUIT_NORMAL / NXT_QUIT_GRACEFUL
  identifiers (used at 10+ libunit call sites) become #define
  aliases of the public names so a compile-time mismatch between
  the daemon-side and libunit-side values is structurally
  impossible -- the preprocessor substitutes the same enum value
  into every reference.  No churn at the call sites.

NXT_PORT_MSG_QUIT call-site audit
---------------------------------
  src/nxt_runtime.c:511      stop_app_processes      plumbed
  src/nxt_runtime.c:533      stop_all_processes      plumbed
  src/nxt_application.c:704  proto_quit_children     NULL (defaults NORMAL)
  src/nxt_main_process.c:1038 orphan reaping         NULL (defensive cleanup)
  src/nxt_router.c:932       prototype replaced      NULL (P6 territory)
  src/nxt_router.c:4536-4600 port-ready handlers     NULL (P6 territory)
  src/nxt_router.c:5043      idle-pool shrink        NULL (NORMAL is right)
  src/nxt_router.c:5142      app-free cleanup        NULL (out of P1 scope)

Phases P2/P5/P6 will revisit the router sites once the listener
drain and reload endpoint exist.

Tests
-----
test/test_graceful_reload.py is new and exercises the plumbing
behaviourally:

  * test_sigquit_completes_inflight_request: SIGQUIT must let an
    ASGI in-flight request complete with status 200.

  * test_sigterm_drops_inflight_request: asserts on positive log
    evidence -- "active request on ctx quit" at nxt_unit.c:5816
    fires only on the NORMAL fast-exit branch when active_req is
    non-empty.  Log.wait_for_record() polls up to 15 s; absence of
    the marker means SIGTERM accidentally took the GRACEFUL path
    (the regression we are guarding against) and the assertion
    fails rather than silently passing.

  * test_quit_message_carries_quit_param is a skipped placeholder
    documenting the wire-format intent; verifying the byte
    directly would require C-level instrumentation.

A module-scoped autouse fixture skips the file with an actionable
message when --restart is missing -- the autouse run fixture in
conftest.py crashes teardown otherwise (PUT /config to a dead
daemon).

Verified
--------
  ./configure --tests --modules=python && ./configure python \
    --config=python3-config && make -j$(nproc)              # clean
  python3 -m pytest test/test_graceful_reload.py --restart  # 2 pass, 1 skip
  python3 -m pytest test/test_idle_close_wait.py --restart  # 2 pass
  python3 -m pytest test/test_procman.py --restart          # 11 pass, 2 skip,
                                                            # 1 pre-existing
                                                            # flake unrelated
                                                            # to P1
@andypost andypost force-pushed the p1-graceful-signal-split branch from 7652aad to 6c547c5 Compare May 7, 2026 22:24
andypost pushed a commit that referenced this pull request May 7, 2026
Phase 1 of roadmap/plan-graceful-shutdown.md (on the `roadmap`
branch).  Splits the previously-identical SIGTERM and SIGQUIT
handlers in src/nxt_main_process.c and plumbs the choice as a
quit_param byte through NXT_PORT_MSG_QUIT so libunit's
already-implemented nxt_unit_quit(NXT_QUIT_GRACEFUL) actually
gets invoked.  The plumbing also propagates through the
prototype -> child cascade so SIGQUIT to the unitd master
delivers GRACEFUL semantics to every libunit context, not just
the children main contacts directly.

Wire format
-----------
A new public enum nxt_port_quit_mode_t in nxt_port.h sits next to
NXT_PORT_MSG_QUIT itself:

    NXT_PORT_QUIT_NORMAL   = 0   /* fast exit, drop in-flight */
    NXT_PORT_QUIT_GRACEFUL = 1   /* drain in-flight before exit */

libunit at nxt_unit.c:1056-1070 already parses this byte; it falls
back to NXT_PORT_QUIT_NORMAL when the message arrives without a
payload.  We therefore send NO payload on the fast-exit path and
exactly one byte on the graceful path -- both directions of the
asymmetry are the safe ones:

  * Pre-P1 senders (and the seven NXT_PORT_MSG_QUIT call sites
    deliberately left NULL in the audit below) keep producing the
    safe NORMAL behaviour with no wire change.
  * Allocation failure under GRACEFUL silently degrades to NORMAL,
    which is what one wants under memory pressure anyway.
  * SIGTERM, the fast-exit path by definition, performs zero
    additional allocations on its way to nxt_runtime_quit().

Avoiding the allocation on the NORMAL path is gemini-code-assist's
review suggestion on PR #7.

Prototype cascade
-----------------
nxt_runtime_stop_all_processes() (called from main on SIGQUIT)
walks rt->processes and sends NXT_PORT_MSG_QUIT to every port,
including each app worker AND the prototype.  The prototype then
runs nxt_proto_quit_handler() which previously cascaded a *second*
QUIT message to its children with a NULL payload -- libunit
defaulted that to NORMAL, creating a race: whichever message
reached a child first decided GRACEFUL vs NORMAL behaviour.  On a
busy or slow box the cascade could win and silently downgrade
SIGQUIT to a fast exit.

The prototype handler now reads the quit_param byte from main's
QUIT message and forwards it through nxt_proto_quit_children()
unchanged.  Both messages now agree, the race is benign, and
GRACEFUL reaches every child regardless of arrival order.

src/ changes
------------
* src/nxt_main_process.c -- SIGTERM now sets
  rt->quit_mode = NXT_PORT_QUIT_NORMAL; SIGQUIT sets
  NXT_PORT_QUIT_GRACEFUL.  The /* TODO: fast exit */ and
  /* TODO: graceful exit */ comments are gone.

* src/nxt_runtime.h -- new uint8_t quit_mode next to other small
  flags (no struct bloat).  Documented as a nxt_port_quit_mode_t.
  Exports nxt_runtime_quit_buf() so nxt_application.c can use the
  same allocator.

* src/nxt_runtime.c -- nxt_runtime_quit_buf(task, quit_param)
  returns NULL for NORMAL (no allocation) and a one-byte
  mem_pool buffer for GRACEFUL.  Both nxt_runtime_stop_app_processes()
  and nxt_runtime_stop_all_processes() call it with rt->quit_mode.

* src/nxt_application.c -- nxt_proto_quit_handler() reads the
  quit_param byte from msg and forwards it via the new
  nxt_proto_quit_children(task, quit_param) signature.  Direct
  signal handler nxt_proto_sigterm_handler() passes
  NXT_PORT_QUIT_NORMAL explicitly: signals to the prototype are
  not the user-initiated lifecycle path (that is main ->
  NXT_PORT_MSG_QUIT) and the historical fast-exit semantics are
  preserved.

* src/nxt_port.h -- promotes nxt_port_quit_mode_t to a public enum
  alongside NXT_PORT_MSG_QUIT.

* src/nxt_unit.c -- existing local NXT_QUIT_NORMAL / NXT_QUIT_GRACEFUL
  identifiers (used at 10+ libunit call sites) become #define
  aliases of the public names so a compile-time mismatch between
  the daemon-side and libunit-side values is structurally
  impossible -- the preprocessor substitutes the same enum value
  into every reference.  No churn at the call sites.

NXT_PORT_MSG_QUIT call-site audit
---------------------------------
  src/nxt_runtime.c:511      stop_app_processes      plumbed (rt->quit_mode)
  src/nxt_runtime.c:533      stop_all_processes      plumbed (rt->quit_mode)
  src/nxt_application.c:716  proto_quit_children     plumbed (cascaded byte)
  src/nxt_main_process.c:1038 orphan reaping         NULL (defensive cleanup)
  src/nxt_router.c:932       prototype replaced      NULL (P6 territory)
  src/nxt_router.c:4536-4600 port-ready handlers     NULL (P6 territory)
  src/nxt_router.c:5043      idle-pool shrink        NULL (NORMAL is right)
  src/nxt_router.c:5142      app-free cleanup        NULL (out of P1 scope)

Phases P5/P6 will revisit the router sites once the listener
drain and reload endpoint exist.

Tests
-----
test/test_graceful_reload.py is new and exercises the plumbing:

  * test_sigquit_completes_inflight_request: SIGQUIT to main must
    let an ASGI in-flight request complete with status 200.
    Implicitly exercises the prototype cascade path (every Unit
    deployment has prototype + child topology, even at the default
    processes:0 lazy-spawn).

  * test_sigterm_drops_inflight_request: asserts on positive log
    evidence -- "active request on ctx quit" at nxt_unit.c:5816
    fires only on the NORMAL fast-exit branch when active_req is
    non-empty.  Log.wait_for_record() polls the unit log up to ~15
    s; absence of the marker means SIGTERM accidentally took the
    GRACEFUL path (the regression we are guarding against) and the
    assertion fails rather than silently passing.

  * test_sigint_takes_normal_path: regression guard against signal
    table edits that re-route SIGINT through the SIGQUIT handler.
    Same positive-log-evidence shape as the SIGTERM test.

  * test_quit_message_carries_quit_param is a skipped placeholder
    documenting the wire-format intent; verifying the byte
    directly would require C-level instrumentation.

A module-scoped autouse fixture skips the file with an actionable
message when --restart is missing -- the autouse run fixture in
conftest.py crashes teardown otherwise (PUT /config to a dead
daemon).

Verified
--------
  ./configure --tests --modules=python && ./configure python \
    --config=python3-config && make -j$(nproc)              # clean
  python3 -m pytest test/test_graceful_reload.py --restart  # 3 pass, 1 skip
  python3 -m pytest test/test_idle_close_wait.py --restart  # 2 pass
  python3 -m pytest test/test_procman.py --restart          # 11 pass, 2 skip,
                                                            # 1 pre-existing
                                                            # flake unrelated
                                                            # to P1
  ASAN build (-fsanitize=address):
    0 leaks attributable to nxt_runtime_quit_buf or any P1 code
    path.  Sole LeakSanitizer report is a 136-byte pre-existing
    leak in nxt_var_index_init (src/nxt_var.c:296), reproducible
    on plain origin/master with no P1 changes -- separate issue.
@andypost andypost force-pushed the p1-graceful-signal-split branch from 6c547c5 to b26a5d5 Compare May 7, 2026 22:56
@andypost
Copy link
Copy Markdown
Owner Author

andypost commented May 7, 2026

/gemini review

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements a graceful shutdown mechanism for the application by splitting the SIGTERM and SIGQUIT signal handlers. SIGQUIT now triggers a graceful drain of in-flight requests, while SIGTERM maintains the existing fast-exit behavior. The implementation introduces a wire-format payload for the QUIT message to communicate the exit mode. A potential memory leak was identified in the buffer allocation logic for the QUIT message, which should be addressed by using a task-safe allocation method to ensure proper memory reclamation.

Comment thread src/nxt_runtime.c
Phase 1 of roadmap/plan-graceful-shutdown.md (on the `roadmap`
branch).  Splits the previously-identical SIGTERM and SIGQUIT
handlers in src/nxt_main_process.c and plumbs the choice as a
quit_param byte through NXT_PORT_MSG_QUIT so libunit's
already-implemented nxt_unit_quit(NXT_PORT_QUIT_GRACEFUL) actually
gets invoked.  The plumbing also propagates through the
prototype -> child cascade so SIGQUIT to the unitd master
delivers GRACEFUL semantics to every libunit context, not just
the children main contacts directly.

Wire format
-----------
A new public enum nxt_port_quit_mode_t in nxt_port.h sits next to
NXT_PORT_MSG_QUIT itself:

    NXT_PORT_QUIT_NORMAL   = 0   /* fast exit, drop in-flight */
    NXT_PORT_QUIT_GRACEFUL = 1   /* drain in-flight before exit */

libunit at nxt_unit.c:1056-1070 already parses this byte; it falls
back to NXT_PORT_QUIT_NORMAL when the message arrives without a
payload.  We therefore send NO payload on the fast-exit path and
exactly one byte on the graceful path -- both directions of the
asymmetry are the safe ones:

  * Pre-P1 senders (and the seven NXT_PORT_MSG_QUIT call sites
    deliberately left NULL in the audit below) keep producing the
    safe NORMAL behaviour with no wire change.
  * Allocation failure under GRACEFUL silently degrades to NORMAL
    with a NXT_LOG_WARN entry so the operator sees that the
    cascade leg fell back to fast exit under memory pressure.
  * SIGTERM, the fast-exit path by definition, performs zero
    additional allocations on its way to nxt_runtime_quit().

Skipping the allocation on the NORMAL path (and thus on the seven
deliberately-NULL call sites) is gemini-code-assist's review
suggestion on the initial PR; this commit lands the squashed
result.

Prototype cascade
-----------------
nxt_runtime_stop_all_processes() (called from main on SIGQUIT)
walks rt->processes and sends NXT_PORT_MSG_QUIT to every port,
including each app worker AND the prototype.  The prototype then
runs nxt_proto_quit_handler() which previously cascaded a *second*
QUIT message to its children with a NULL payload -- libunit
defaulted that to NORMAL, creating a race: whichever message
reached a child first decided GRACEFUL vs NORMAL behaviour.  On a
busy or slow box the cascade could win and silently downgrade
SIGQUIT to a fast exit.

The prototype handler now reads the quit_param byte from main's
QUIT message and forwards it through nxt_proto_quit_children()
unchanged.  Both messages now agree, the race is benign, and
GRACEFUL reaches every child regardless of arrival order.
Unknown payload bytes (anything other than 0 or 1) are normalised
to NORMAL on read so a malformed sender cannot propagate a bogus
byte through the worker pool.

Buffer ownership
----------------
nxt_runtime_quit_buf() returns NULL for NORMAL (no allocation) or
a one-byte mem_pool buffer for GRACEFUL.  All three send sites
(nxt_runtime_stop_app_processes, nxt_runtime_stop_all_processes,
nxt_proto_quit_children) capture the nxt_port_socket_write return
value and explicitly run b->completion_handler when send fails
and b is non-NULL -- otherwise the GRACEFUL payload would leak
from the engine memory pool, the same shape PR #8 fixes for the
cert/script/conf-store IPC paths.

src/ changes
------------
* src/nxt_main_process.c -- SIGTERM now sets
  rt->quit_mode = NXT_PORT_QUIT_NORMAL; SIGQUIT sets
  NXT_PORT_QUIT_GRACEFUL.  The /* TODO: fast exit */ and
  /* TODO: graceful exit */ comments are gone.

* src/nxt_runtime.h -- new uint8_t quit_mode next to other small
  flags (no struct bloat).  Documented as a nxt_port_quit_mode_t.
  Exports nxt_runtime_quit_buf() so nxt_application.c can use the
  same allocator.

* src/nxt_runtime.c -- nxt_runtime_quit_buf(task, quit_param)
  returns NULL for NORMAL (no allocation), one-byte buffer for
  GRACEFUL, and a NXT_LOG_WARN + NULL on alloc failure.  Both
  nxt_runtime_stop_app_processes() and
  nxt_runtime_stop_all_processes() call it with rt->quit_mode and
  release the buffer on send failure.

* src/nxt_application.c -- nxt_proto_quit_handler() reads the
  quit_param byte from msg, normalises unknown values to NORMAL,
  and forwards it via the new
  nxt_proto_quit_children(task, quit_param) signature with the
  same buffer-release-on-failure handling.  Direct signal handler
  nxt_proto_sigterm_handler() passes NXT_PORT_QUIT_NORMAL
  explicitly: signals to the prototype are not the user-initiated
  lifecycle path (that is main -> NXT_PORT_MSG_QUIT) and the
  historical fast-exit semantics are preserved.

* src/nxt_port.h -- promotes nxt_port_quit_mode_t to a public
  enum alongside NXT_PORT_MSG_QUIT.

* src/nxt_unit.c -- existing local NXT_QUIT_NORMAL / NXT_QUIT_GRACEFUL
  identifiers (used at 10+ libunit call sites) become #define
  aliases of the public names so a compile-time mismatch between
  the daemon-side and libunit-side values is structurally
  impossible -- the preprocessor substitutes the same enum value
  into every reference.  No churn at the call sites.

NXT_PORT_MSG_QUIT call-site audit
---------------------------------
  src/nxt_runtime.c:511      stop_app_processes      plumbed (rt->quit_mode)
                                                     + buffer release on
                                                     send failure
  src/nxt_runtime.c:533      stop_all_processes      plumbed (rt->quit_mode)
                                                     + buffer release on
                                                     send failure
  src/nxt_application.c:716  proto_quit_children     plumbed (cascaded byte)
                                                     + buffer release on
                                                     send failure
  src/nxt_main_process.c:1038 orphan reaping         NULL (defensive cleanup)
  src/nxt_router.c:932       prototype replaced      NULL (P6 territory)
  src/nxt_router.c:4536-4600 port-ready handlers     NULL (P6 territory)
  src/nxt_router.c:5043      idle-pool shrink        NULL (NORMAL is right)
  src/nxt_router.c:5142      app-free cleanup        NULL (out of P1 scope)

Phases P5/P6 will revisit the router sites once the listener
drain and reload endpoint exist.

Tests
-----
test/test_graceful_reload.py is new.  Three functional tests plus
one skipped placeholder:

  * test_sigquit_completes_inflight_request: SIGQUIT to main must
    take libunit's GRACEFUL branch.  Asserts on the *absence* of
    "active request on ctx quit" at nxt_unit.c:5816 -- that
    marker fires only in the NORMAL branch's force-close loop, so
    its absence is positive evidence GRACEFUL was taken.  Uses
    the ASGI delayed app so libunit's add_reader can dispatch the
    QUIT mid-request (a synchronous WSGI worker blocked in
    time.sleep would not pump libunit's message loop and the test
    would pass for the wrong reason).  We do *not* assert on the
    response body because P1 plumbs GRACEFUL through libunit only;
    the router still tears down on QUIT (router-side drain is P5),
    so the client TCP connection RSTs the moment the router exits
    regardless of whether the worker drains gracefully.

  * test_sigterm_drops_inflight_request: asserts the *presence*
    of the same marker -- positive evidence the NORMAL fast-exit
    branch ran.  Inverse of the SIGQUIT test.

  * test_sigint_takes_normal_path: regression guard against signal
    table edits that would re-route SIGINT through the SIGQUIT
    handler.  Same shape as the SIGTERM test.

  * test_quit_message_carries_quit_param is a skipped placeholder
    documenting the wire-format intent; verifying the byte
    directly would require C-level instrumentation.

A module-scoped autouse fixture skips the file with an actionable
message when --restart is missing -- the autouse run fixture in
conftest.py crashes teardown otherwise (PUT /config to a dead
daemon).

Verified
--------
  ./configure --tests --modules=python && ./configure python \
    --config=python3-config && make -j$(nproc)              # clean
  python3 -m pytest test/test_graceful_reload.py --restart  # 3 pass, 1 skip
  python3 -m pytest test/test_idle_close_wait.py --restart  # 2 pass
  python3 -m pytest test/test_procman.py --restart          # 11 pass, 2 skip,
                                                            # 1 pre-existing
                                                            # flake unrelated
                                                            # to P1
  ASAN build (-fsanitize=address):
    0 leaks attributable to nxt_runtime_quit_buf or any P1 code
    path.  Sole LeakSanitizer report is a 136-byte pre-existing
    leak in nxt_var_index_init (src/nxt_var.c:296), reproducible
    on plain origin/master with no P1 changes -- separate issue.
@andypost andypost force-pushed the p1-graceful-signal-split branch from b26a5d5 to 40f1c50 Compare May 7, 2026 23:38
@andypost
Copy link
Copy Markdown
Owner Author

andypost commented May 7, 2026

All five actionable findings addressed in 40f1c50 (force-pushed, squashed onto a single commit). Finding 6 (scope) was already correct and didn't need code changes.

Findings 1–5: applied

# Finding Where in 40f1c50
1 Buffer leak on nxt_port_socket_write failure src/nxt_runtime.c (both stop_app_processes and stop_all_processes) and src/nxt_application.c nxt_proto_quit_children — all three send sites now capture the return value and run b->completion_handler(task, b, b->parent) when != NXT_OK && b != NULL. Same shape as PR #8's port-write-failure cleanup.
2 SIGQUIT test should use ASGI test_sigquit_completes_inflight_request now uses module='asgi'. See note below — the assertion shape also had to change.
3 Normalize unknown quit_param values nxt_proto_quit_handler now clamps anything that isn't NXT_PORT_QUIT_GRACEFUL to NXT_PORT_QUIT_NORMAL after reading the byte. Comment updated to explain the wire contract.
4 Visibility on graceful-alloc failure nxt_runtime_quit_buf now logs NXT_LOG_WARN ("graceful quit payload allocation failed; this cascade leg falls back to fast exit") on nxt_buf_mem_alloc failure. The fallback-to-NORMAL behaviour is preserved.
5 Remove unused _wait_for_socket_closed helper Dropped from test/test_graceful_reload.py.

Note on Finding 2: SIGQUIT test had to change shape, not just module

Switching to ASGI surfaced a real architectural fact about P1's scope that the original test didn't account for: P1 plumbs GRACEFUL through libunit only — the router process still tears down on NXT_PORT_MSG_QUIT immediately (router-side drain is P5). So even when the app worker's libunit takes the GRACEFUL branch and drains the in-flight ASGI coroutine, the router exits before the response can be forwarded back to the client. The TCP connection RSTs and the test gets b'' on both NORMAL and GRACEFUL — b'200' in body is therefore useless as a P1 regression guard.

Confirmed in the unit log:

23:33:56  router started
23:33:57  router exited (1 second after start, ~immediately on SIGQUIT)
23:34:01  app process 24806 exited (5 seconds after start — GRACEFUL drain worked)

The unambiguous positive evidence for the GRACEFUL branch is the contrapositive of the SIGTERM test:

  • nxt_unit.c:5816 "active request on ctx quit" fires from the for-loop inside nxt_unit_quit only when quit_param == NXT_PORT_QUIT_NORMAL AND active_req is non-empty.
  • The GRACEFUL branch returns early at line 5793 when the queue is non-empty, so that for-loop is unreachable.
  • Absence of the marker is positive evidence GRACEFUL was taken.

The SIGQUIT test now waits 1.5 s after the signal (long enough for libunit to dispatch the QUIT message but well below the 5 s in-flight delay) and asserts Log.findall(r'active request on ctx quit') == []. This is a tight regression check: a real downgrade to NORMAL would emit the marker, and the same-named SIGTERM test on the next page asserts the marker IS present — symmetric pair, hard to fool.

I expanded the test's docstring to explain why we don't assert on body content, and to call out that router-side drain is P5 territory.

Verification

./configure --tests --modules=python && make -j$(nproc)                # clean
python3 -m pytest test/test_graceful_reload.py --restart -v
  test_sigquit_completes_inflight_request   PASSED
  test_sigterm_drops_inflight_request       PASSED
  test_sigint_takes_normal_path             PASSED
  test_quit_message_carries_quit_param      SKIPPED  (intentional placeholder)
python3 -m pytest test/test_idle_close_wait.py --restart  =>  2 passed
python3 -m pytest test/test_procman.py --restart          => 11 passed,
                                                             2 skipped,
                                                             1 pre-existing flake

Follow-up captured

The suggested patch checklist's Follow-up section is consistent with the plan in roadmap/plan-graceful-shutdown.md:

Thanks for the careful review.


Generated by Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants