Skip to content

feat: auto-detect and mark dead agents on startup#44

Open
Solar2004 wants to merge 1 commit into
aannoo:mainfrom
Solar2004:feat/dead-agent
Open

feat: auto-detect and mark dead agents on startup#44
Solar2004 wants to merge 1 commit into
aannoo:mainfrom
Solar2004:feat/dead-agent

Conversation

@Solar2004

Copy link
Copy Markdown

Summary

On startup, hcom now scans all non-inactive instances with tracked PIDs. If the PID is no longer alive (e.g. after system reboot), it saves a life.stopped snapshot (compatible with hcom r <name>) and cleans up bindings. Hook-based agents without PIDs are handled by existing heartbeat staleness detection.

Changes

  • mark_dead_instances() in src/instance_lifecycle.rs — PID-based dead detection
  • Called from main.rs before router::dispatch()

On startup, hcom now scans all non-inactive instances with tracked PIDs.
If the PID is no longer alive (e.g. after system reboot), it saves a
life.stopped snapshot (compatible with hcom r <name>) and cleans up
bindings. Hook-based agents without PIDs are handled by existing
heartbeat staleness detection.

- mark_dead_instances() in instance_lifecycle.rs
- Called from main.rs before router::dispatch()
@aannoo

aannoo commented May 25, 2026

Copy link
Copy Markdown
Owner

AI review:

PR #44 - Dead-Agent Reboot Recovery

Branch: feat/dead-agent

Files touched: src/instance_lifecycle.rs, src/main.rs

Verdict

Salvage the idea, reject the implementation shape.

The useful capability is faster cleanup after reboot or process death. That belongs inside the existing lifecycle cleanup path. This PR adds a parallel cleanup function, runs it from main.rs on every CLI invocation, duplicates stop_instance, and uses a PID liveness check that can return false positives after PID reuse.

What The PR Adds

mark_dead_instances(db) scans every instance, skips inactive/launching/remote rows, calls pidtrack::is_alive(pid), snapshots dead rows, deletes binding/subscription/endpoint state, writes a stopped event, and deletes the instance.

Blockers

  1. Runs on every CLI invocation.

    The call is placed at src/main.rs:67-72, before router::dispatch(). Every hcom send, hcom list, hook callback, and plugin poll triggers a full iter_instances_full() scan plus a kill(pid, 0) syscall per row. This is not "on startup"; it is "on every hcom process." Concurrent CLI calls can also race on the DELETE for the same dead row and write duplicate stopped life events.

  2. Duplicates stop_instance.

    The new snapshot and cleanup logic mirrors stop_instance_inner in src/hooks/common.rs. The bindings/subscriptions/notify-endpoints cleanup is a near-copy. This risks divergent lifecycle behavior and means future changes have to be applied twice.

  3. PID reuse creates a correctness gap.

    pidtrack::is_alive(pid) is just kill(pid, 0). It only proves that some process has that PID. After reboot or long uptime, a recycled PID can make an unrelated process look like the original hcom child. is_alive returns true, mark_dead_instances skips the row, and the stale hcom row remains alive forever. Reboot is exactly when PID reuse is most likely, so the PR's stated value-add hits its weakest case.

What To Keep

  • The idea of using a syscall-based liveness signal instead of waiting for heartbeat decay.
  • Skipping remote rows (origin_device_id) is correct.
  • Skipping ST_INACTIVE and ST_LAUNCHING is correct.

Required Shape

  1. Move the check into the existing stale-instance cleanup path (cleanup_stale_instances in src/instance_lifecycle.rs).
  2. Gate the scan once per boot using a KV row keyed by /proc/sys/kernel/random/boot_id or platform equivalent.
  3. Use the existing stop_instance(db, name, "system", "exit:reboot") path for cleanup. Its inline comment confirms that SIGTERM/SIGKILL on a dead PID is a no-op (ESRCH/EPERM from initial killpg is fine — process already gone), so calling it on a dead PID is safe.
  4. Defend against PID reuse by checking a stronger process identity:
    • compare boot id plus process start time against the row's created_at (any process older than the row cannot be the original), or
    • if hcom plants a unique identifier in the child's environment (e.g. an HCOM_PROCESS_ID), read /proc/<pid>/environ and compare to the recorded marker, or
    • treat pre-boot PIDs as dead after boot-id change.

Merge Recommendation

Do not merge as-is. Accept a small refactor that adds PID-reuse-aware liveness to the existing lifecycle cleanup machinery.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants