
feat: liveness probes, recovery strategies, and skip_if rename#111

Merged
peter-adam-dy merged 1 commit into main from recovering-unhealthy-state on Mar 27, 2026

@peter-adam-dy
Collaborator

Summary

  • Adds k8s-inspired readiness and liveness probes for both command and start_server nodes to detect unhealthy state (dropped SSH tunnels, crashed processes, unreachable databases)
  • Adds recovery strategies (notify, restart_node_and_dependents, restart_environment) configurable at project/node/variant level
  • Renames verify to skip_if for clarity (alias preserved)
  • Bumps schema version to "2" (v1 fully backward compatible)
  • Daemon automatically detects liveness failures and triggers recovery via veld restart subprocess

New config shape

{
  "schemaVersion": "2",
  "recovery_strategy": "restart_node_and_dependents",
  "nodes": {
    "database": {
      "variants": {
        "dblab": {
          "type": "command",
          "command": "./scripts/dblab/start.sh veld-${veld.run}",
          "on_stop": "./scripts/dblab/stop.sh veld-${veld.run}",
          "outputs": ["DATABASE_URL"],
          "skip_if": "./scripts/dblab/verify.sh veld-${veld.run}",
          "probes": {
            "liveness": {
              "type": "command",
              "command": "pg_isready -h ${DB_HOST} -p ${DB_PORT}",
              "interval_ms": 5000,
              "failure_threshold": 3,
              "max_recoveries": 3
            }
          }
        }
      }
    }
  }
}

Key changes

| Area | Changes |
| --- | --- |
| Schema | probes (readiness + liveness), recovery_strategy, skip_if, LivenessProbe def |
| Config | RecoveryStrategy enum, ProbesConfig, LivenessProbe, resolve_recovery_strategy(), readiness_probe(), liveness_probe() |
| State | NodeStatus::Unhealthy, RunStatus::Recovering, liveness tracking fields |
| Orchestrator | Uses probes.readiness (fallback to health_check), readiness probes for command nodes, skip_if rename |
| Daemon | Liveness monitor with per-probe intervals, failure counting, recovery triggering, veld restart subprocess |
| Graph | get_dependents() for recovery subgraph identification |
| Recovery | New module: stop_subgraph(), state helpers, strategy resolution |
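
To illustrate what "recovery subgraph identification" could involve, here is a minimal sketch of a transitive-dependents lookup. It assumes the graph is a plain map from node name to the names it depends on; the `DependsOn` alias and the function signature are hypothetical and may not match the actual `get_dependents()` in this PR.

```rust
use std::collections::{BTreeSet, HashMap, VecDeque};

/// Hypothetical stand-in for the dependency graph: maps each node name to the
/// nodes it depends on. The real graph type in the PR may differ.
pub type DependsOn = HashMap<String, Vec<String>>;

/// Returns every node that directly or transitively depends on `root`,
/// excluding `root` itself, in sorted order.
pub fn get_dependents(graph: &DependsOn, root: &str) -> BTreeSet<String> {
    // Invert the edges once: dependency -> nodes that depend on it.
    let mut reverse: HashMap<&str, Vec<&str>> = HashMap::new();
    for (node, deps) in graph {
        for dep in deps {
            reverse.entry(dep.as_str()).or_default().push(node.as_str());
        }
    }

    // Breadth-first walk over the inverted edges.
    let mut found = BTreeSet::new();
    let mut queue = VecDeque::from([root]);
    while let Some(current) = queue.pop_front() {
        for &dependent in reverse.get(current).into_iter().flatten() {
            // `insert` returns false for nodes already visited, so cycles
            // and diamond-shaped dependencies are only expanded once.
            if dependent != root && found.insert(dependent.to_string()) {
                queue.push_back(dependent);
            }
        }
    }
    found
}
```

With a chain where a backend depends on `database` and a frontend depends on the backend (illustrative node names), the lookup for `database` would return both, which is the set a `restart_node_and_dependents` strategy would need to stop and restart.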

Test plan

  • All 121 existing + new tests pass (cargo test)
  • Zero clippy warnings (cargo clippy)
  • Schema v1 configs still load (backward compat)
  • verify alias works for skip_if
  • probes.readiness takes precedence over health_check
  • Recovery strategy resolution: variant > node > global > default
  • get_dependents() correctly identifies transitive dependents
  • Manual test: command node with liveness probe detects failure and triggers restart
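
The resolution order listed above (variant > node > global > default) can be sketched as a chain of `Option` fallbacks. The variant names mirror the strategy names in the summary, but the parameter list is an assumption, and the built-in default is passed in explicitly because this PR text does not say which strategy it is.

```rust
/// Strategy names taken from the PR summary; the real enum may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RecoveryStrategy {
    Notify,
    RestartNodeAndDependents,
    RestartEnvironment,
}

/// Most specific setting wins: variant, then node, then project-wide.
/// `fallback` stands in for the built-in default (hypothetical parameter).
pub fn resolve_recovery_strategy(
    variant: Option<RecoveryStrategy>,
    node: Option<RecoveryStrategy>,
    global: Option<RecoveryStrategy>,
    fallback: RecoveryStrategy,
) -> RecoveryStrategy {
    variant.or(node).or(global).unwrap_or(fallback)
}
```

A variant-level setting overrides everything else, and the fallback is used only when no level sets a strategy.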

🤖 Generated with Claude Code

@peter-adam-dy force-pushed the recovering-unhealthy-state branch 10 times, most recently from e643672 to 7e00487 (March 24, 2026 17:50)
@peter-adam-dy force-pushed the recovering-unhealthy-state branch from 7e00487 to d592bb1 (March 24, 2026 18:13)
boennemann added a commit that referenced this pull request Mar 24, 2026
listen now returns ALL pending events at once (batch mode) and auto-claims
their threads so multiple agents can work in parallel without conflicts.

- Add claimed_by/claimed_at fields to Thread struct
- Add ThreadClaimed/ThreadReleased event types
- Add claim_thread/release_thread/is_claimed methods to FeedbackStore
- Change listen to batch mode by default (--no-batch for legacy)
- Auto-claim threads on listen, skip already-claimed threads
- Add --agent flag (default: agent-<pid>) for agent identity
- Add release subcommand with atomic comment + release
- Add POST /feedback/api/threads/{id}/release HTTP endpoint
- Add claim badges and release button to browser overlay
- Handle thread_claimed/thread_released events in frontend polling
- Extract process_batch() for testable batch orchestration logic
- 18 new tests (7 store-level, 11 batch-logic)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@peter-adam-dy force-pushed the recovering-unhealthy-state branch 6 times, most recently from 2c53ce2 to f53e63f (March 25, 2026 08:48)
@peter-adam-dy force-pushed the recovering-unhealthy-state branch 7 times, most recently from a7da4e8 to c914dde (March 25, 2026 10:27)
@peter-adam-dy force-pushed the recovering-unhealthy-state branch 8 times, most recently from d2bcf25 to d9a6c48 (March 25, 2026 14:54)
@peter-adam-dy marked this pull request as ready for review (March 27, 2026 11:10)
Add liveness probes that continuously monitor node health after startup,
automatically restart the environment when failures are detected, and
provide full observability via internal logs.

## Liveness probes

- New `probes` field on all variant types with `readiness` (replaces
  legacy `health_check`) and `liveness` (runs continuously after healthy)
- Liveness probes support `http`, `port`, and `command` check types
- Configurable `interval_ms`, `failure_threshold`, and `max_recoveries`
- Probe commands receive node outputs as env vars and inherit user PATH
- Probe stderr is captured and logged for debugging
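
As an illustration of the simplest of the three check types, a `port` check could be approximated with a TCP connect attempt. This is a sketch only: it assumes the probe targets the loopback address and treats any successful connection as a pass, neither of which is confirmed by this commit message. The function name `port_probe` is hypothetical.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Returns true if a TCP connection to 127.0.0.1:`port` succeeds within
/// `timeout`. Assumes a loopback target; `timeout` must be non-zero.
pub fn port_probe(port: u16, timeout: Duration) -> bool {
    let addr = SocketAddr::from(([127, 0, 0, 1], port));
    TcpStream::connect_timeout(&addr, timeout).is_ok()
}
```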

## Automatic recovery

- When `failure_threshold` consecutive liveness checks fail, the daemon
  automatically restarts the environment via `veld restart`
- Recovery count persists across restarts; once `max_recoveries` is
  exhausted, the node is marked permanently failed
- Daemon resolves user's full PATH (including Homebrew) at startup and
  refreshes every 60s to handle boot-before-login scenarios
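
The interplay of `failure_threshold` and `max_recoveries` can be sketched as a small state machine. This is not the daemon's code: the `LivenessTracker` and `Action` types are hypothetical, and two details are assumptions, namely that a passing probe resets the consecutive-failure count and that the node is marked failed on the first threshold breach after `max_recoveries` recoveries have already been used.

```rust
/// What the monitor should do after one probe result (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    Continue,
    Recover,
    MarkFailed,
}

/// Per-node liveness bookkeeping (hypothetical type).
pub struct LivenessTracker {
    failure_threshold: u32,
    max_recoveries: u32,
    consecutive_failures: u32,
    recoveries: u32,
}

impl LivenessTracker {
    pub fn new(failure_threshold: u32, max_recoveries: u32) -> Self {
        Self {
            failure_threshold,
            max_recoveries,
            consecutive_failures: 0,
            recoveries: 0,
        }
    }

    /// Records one probe result and returns the action to take.
    pub fn record(&mut self, probe_passed: bool) -> Action {
        if probe_passed {
            // Assumption: any pass resets the consecutive-failure streak.
            self.consecutive_failures = 0;
            return Action::Continue;
        }
        self.consecutive_failures += 1;
        if self.consecutive_failures < self.failure_threshold {
            return Action::Continue;
        }
        // Threshold reached: start a fresh streak for the next cycle.
        self.consecutive_failures = 0;
        if self.recoveries >= self.max_recoveries {
            return Action::MarkFailed;
        }
        // The recovery count is kept across restarts, as described above.
        self.recoveries += 1;
        Action::Recover
    }
}
```

With the values from the example config (`failure_threshold: 3`, `max_recoveries: 3`), this sketch would trigger a recovery on every third consecutive failure, up to three times, and report `MarkFailed` on the next breach.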

## Internal logs

- New per-run log file `.veld/logs/<run>/_veld.log` with timestamped
  entries for probe runs/passes/failures, recovery triggers, and
  start/stop lifecycle events
- Accessible via `veld logs --source internal` (CLI + follow mode)
- Management UI "Internal" tab in log source filter
- Probe failure entries include command stderr for debugging

## Other changes

- Rename `verify` to `skip_if` (alias preserved for backward compat)
- Rename `health_check` to `probes.readiness` (legacy field still works)
- Rename `HealthCheckPhase/Attempt/Passed` progress events to
  `ReadinessProbePhase/Attempt/Passed`
- Rename `health_phases` state field to `readiness_phases` (alias preserved)
- User-facing messages say "readiness" instead of "health check"
- `veld init` generates schema v2 with `probes.readiness`
- `veld status` shows liveness section (failures, recoveries, last error)
- Management UI shows recovery info per node and unhealthy/recovering badges
- `veld update` and `install.sh` stop all running environments before
  installing (with confirmation prompt) to prevent stale state files
- New `get_dependents()` graph utility for future selective restart
- `NodeStatus::Unhealthy` and `RunStatus::Recovering` state variants
  (forward-compatible scaffolding)

BREAKING CHANGE: Schema version bumped to "2" with new `schema/v2/`
directory. The `health_check` field is deprecated in favor of
`probes.readiness`. The `verify` field is renamed to `skip_if`. Both old
names are accepted as aliases. Progress event JSON keys renamed from
`health_check_*` to `readiness_probe_*`. State field `health_phases`
renamed to `readiness_phases` (alias preserved for deserialization).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@peter-adam-dy force-pushed the recovering-unhealthy-state branch from d9a6c48 to 4d3d98a (March 27, 2026 11:13)
@peter-adam-dy merged commit 12f94d4 into main (Mar 27, 2026)
16 checks passed
@peter-adam-dy deleted the recovering-unhealthy-state branch (March 27, 2026 11:22)
