feat: liveness probes, recovery strategies, and skip_if rename by peter-adam-dy · Pull Request #111 · prosperity-solutions/veld

peter-adam-dy · 2026-03-24T14:04:18Z

Summary

- Adds k8s-inspired readiness and liveness probes for both `command` and `start_server` nodes to detect unhealthy state (dropped SSH tunnels, crashed processes, unreachable databases)
- Adds recovery strategies (`notify`, `restart_node_and_dependents`, `restart_environment`), configurable at project, node, or variant level
- Renames `verify` to `skip_if` for clarity (alias preserved)
- Bumps schema version to "2" (v1 remains fully backward compatible)
- The daemon automatically detects liveness failures and triggers recovery via a `veld restart` subprocess

New config shape

{
  "schemaVersion": "2",
  "recovery_strategy": "restart_node_and_dependents",
  "nodes": {
    "database": {
      "variants": {
        "dblab": {
          "type": "command",
          "command": "./scripts/dblab/start.sh veld-${veld.run}",
          "on_stop": "./scripts/dblab/stop.sh veld-${veld.run}",
          "outputs": ["DATABASE_URL"],
          "skip_if": "./scripts/dblab/verify.sh veld-${veld.run}",
          "probes": {
            "liveness": {
              "type": "command",
              "command": "pg_isready -h ${DB_HOST} -p ${DB_PORT}",
              "interval_ms": 5000,
              "failure_threshold": 3,
              "max_recoveries": 3
            }
          }
        }
      }
    }
  }
}

Key changes

Area	Changes
Schema	`probes` (readiness + liveness), `recovery_strategy`, `skip_if`, `LivenessProbe` def
Config	`RecoveryStrategy` enum, `ProbesConfig`, `LivenessProbe`, `resolve_recovery_strategy()`, `readiness_probe()`, `liveness_probe()`
State	`NodeStatus::Unhealthy`, `RunStatus::Recovering`, liveness tracking fields
Orchestrator	Uses `probes.readiness` (fallback to `health_check`), readiness probes for command nodes, `skip_if` rename
Daemon	Liveness monitor with per-probe intervals, failure counting, recovery triggering, `veld restart` subprocess
Graph	`get_dependents()` for recovery subgraph identification
Recovery	New module: `stop_subgraph()`, state helpers, strategy resolution
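
To illustrate what "recovery subgraph identification" involves, here is a minimal sketch of collecting the transitive dependents of a node. It is not the PR's `get_dependents()` implementation: the function name `transitive_dependents`, the `HashMap<String, Vec<String>>` representation (node name mapped to the nodes it depends on), and the breadth-first traversal are all assumptions made for the example.

```rust
use std::collections::{BTreeSet, HashMap, VecDeque};

/// Hypothetical helper (not veld's API): returns every node that depends on
/// `target`, directly or through other nodes.
/// `deps` maps each node to the nodes it depends on (assumed representation).
fn transitive_dependents(
    deps: &HashMap<String, Vec<String>>,
    target: &str,
) -> BTreeSet<String> {
    // Invert the edges: dependency -> nodes that directly depend on it.
    let mut reverse: HashMap<&str, Vec<&str>> = HashMap::new();
    for (node, node_deps) in deps {
        for dep in node_deps {
            reverse.entry(dep.as_str()).or_default().push(node.as_str());
        }
    }

    // Breadth-first walk over the inverted edges, starting at the target.
    let mut found = BTreeSet::new();
    let mut queue = VecDeque::from([target]);
    while let Some(current) = queue.pop_front() {
        if let Some(children) = reverse.get(current) {
            for &child in children {
                // `insert` returns false for nodes already visited,
                // so each node is queued at most once.
                if found.insert(child.to_string()) {
                    queue.push_back(child);
                }
            }
        }
    }
    found
}
```

With a graph where `api` depends on `database` and `web` depends on `api`, a failing `database` node would yield `api` and `web` as the set to restart alongside it.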

Test plan

- All 121 existing + new tests pass (`cargo test`)
- Zero clippy warnings (`cargo clippy`)
- Schema v1 configs still load (backward compat)
- `verify` alias works for `skip_if`
- `probes.readiness` takes precedence over `health_check`
- Recovery strategy resolution: variant > node > global > default
- `get_dependents()` correctly identifies transitive dependents
- Manual test: command node with liveness probe detects failure and triggers restart
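
The resolution order in the test plan (variant > node > global > default) can be sketched as a chain of `Option` fallbacks. This is an illustration only, not the PR's `resolve_recovery_strategy()`: the enum layout and the `resolve_sketch` function are hypothetical, and because the PR text does not say which strategy is the default, the sketch takes it as a parameter.

```rust
/// Strategy names come from the PR summary; the enum layout is an assumption.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RecoveryStrategy {
    Notify,
    RestartNodeAndDependents,
    RestartEnvironment,
}

/// Hypothetical helper: picks the most specific strategy that is set.
/// The built-in default is passed in because the PR does not state it.
fn resolve_sketch(
    variant: Option<RecoveryStrategy>,
    node: Option<RecoveryStrategy>,
    global: Option<RecoveryStrategy>,
    default: RecoveryStrategy,
) -> RecoveryStrategy {
    // `or` keeps the first `Some`, so the variant wins over the node,
    // and the node wins over the project-level (global) setting.
    variant.or(node).or(global).unwrap_or(default)
}
```

In the config shown earlier, only the project-level `recovery_strategy` is set, so under this sketch the `dblab` variant would resolve to `restart_node_and_dependents`.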

🤖 Generated with Claude Code

listen now returns ALL pending events at once (batch mode) and auto-claims their threads so multiple agents can work in parallel without conflicts.

- Add claimed_by/claimed_at fields to Thread struct
- Add ThreadClaimed/ThreadReleased event types
- Add claim_thread/release_thread/is_claimed methods to FeedbackStore
- Change listen to batch mode by default (--no-batch for legacy)
- Auto-claim threads on listen, skip already-claimed threads
- Add --agent flag (default: agent-<pid>) for agent identity
- Add release subcommand with atomic comment + release
- Add POST /feedback/api/threads/{id}/release HTTP endpoint
- Add claim badges and release button to browser overlay
- Handle thread_claimed/thread_released events in frontend polling
- Extract process_batch() for testable batch orchestration logic
- 18 new tests (7 store-level, 11 batch-logic)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
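
The claim-then-skip behavior described in this commit can be pictured with a small in-memory sketch. Everything here is hypothetical: `ClaimStore`, `claim`, `release`, and `claim_batch` are stand-ins invented for the example and do not reflect the actual `FeedbackStore` or `process_batch()` code.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory store: thread id -> agent currently holding it.
#[derive(Default)]
struct ClaimStore {
    claims: HashMap<String, String>,
}

impl ClaimStore {
    /// Claims the thread for `agent`; returns false if it is already claimed.
    fn claim(&mut self, thread_id: &str, agent: &str) -> bool {
        if self.claims.contains_key(thread_id) {
            return false;
        }
        self.claims.insert(thread_id.to_string(), agent.to_string());
        true
    }

    /// Releases the thread; returns true if a claim was actually removed.
    fn release(&mut self, thread_id: &str) -> bool {
        self.claims.remove(thread_id).is_some()
    }

    fn is_claimed(&self, thread_id: &str) -> bool {
        self.claims.contains_key(thread_id)
    }
}

/// Batch step: claim every pending thread that is still free and return
/// the ids this agent now owns. Already-claimed threads are skipped.
fn claim_batch(store: &mut ClaimStore, pending: &[&str], agent: &str) -> Vec<String> {
    pending
        .iter()
        .filter(|id| store.claim(id, agent))
        .map(|id| id.to_string())
        .collect()
}
```

Two agents calling `claim_batch` on overlapping pending lists end up with disjoint sets of threads, which is the property that lets them work in parallel.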

Add liveness probes that continuously monitor node health after startup, automatically restart the environment when failures are detected, and provide full observability via internal logs.

## Liveness probes

- New `probes` field on all variant types with `readiness` (replaces legacy `health_check`) and `liveness` (runs continuously after healthy)
- Liveness probes support `http`, `port`, and `command` check types
- Configurable `interval_ms`, `failure_threshold`, and `max_recoveries`
- Probe commands receive node outputs as env vars and inherit user PATH
- Probe stderr is captured and logged for debugging

## Automatic recovery

- When `failure_threshold` consecutive liveness checks fail, the daemon automatically restarts the environment via `veld restart`
- Recovery count persists across restarts; after `max_recoveries` exhausted, the node is marked permanently failed
- Daemon resolves user's full PATH (including Homebrew) at startup and refreshes every 60s to handle boot-before-login scenarios

## Internal logs

- New per-run log file `.veld/logs/<run>/_veld.log` with timestamped entries for probe runs/passes/failures, recovery triggers, and start/stop lifecycle events
- Accessible via `veld logs --source internal` (CLI + follow mode)
- Management UI "Internal" tab in log source filter
- Probe failure entries include command stderr for debugging

## Other changes

- Rename `verify` to `skip_if` (alias preserved for backward compat)
- Rename `health_check` to `probes.readiness` (legacy field still works)
- Rename `HealthCheckPhase/Attempt/Passed` progress events to `ReadinessProbePhase/Attempt/Passed`
- Rename `health_phases` state field to `readiness_phases` (alias preserved)
- User-facing messages say "readiness" instead of "health check"
- `veld init` generates schema v2 with `probes.readiness`
- `veld status` shows liveness section (failures, recoveries, last error)
- Management UI shows recovery info per node and unhealthy/recovering badges
- `veld update` and `install.sh` stop all running environments before installing (with confirmation prompt) to prevent stale state files
- New `get_dependents()` graph utility for future selective restart
- `NodeStatus::Unhealthy` and `RunStatus::Recovering` state variants (forward-compatible scaffolding)

BREAKING CHANGE: Schema version bumped to "2" with new `schema/v2/` directory. The `health_check` field is deprecated in favor of `probes.readiness`. The `verify` field is renamed to `skip_if`. Both old names are accepted as aliases. Progress event JSON keys renamed from `health_check_*` to `readiness_probe_*`. State field `health_phases` renamed to `readiness_phases` (alias preserved for deserialization).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
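
The interplay of `failure_threshold` and `max_recoveries` described in this commit message can be sketched as a small state machine. This is a rough model, not the daemon's code: `LivenessTracker`, `Action`, and `record` are hypothetical names, and two behaviors are assumptions made here rather than facts from the PR, namely that a passing probe resets the failure counter and that the counter also resets once a recovery is triggered.

```rust
/// What the monitor should do after recording one probe result (hypothetical).
#[derive(Debug, PartialEq, Eq)]
enum Action {
    Continue,   // keep probing
    Recover,    // threshold reached, trigger a restart
    MarkFailed, // recovery budget exhausted
}

/// Hypothetical per-node tracker; field names mirror the config keys.
struct LivenessTracker {
    failure_threshold: u32,
    max_recoveries: u32,
    consecutive_failures: u32,
    recoveries: u32,
}

impl LivenessTracker {
    fn new(failure_threshold: u32, max_recoveries: u32) -> Self {
        Self { failure_threshold, max_recoveries, consecutive_failures: 0, recoveries: 0 }
    }

    /// Records one probe result and returns the resulting action.
    fn record(&mut self, passed: bool) -> Action {
        if passed {
            // Assumption: any pass breaks the run of consecutive failures.
            self.consecutive_failures = 0;
            return Action::Continue;
        }
        self.consecutive_failures += 1;
        if self.consecutive_failures < self.failure_threshold {
            return Action::Continue;
        }
        // Assumption: counting starts over after the threshold is hit.
        self.consecutive_failures = 0;
        if self.recoveries >= self.max_recoveries {
            return Action::MarkFailed;
        }
        self.recoveries += 1;
        Action::Recover
    }
}
```

With the example config (`failure_threshold: 3`, `max_recoveries: 3`), this model would trigger a recovery on the third consecutive failure and report the node as failed once a fourth recovery would be needed.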

This was referenced Mar 24, 2026

State file TOCTOU race between daemon and CLI #112

Open

Add output_key field to readiness probes for command nodes #113

Open

Populate last_liveness_error with actual probe failure details #114

Open

peter-adam-dy force-pushed the recovering-unhealthy-state branch 10 times, most recently from e643672 to 7e00487 Compare March 24, 2026 17:50

peter-adam-dy mentioned this pull request Mar 24, 2026

feat: internal daemon log stream per run for liveness/recovery observability #115

Closed

peter-adam-dy force-pushed the recovering-unhealthy-state branch from 7e00487 to d592bb1 Compare March 24, 2026 18:13

peter-adam-dy force-pushed the recovering-unhealthy-state branch 6 times, most recently from 2c53ce2 to f53e63f Compare March 25, 2026 08:48

peter-adam-dy mentioned this pull request Mar 25, 2026

feat: expand internal log coverage (startup, stop, probe passes) #119

Closed

peter-adam-dy force-pushed the recovering-unhealthy-state branch 7 times, most recently from a7da4e8 to c914dde Compare March 25, 2026 10:27

peter-adam-dy force-pushed the recovering-unhealthy-state branch 8 times, most recently from d2bcf25 to d9a6c48 Compare March 25, 2026 14:54

peter-adam-dy marked this pull request as ready for review March 27, 2026 11:10

peter-adam-dy force-pushed the recovering-unhealthy-state branch from d9a6c48 to 4d3d98a Compare March 27, 2026 11:13

peter-adam-dy merged commit 12f94d4 into main Mar 27, 2026
16 checks passed

peter-adam-dy deleted the recovering-unhealthy-state branch March 27, 2026 11:22

peter-adam-dy merged 1 commit into main from recovering-unhealthy-state
