feat: liveness probes, recovery strategies, and skip_if rename #111
Merged
peter-adam-dy merged 1 commit into main on Mar 27, 2026
This was referenced Mar 24, 2026
e643672 to 7e00487
7e00487 to d592bb1
boennemann added a commit that referenced this pull request on Mar 24, 2026
listen now returns ALL pending events at once (batch mode) and auto-claims
their threads so multiple agents can work in parallel without conflicts.
- Add claimed_by/claimed_at fields to Thread struct
- Add ThreadClaimed/ThreadReleased event types
- Add claim_thread/release_thread/is_claimed methods to FeedbackStore
- Change listen to batch mode by default (--no-batch for legacy)
- Auto-claim threads on listen, skip already-claimed threads
- Add --agent flag (default: agent-<pid>) for agent identity
- Add release subcommand with atomic comment + release
- Add POST /feedback/api/threads/{id}/release HTTP endpoint
- Add claim badges and release button to browser overlay
- Handle thread_claimed/thread_released events in frontend polling
- Extract process_batch() for testable batch orchestration logic
- 18 new tests (7 store-level, 11 batch-logic)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
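The claim and release behaviour listed in this commit message can be sketched as follows. This is a minimal in-memory illustration, not the project's code: only the names `claimed_by`, `claim_thread`, `release_thread`, `is_claimed`, and `FeedbackStore` come from the commit message. The numeric thread ids, the `claim_batch` helper, the boolean return values, and the rule that only the claiming agent may release are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified thread record. Only `claimed_by` is taken from
/// the commit message; `claimed_at` and all other fields are omitted.
#[derive(Debug, Default)]
struct Thread {
    claimed_by: Option<String>,
}

/// Hypothetical in-memory store keyed by an assumed numeric thread id.
#[derive(Debug, Default)]
struct FeedbackStore {
    threads: HashMap<u64, Thread>,
}

impl FeedbackStore {
    /// True if the thread exists and some agent currently holds it.
    fn is_claimed(&self, id: u64) -> bool {
        self.threads.get(&id).map_or(false, |t| t.claimed_by.is_some())
    }

    /// Claims the thread for `agent`. Returns false if the thread is
    /// unknown or already claimed.
    fn claim_thread(&mut self, id: u64, agent: &str) -> bool {
        match self.threads.get_mut(&id) {
            Some(t) if t.claimed_by.is_none() => {
                t.claimed_by = Some(agent.to_string());
                true
            }
            _ => false,
        }
    }

    /// Releases the thread, but only if `agent` holds the claim
    /// (an assumption of this sketch).
    fn release_thread(&mut self, id: u64, agent: &str) -> bool {
        match self.threads.get_mut(&id) {
            Some(t) if t.claimed_by.as_deref() == Some(agent) => {
                t.claimed_by = None;
                true
            }
            _ => false,
        }
    }

    /// Batch-mode listen: claim every pending thread that is still free
    /// and skip the ones another agent already holds.
    fn claim_batch(&mut self, pending: &[u64], agent: &str) -> Vec<u64> {
        pending
            .iter()
            .copied()
            .filter(|&id| self.claim_thread(id, agent))
            .collect()
    }
}
```

With two agents, a thread claimed by the first is skipped in the second agent's batch, which is what lets them work in parallel without conflicts.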
2c53ce2 to f53e63f
a7da4e8 to c914dde
d2bcf25 to d9a6c48
Add liveness probes that continuously monitor node health after startup, automatically restart the environment when failures are detected, and provide full observability via internal logs.

## Liveness probes

- New `probes` field on all variant types with `readiness` (replaces legacy `health_check`) and `liveness` (runs continuously after healthy)
- Liveness probes support `http`, `port`, and `command` check types
- Configurable `interval_ms`, `failure_threshold`, and `max_recoveries`
- Probe commands receive node outputs as env vars and inherit user PATH
- Probe stderr is captured and logged for debugging

## Automatic recovery

- When `failure_threshold` consecutive liveness checks fail, the daemon automatically restarts the environment via `veld restart`
- Recovery count persists across restarts; after `max_recoveries` exhausted, the node is marked permanently failed
- Daemon resolves user's full PATH (including Homebrew) at startup and refreshes every 60s to handle boot-before-login scenarios

## Internal logs

- New per-run log file `.veld/logs/<run>/_veld.log` with timestamped entries for probe runs/passes/failures, recovery triggers, and start/stop lifecycle events
- Accessible via `veld logs --source internal` (CLI + follow mode)
- Management UI "Internal" tab in log source filter
- Probe failure entries include command stderr for debugging

## Other changes

- Rename `verify` to `skip_if` (alias preserved for backward compat)
- Rename `health_check` to `probes.readiness` (legacy field still works)
- Rename `HealthCheckPhase/Attempt/Passed` progress events to `ReadinessProbePhase/Attempt/Passed`
- Rename `health_phases` state field to `readiness_phases` (alias preserved)
- User-facing messages say "readiness" instead of "health check"
- `veld init` generates schema v2 with `probes.readiness`
- `veld status` shows liveness section (failures, recoveries, last error)
- Management UI shows recovery info per node and unhealthy/recovering badges
- `veld update` and `install.sh` stop all running environments before installing (with confirmation prompt) to prevent stale state files
- New `get_dependents()` graph utility for future selective restart
- `NodeStatus::Unhealthy` and `RunStatus::Recovering` state variants (forward-compatible scaffolding)

BREAKING CHANGE: Schema version bumped to "2" with new `schema/v2/` directory. The `health_check` field is deprecated in favor of `probes.readiness`. The `verify` field is renamed to `skip_if`. Both old names are accepted as aliases. Progress event JSON keys renamed from `health_check_*` to `readiness_probe_*`. State field `health_phases` renamed to `readiness_phases` (alias preserved for deserialization).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
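The interplay of `failure_threshold` and `max_recoveries` described in this commit message can be sketched as a small counter. The `LivenessTracker` type, the `Action` enum, and the choice to reset the failure streak after each trigger are assumptions for illustration; only the two config keys and the "consecutive failures, then restart, then permanently failed" behaviour come from the text.

```rust
/// What the caller should do after a probe result is recorded.
/// Hypothetical enum, not a type from the project.
#[derive(Debug, PartialEq)]
enum Action {
    /// Keep probing; nothing to do yet.
    Continue,
    /// Threshold reached and recoveries remain: restart the environment.
    Restart,
    /// Threshold reached with no recoveries left: mark the node failed.
    MarkFailed,
}

/// Hypothetical tracker. The two limits mirror the config keys
/// `failure_threshold` and `max_recoveries`.
#[derive(Debug)]
struct LivenessTracker {
    failure_threshold: u32,
    max_recoveries: u32,
    consecutive_failures: u32,
    recoveries: u32,
}

impl LivenessTracker {
    fn new(failure_threshold: u32, max_recoveries: u32) -> Self {
        LivenessTracker {
            failure_threshold,
            max_recoveries,
            consecutive_failures: 0,
            recoveries: 0,
        }
    }

    /// Records one probe result and returns the resulting action.
    fn record(&mut self, probe_passed: bool) -> Action {
        if probe_passed {
            // A passing probe breaks the streak of consecutive failures.
            self.consecutive_failures = 0;
            return Action::Continue;
        }
        self.consecutive_failures += 1;
        if self.consecutive_failures < self.failure_threshold {
            return Action::Continue;
        }
        // Threshold reached. Resetting the streak here is an assumption;
        // the recovery count is deliberately NOT reset, so it carries
        // over from one restart to the next.
        self.consecutive_failures = 0;
        if self.recoveries >= self.max_recoveries {
            Action::MarkFailed
        } else {
            self.recoveries += 1;
            Action::Restart
        }
    }
}
```

A single passing probe resets the streak, so with `failure_threshold` set to 3 only three failures in a row trigger recovery; intermittent failures alone never do.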
d9a6c48 to 4d3d98a
## Summary

- Liveness probes for `command` and `start_server` nodes to detect unhealthy state (dropped SSH tunnels, crashed processes, unreachable databases)
- Recovery strategies (`notify`, `restart_node_and_dependents`, `restart_environment`) configurable at project/node/variant level
- Rename `verify` to `skip_if` for clarity (alias preserved)
- Schema version `"2"` (v1 fully backward compatible)
- Recovery via `veld restart` subprocess

## New config shape

    {
      "schemaVersion": "2",
      "recovery_strategy": "restart_node_and_dependents",
      "nodes": {
        "database": {
          "variants": {
            "dblab": {
              "type": "command",
              "command": "./scripts/dblab/start.sh veld-${veld.run}",
              "on_stop": "./scripts/dblab/stop.sh veld-${veld.run}",
              "outputs": ["DATABASE_URL"],
              "skip_if": "./scripts/dblab/verify.sh veld-${veld.run}",
              "probes": {
                "liveness": {
                  "type": "command",
                  "command": "pg_isready -h ${DB_HOST} -p ${DB_PORT}",
                  "interval_ms": 5000,
                  "failure_threshold": 3,
                  "max_recoveries": 3
                }
              }
            }
          }
        }
      }
    }

## Key changes

- `probes` (readiness + liveness), `recovery_strategy`, `skip_if`, `LivenessProbe` def
- `RecoveryStrategy` enum, `ProbesConfig`, `LivenessProbe`, `resolve_recovery_strategy()`, `readiness_probe()`, `liveness_probe()`
- `NodeStatus::Unhealthy`, `RunStatus::Recovering`, liveness tracking fields
- `probes.readiness` (fallback to `health_check`), readiness probes for command nodes, `skip_if` rename
- `veld restart` subprocess
- `get_dependents()` for recovery subgraph identification
- `stop_subgraph()`, state helpers, strategy resolution

## Test plan

- Tests pass (`cargo test`)
- Lints pass (`cargo clippy`)
- `verify` alias works for `skip_if`
- `probes.readiness` takes precedence over `health_check`
- `get_dependents()` correctly identifies transitive dependents

🤖 Generated with Claude Code
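
The last test-plan item, transitive dependents from `get_dependents()`, can be illustrated with a breadth-first walk over reversed dependency edges. This is a sketch under assumptions: the real function's signature and graph representation are not shown in this PR, so the `HashMap` of node name to dependency list, the `BTreeSet<String>` return type, and the node names in the usage note are hypothetical.

```rust
use std::collections::{BTreeSet, HashMap, VecDeque};

/// `deps` maps each node to the nodes it depends on (assumed shape).
/// Returns every node that depends on `target`, directly or transitively,
/// excluding `target` itself. Sorted for deterministic output.
fn get_dependents(deps: &HashMap<&str, Vec<&str>>, target: &str) -> BTreeSet<String> {
    // Invert the edges: for each dependency, record who depends on it.
    let mut reverse: HashMap<&str, Vec<&str>> = HashMap::new();
    for (node, node_deps) in deps {
        for dep in node_deps {
            reverse.entry(*dep).or_default().push(*node);
        }
    }

    // Breadth-first walk outward from the target along reversed edges.
    let mut found = BTreeSet::new();
    let mut queue = VecDeque::from([target]);
    while let Some(current) = queue.pop_front() {
        for dependent in reverse.get(current).into_iter().flatten() {
            // `insert` returns false for nodes already seen, so each node
            // is queued at most once even if the graph has a cycle.
            if *dependent != target && found.insert(dependent.to_string()) {
                queue.push_back(*dependent);
            }
        }
    }
    found
}
```

For a hypothetical graph where `web` depends on `api` and `api` depends on `database`, calling `get_dependents(&deps, "database")` yields both `api` and `web`, which is the set a `restart_node_and_dependents` style recovery would need to restart alongside the failed node.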