Commit 12f94d4

feat: liveness probes, recovery strategies, and skip_if rename (#111)
## Summary

- Adds k8s-inspired **readiness and liveness probes** for both `command` and `start_server` nodes to detect unhealthy state (dropped SSH tunnels, crashed processes, unreachable databases)
- Adds **recovery strategies** (`notify`, `restart_node_and_dependents`, `restart_environment`) configurable at project/node/variant level
- Renames `verify` to **`skip_if`** for clarity (alias preserved)
- Bumps schema version to `"2"` (v1 fully backward compatible)
- Daemon automatically detects liveness failures and triggers recovery via `veld restart` subprocess

### New config shape

```json
{
  "schemaVersion": "2",
  "recovery_strategy": "restart_node_and_dependents",
  "nodes": {
    "database": {
      "variants": {
        "dblab": {
          "type": "command",
          "command": "./scripts/dblab/start.sh veld-${veld.run}",
          "on_stop": "./scripts/dblab/stop.sh veld-${veld.run}",
          "outputs": ["DATABASE_URL"],
          "skip_if": "./scripts/dblab/verify.sh veld-${veld.run}",
          "probes": {
            "liveness": {
              "type": "command",
              "command": "pg_isready -h ${DB_HOST} -p ${DB_PORT}",
              "interval_ms": 5000,
              "failure_threshold": 3,
              "max_recoveries": 3
            }
          }
        }
      }
    }
  }
}
```

### Key changes

| Area | Changes |
|------|---------|
| Schema | `probes` (readiness + liveness), `recovery_strategy`, `skip_if`, `LivenessProbe` def |
| Config | `RecoveryStrategy` enum, `ProbesConfig`, `LivenessProbe`, `resolve_recovery_strategy()`, `readiness_probe()`, `liveness_probe()` |
| State | `NodeStatus::Unhealthy`, `RunStatus::Recovering`, liveness tracking fields |
| Orchestrator | Uses `probes.readiness` (fallback to `health_check`), readiness probes for command nodes, `skip_if` rename |
| Daemon | Liveness monitor with per-probe intervals, failure counting, recovery triggering, `veld restart` subprocess |
| Graph | `get_dependents()` for recovery subgraph identification |
| Recovery | New module: `stop_subgraph()`, state helpers, strategy resolution |

## Test plan

- [x] All 121 existing + new tests pass (`cargo test`)
- [x] Zero clippy warnings (`cargo clippy`)
- [x] Schema v1 configs still load (backward compat)
- [x] `verify` alias works for `skip_if`
- [x] `probes.readiness` takes precedence over `health_check`
- [x] Recovery strategy resolution: variant > node > global > default
- [x] `get_dependents()` correctly identifies transitive dependents
- [ ] Manual test: command node with liveness probe detects failure and triggers restart

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
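The resolution order in the test plan (variant > node > global > default) can be illustrated with a minimal sketch. This is not the code from this commit: the function name `resolve_strategy`, its signature, and the explicit `default` parameter are hypothetical, chosen only to show the precedence. The commit message does not say which strategy is the built-in default, so the caller supplies it here.

```rust
/// Strategy names taken from the commit message; the Rust variant
/// names are an assumption about how they might be spelled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RecoveryStrategy {
    Notify,
    RestartNodeAndDependents,
    RestartEnvironment,
}

/// Hypothetical helper: picks the most specific strategy that is set.
/// Precedence is variant > node > global > default.
fn resolve_strategy(
    variant: Option<RecoveryStrategy>,
    node: Option<RecoveryStrategy>,
    global: Option<RecoveryStrategy>,
    default: RecoveryStrategy,
) -> RecoveryStrategy {
    // `Option::or` keeps the first `Some`, so the order of the chain
    // is exactly the precedence order.
    variant.or(node).or(global).unwrap_or(default)
}
```

With this sketch, a variant-level `restart_environment` would win over a node-level `notify`, and a config that sets nothing at any level falls through to whatever default the caller passes.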
1 parent 5c5262b commit 12f94d4

29 files changed, +2433 -423 lines changed

AGENTS.md

Lines changed: 3 additions & 3 deletions
@@ -73,9 +73,9 @@ When a change introduces new config fields, CLI flags, subcommands, or user-visi
 |------|----------------|
 | `README.md` | Features list, CLI reference table, Configuration section |
 | `docs/configuration.md` | Config field reference (top-level table, field section, variant table) |
-| `skills/veld-config/SKILL.md` | Agent-facing config reference |
-| `skills/veld-usage/SKILL.md` | Agent-facing CLI reference |
-| `schema/v1/veld.schema.json` | JSON Schema (usually updated in code, but verify) |
+| `skills/veld/SKILL.md` | Agent-facing skill (quick reference, gotchas) |
+| `skills/veld/reference/config.md` | Agent-facing config reference |
+| `schema/v2/veld.schema.json` | JSON Schema for v2 configs (probes, recovery, skip_if) |
 | `website/llms-full.txt` | LLM-facing docs (if applicable, see `website/AGENTS.md`) |
 
 If the change is purely internal (refactor, bugfix with no new surface area), this checklist does not apply.

Cargo.lock

Lines changed: 5 additions & 4 deletions

PRD.md

Lines changed: 1 addition & 1 deletion
@@ -724,7 +724,7 @@ No cross-compilation until v1 is stable. No Tauri. No GTK. No npm in CI.
 "local": {
 "type": "command",
 "script": "./scripts/clone-db.sh",
-"verify": "./scripts/verify-db.sh",
+"skip_if": "./scripts/verify-db.sh",
 "outputs": ["DATABASE_URL"],
 "sensitive_outputs": ["DATABASE_URL"]
 },

README.md

Lines changed: 8 additions & 6 deletions
@@ -17,7 +17,8 @@ No port numbers. No manual wiring. Just clean, stable, human-readable URLs.
 - **No port numbers** — work with stable HTTPS URLs instead of `localhost:3847`
 - **Dependency graph** — resolves node dependencies, parallelizes startup, reverse-order teardown
 - **TLS by default** — Caddy's internal CA handles TLS termination, auto-trusted during setup
-- **Health checks** — two-phase checks (TCP port + HTTP endpoint) before marking services healthy
+- **Health checks** — readiness probes (two-phase: TCP port + HTTP/command) gate startup; liveness probes detect failures after startup (e.g., dropped SSH tunnels)
+- **Automatic recovery** — when liveness probes detect failure, the environment is automatically restarted (configurable failure threshold and max recovery attempts)
 - **Multiple variants** — same node, different behaviors (local server, Docker, remote URL)
 - **Named runs** — multiple environments coexist; re-running by name is idempotent
 - **Setup / teardown** — project-level lifecycle steps that gate startup (check Docker, create networks) and clean up after stop
@@ -26,6 +27,7 @@ No port numbers. No manual wiring. Just clean, stable, human-readable URLs.
 - **Structured output** — all commands support `--json` for scripting and CI
 - **Browser dashboard** — management UI at `https://veld.localhost` with service health, logs, search, stop/restart
 - **Client-side logs** — captures browser `console.log/warn/error`, exceptions, and promise rejections; view with `veld logs --source client`
+- **Internal logs** — liveness probe outcomes (with stderr), recovery decisions, health state transitions; view with `veld logs --source internal`
 
 ## Install
 
@@ -72,8 +74,8 @@ cargo build --release
 
 ```json
 {
-"$schema": "https://veld.oss.life.li/schema/v1/veld.schema.json",
-"schemaVersion": "1",
+"$schema": "https://veld.oss.life.li/schema/v2/veld.schema.json",
+"schemaVersion": "2",
 "name": "myproject",
 "url_template": "{service}.{run}.{project}.localhost",
 "nodes": {
@@ -83,7 +85,7 @@ cargo build --release
 "local": {
 "type": "start_server",
 "command": "npm run dev -- --port ${veld.port}",
-"health_check": { "type": "http", "path": "/health", "timeout_seconds": 30 }
+"probes": { "readiness": { "type": "http", "path": "/health", "timeout_seconds": 30 } }
 }
 }
 },
@@ -93,7 +95,7 @@ cargo build --release
 "local": {
 "type": "start_server",
 "command": "npm run dev -- --port ${veld.port}",
-"health_check": { "type": "http", "path": "/", "timeout_seconds": 30 },
+"probes": { "readiness": { "type": "http", "path": "/", "timeout_seconds": 30 } },
 "depends_on": { "backend": "local" },
 "env": { "NEXT_PUBLIC_API_URL": "${nodes.backend.url}" }
 }
@@ -152,7 +154,7 @@ veld stop --name dev
 ### Step types
 
 - **`start_server`** — long-running process. Veld allocates a port (`${veld.port}`), starts the process, and runs health checks.
-- **`command`** — runs a command to completion. Can emit outputs by writing `key=value` lines to `$VELD_OUTPUT_FILE` (preferred) or via `VELD_OUTPUT key=value` on stdout (legacy, discouraged). Optional `verify` command for idempotency.
+- **`command`** — runs a command to completion. Can emit outputs by writing `key=value` lines to `$VELD_OUTPUT_FILE` (preferred) or via `VELD_OUTPUT key=value` on stdout (legacy, discouraged). Optional `skip_if` command for idempotency.
 
 ### Setup & teardown
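The README's "Automatic recovery" bullet mentions a configurable failure threshold and a maximum number of recovery attempts. The sketch below shows one plausible way those two settings could interact. It is not the daemon code from this commit: `LivenessTracker`, `ProbeAction`, and `record` are hypothetical names, and the rules that failures must be consecutive and that a success resets the counter are assumptions, not behavior confirmed by the diff.

```rust
/// What the monitor should do after one probe result (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum ProbeAction {
    /// Keep probing; nothing to do yet.
    Continue,
    /// Threshold reached and recovery budget remains: trigger recovery.
    Recover,
    /// Threshold reached but `max_recoveries` is exhausted.
    GiveUp,
}

/// Hypothetical per-probe state, mirroring the `failure_threshold`
/// and `max_recoveries` fields from the config example.
struct LivenessTracker {
    failure_threshold: u32,
    max_recoveries: u32,
    consecutive_failures: u32,
    recoveries: u32,
}

impl LivenessTracker {
    fn new(failure_threshold: u32, max_recoveries: u32) -> Self {
        Self { failure_threshold, max_recoveries, consecutive_failures: 0, recoveries: 0 }
    }

    /// Record one probe outcome and decide what to do next.
    fn record(&mut self, probe_ok: bool) -> ProbeAction {
        if probe_ok {
            // Assumption: a single success clears the failure streak.
            self.consecutive_failures = 0;
            return ProbeAction::Continue;
        }
        self.consecutive_failures += 1;
        if self.consecutive_failures < self.failure_threshold {
            return ProbeAction::Continue;
        }
        // Threshold reached: start a fresh streak either way.
        self.consecutive_failures = 0;
        if self.recoveries >= self.max_recoveries {
            return ProbeAction::GiveUp;
        }
        self.recoveries += 1;
        ProbeAction::Recover
    }
}
```

Under these assumptions, `failure_threshold: 3` means two isolated probe failures never trigger a restart, while `max_recoveries` caps how often a persistently failing node can restart the environment.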
