Commit 2c53ce2

peter-adam-dy and claude committed
feat!: liveness probes, recovery strategies, and skip_if rename
Add k8s-inspired readiness and liveness probes to detect unhealthy nodes (e.g., dropped SSH tunnels, crashed processes) and automatically recover.

- New `probes` field on variants with `readiness` (replaces `health_check`) and `liveness` (runs continuously after node becomes healthy)
- New `recovery_strategy` field at project/node/variant level: `notify`, `restart_node_and_dependents` (default), `restart_environment`
- Rename `verify` to `skip_if` (alias preserved for backward compat)
- Schema version bump to "2" (v1 still accepted)
- New `Unhealthy` node status and `Recovering` run status
- Daemon liveness monitor with per-probe interval, failure threshold, max recoveries, and automatic restart via `veld restart` subprocess
- New `get_dependents()` graph utility for recovery subgraph identification
- Recovery module for stopping/restarting node subgraphs

BREAKING CHANGE: Schema version bumped to "2". The `health_check` field on start_server variants is deprecated in favor of `probes.readiness`. The `verify` field on command variants is renamed to `skip_if`. Both old field names are still accepted as aliases for backward compatibility, but new configs should use the new names. The `probes` field and `recovery_strategy` field are new additions available on all variant types.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 89b66f4 commit 2c53ce2
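
The commit message mentions a `get_dependents()` graph utility for identifying the recovery subgraph, but its code is not visible in this view. The following is only a minimal sketch of what such a helper might look like, assuming the graph is a map from node name to the names it depends on. The `DependsOn` alias, the signature, and the breadth-first traversal are all illustrative assumptions, not the project's actual implementation.

```rust
use std::collections::{BTreeSet, HashMap, VecDeque};

/// Hypothetical graph shape: each node name maps to the nodes it depends on.
/// The real veld graph type is not shown in this commit view.
pub type DependsOn = HashMap<String, Vec<String>>;

/// Returns every node that directly or transitively depends on `failed`.
/// The failed node itself is not part of the result.
pub fn get_dependents(graph: &DependsOn, failed: &str) -> BTreeSet<String> {
    // Invert the edges once: dependency -> nodes that depend on it.
    let mut reverse: HashMap<&str, Vec<&str>> = HashMap::new();
    for (node, deps) in graph {
        for dep in deps {
            reverse.entry(dep.as_str()).or_default().push(node.as_str());
        }
    }

    // Breadth-first walk over the inverted edges; `found` also guards
    // against revisiting nodes if the input happens to contain a cycle.
    let mut found = BTreeSet::new();
    let mut queue = VecDeque::from([failed]);
    while let Some(current) = queue.pop_front() {
        for &dependent in reverse.get(current).into_iter().flatten() {
            if dependent != failed && found.insert(dependent.to_string()) {
                queue.push_back(dependent);
            }
        }
    }
    found
}
```

Under these assumptions, a `restart_node_and_dependents` recovery would act on the failed node plus the returned set, while unrelated nodes keep running.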

30 files changed: +2664 -368 lines changed

AGENTS.md

Lines changed: 3 additions & 3 deletions
@@ -73,9 +73,9 @@ When a change introduces new config fields, CLI flags, subcommands, or user-visi
 |------|----------------|
 | `README.md` | Features list, CLI reference table, Configuration section |
 | `docs/configuration.md` | Config field reference (top-level table, field section, variant table) |
-| `skills/veld-config/SKILL.md` | Agent-facing config reference |
-| `skills/veld-usage/SKILL.md` | Agent-facing CLI reference |
-| `schema/v1/veld.schema.json` | JSON Schema (usually updated in code, but verify) |
+| `skills/veld/SKILL.md` | Agent-facing skill (quick reference, gotchas) |
+| `skills/veld/reference/config.md` | Agent-facing config reference |
+| `schema/v2/veld.schema.json` | JSON Schema for v2 configs (probes, recovery, skip_if) |
 | `website/llms-full.txt` | LLM-facing docs (if applicable, see `website/AGENTS.md`) |

 If the change is purely internal (refactor, bugfix with no new surface area), this checklist does not apply.

Cargo.lock

Lines changed: 5 additions & 4 deletions
Some generated files are not rendered by default.

PRD.md

Lines changed: 1 addition & 1 deletion
@@ -724,7 +724,7 @@ No cross-compilation until v1 is stable. No Tauri. No GTK. No npm in CI.
 "local": {
   "type": "command",
   "script": "./scripts/clone-db.sh",
-  "verify": "./scripts/verify-db.sh",
+  "skip_if": "./scripts/verify-db.sh",
   "outputs": ["DATABASE_URL"],
   "sensitive_outputs": ["DATABASE_URL"]
 },
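
The hunk above renames `verify` to `skip_if` on a `command` variant; the README describes it as an optional command for idempotency. The sketch below shows one way such a check could gate a step. It assumes that exit status 0 from the `skip_if` command means the work is already done, and that the command is run through `sh -c`. Both are assumptions for illustration, and `should_skip` is a hypothetical name rather than a veld function.

```rust
use std::io;
use std::process::{Command, Stdio};

/// Returns true when the step's `script` can be skipped.
/// ASSUMPTION: exit status 0 from the `skip_if` command means "already done".
/// ASSUMPTION: the check is executed via `sh -c` (Unix-like hosts only).
pub fn should_skip(skip_if: Option<&str>) -> io::Result<bool> {
    let Some(check) = skip_if else {
        // No check configured: the script always runs.
        return Ok(false);
    };
    let status = Command::new("sh")
        .arg("-c")
        .arg(check)
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .status()?;
    Ok(status.success())
}
```

With the config above, `./scripts/clone-db.sh` would only run when `./scripts/verify-db.sh` exits non-zero, which is what makes re-running the step idempotent under this reading.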

README.md

Lines changed: 8 additions & 6 deletions
@@ -17,7 +17,8 @@ No port numbers. No manual wiring. Just clean, stable, human-readable URLs.
 - **No port numbers** — work with stable HTTPS URLs instead of `localhost:3847`
 - **Dependency graph** — resolves node dependencies, parallelizes startup, reverse-order teardown
 - **TLS by default** — Caddy's internal CA handles TLS termination, auto-trusted during setup
-- **Health checks** — two-phase checks (TCP port + HTTP endpoint) before marking services healthy
+- **Health checks** — readiness probes (two-phase: TCP port + HTTP/command) gate startup; liveness probes detect failures after startup (e.g., dropped SSH tunnels)
+- **Automatic recovery** — configurable recovery strategies (`notify`, `restart_node_and_dependents`, `restart_environment`) when liveness probes fail
 - **Multiple variants** — same node, different behaviors (local server, Docker, remote URL)
 - **Named runs** — multiple environments coexist; re-running by name is idempotent
 - **Setup / teardown** — project-level lifecycle steps that gate startup (check Docker, create networks) and clean up after stop
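
The feature list above names three recovery strategies, and the commit message adds that `recovery_strategy` can be set at project, node, or variant level with a failure threshold per liveness probe. The sketch below illustrates how those pieces could fit together. The strategy names and the default come from the commit; the precedence (variant over node over project), the counting of consecutive failures, and every type and function name here are assumptions made for illustration only.

```rust
/// Strategy names as listed in the README; `restart_node_and_dependents`
/// is the default according to the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum RecoveryStrategy {
    Notify,
    #[default]
    RestartNodeAndDependents,
    RestartEnvironment,
}

impl RecoveryStrategy {
    /// Parses the config string form; unknown values yield `None`.
    pub fn parse(s: &str) -> Option<Self> {
        match s {
            "notify" => Some(Self::Notify),
            "restart_node_and_dependents" => Some(Self::RestartNodeAndDependents),
            "restart_environment" => Some(Self::RestartEnvironment),
            _ => None,
        }
    }
}

/// ASSUMPTION: the most specific level wins (variant, then node, then project).
pub fn effective_strategy(
    project: Option<RecoveryStrategy>,
    node: Option<RecoveryStrategy>,
    variant: Option<RecoveryStrategy>,
) -> RecoveryStrategy {
    variant.or(node).or(project).unwrap_or_default()
}

/// Hypothetical per-probe state.
/// ASSUMPTION: the threshold counts consecutive failures, reset on success.
pub struct LivenessTracker {
    failure_threshold: u32,
    consecutive_failures: u32,
}

impl LivenessTracker {
    pub fn new(failure_threshold: u32) -> Self {
        // A threshold of 0 is treated as 1 so a passing probe never trips it.
        Self { failure_threshold: failure_threshold.max(1), consecutive_failures: 0 }
    }

    /// Records one probe result; returns true once the threshold is reached.
    pub fn record(&mut self, probe_ok: bool) -> bool {
        if probe_ok {
            self.consecutive_failures = 0;
        } else {
            self.consecutive_failures += 1;
        }
        self.consecutive_failures >= self.failure_threshold
    }
}
```

In this reading, a monitor would call `record` once per probe interval and, when it returns true, mark the node `Unhealthy` and apply whatever `effective_strategy` resolves to.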
@@ -26,6 +27,7 @@ No port numbers. No manual wiring. Just clean, stable, human-readable URLs.
 - **Structured output** — all commands support `--json` for scripting and CI
 - **Browser dashboard** — management UI at `https://veld.localhost` with service health, logs, search, stop/restart
 - **Client-side logs** — captures browser `console.log/warn/error`, exceptions, and promise rejections; view with `veld logs --source client`
+- **Internal logs** — liveness probe outcomes (with stderr), recovery decisions, health state transitions; view with `veld logs --source internal`

 ## Install

@@ -72,8 +74,8 @@ cargo build --release

 ```json
 {
-  "$schema": "https://veld.oss.life.li/schema/v1/veld.schema.json",
-  "schemaVersion": "1",
+  "$schema": "https://veld.oss.life.li/schema/v2/veld.schema.json",
+  "schemaVersion": "2",
   "name": "myproject",
   "url_template": "{service}.{run}.{project}.localhost",
   "nodes": {
@@ -83,7 +85,7 @@ cargo build --release
     "local": {
       "type": "start_server",
       "command": "npm run dev -- --port ${veld.port}",
-      "health_check": { "type": "http", "path": "/health", "timeout_seconds": 30 }
+      "probes": { "readiness": { "type": "http", "path": "/health", "timeout_seconds": 30 } }
     }
   }
 },
@@ -93,7 +95,7 @@ cargo build --release
 "local": {
   "type": "start_server",
   "command": "npm run dev -- --port ${veld.port}",
-  "health_check": { "type": "http", "path": "/", "timeout_seconds": 30 },
+  "probes": { "readiness": { "type": "http", "path": "/", "timeout_seconds": 30 } },
   "depends_on": { "backend": "local" },
   "env": { "NEXT_PUBLIC_API_URL": "${nodes.backend.url}" }
 }
@@ -152,7 +154,7 @@ veld stop --name dev
 ### Step types

 - **`start_server`** — long-running process. Veld allocates a port (`${veld.port}`), starts the process, and runs health checks.
-- **`command`** — runs a command to completion. Can emit outputs by writing `key=value` lines to `$VELD_OUTPUT_FILE` (preferred) or via `VELD_OUTPUT key=value` on stdout (legacy, discouraged). Optional `verify` command for idempotency.
+- **`command`** — runs a command to completion. Can emit outputs by writing `key=value` lines to `$VELD_OUTPUT_FILE` (preferred) or via `VELD_OUTPUT key=value` on stdout (legacy, discouraged). Optional `skip_if` command for idempotency.

 ### Setup & teardown
