
feat!(alertd/doctor): replace per-check healthy bool with result enum #457

Merged: passcod merged 3 commits into main from doctor-broken-query-warning on Jun 3, 2026


@passcod passcod commented Jun 3, 2026

🤖 Replaces the per-check healthy: bool (plus skipped/ad-hoc flags) in the doctor wire format with a single result enum, and distinguishes broken healthchecks from unhealthy deployments.

Wire shape

Per-check entries in health[] now carry:

{ "check": "database", "result": "passed" }

with result: "passed" | "warning" | "failed" | "broken" | "skipped" — a proper sum type, exhaustively matchable on both sides:

  • passed — check ran, system OK
  • warning — check ran, system degraded but not fatally
  • failed — check ran, system under test is unhealthy
  • broken — the check itself errored/misconfigured; says nothing about the system
  • skipped — precondition not met; check didn't run
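
As a rough illustration of the sum type described above, the five wire values could be mirrored on the Rust side along these lines. This is a minimal sketch using only the standard library: the name `CheckResult`, the `as_wire` method and the `FromStr` error type are assumptions made for the example, not the types used in this PR.

```rust
use std::str::FromStr;

/// Hypothetical mirror of the wire `result` values (names are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CheckResult {
    Passed,
    Warning,
    Failed,
    Broken,
    Skipped,
}

impl CheckResult {
    /// The lowercase string that goes on the wire for this variant.
    pub fn as_wire(self) -> &'static str {
        // No wildcard arm: adding a variant forces this match to be updated.
        match self {
            CheckResult::Passed => "passed",
            CheckResult::Warning => "warning",
            CheckResult::Failed => "failed",
            CheckResult::Broken => "broken",
            CheckResult::Skipped => "skipped",
        }
    }
}

impl FromStr for CheckResult {
    type Err = String;

    /// Parses a wire string, rejecting anything outside the five known values.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "passed" => Ok(CheckResult::Passed),
            "warning" => Ok(CheckResult::Warning),
            "failed" => Ok(CheckResult::Failed),
            "broken" => Ok(CheckResult::Broken),
            "skipped" => Ok(CheckResult::Skipped),
            other => Err(format!("unknown check result: {}", other)),
        }
    }
}
```

With this shape, `"broken".parse::<CheckResult>()` yields `Ok(CheckResult::Broken)`, while a legacy value such as `"healthy"` is rejected instead of being silently mapped to a boolean.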

The top-level healthy flag is dropped from the status payload entirely.

Broken healthchecks

SQL-backed checks previously reported any query error as FAIL, so a stale query (dropped column, json/jsonb drift, missing function) was indistinguishable from a genuine alert condition and blamed the deployment. Query errors with SQLSTATE class 42 ("syntax error or access rule violation") now report as broken via a shared query_error_check helper applied to all SQL-backed checks; other database errors still fail. Broken degrades the overall result but is not fatal.
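
The class-42 rule can be sketched as a small classifier over the five-character SQLSTATE code. This is not the `query_error_check` helper itself, whose signature is not shown here; `QueryOutcome` and `classify_sqlstate` are hypothetical names, and treating an error without a SQLSTATE as a failure is an assumption.

```rust
/// Hypothetical outcome of a query error inside a SQL-backed check.
#[derive(Debug, PartialEq, Eq)]
enum QueryOutcome {
    /// The check's own SQL is at fault (SQLSTATE class 42).
    Broken,
    /// Any other database error: the system under test is unhealthy.
    Failed,
}

/// Classifies a query error by SQLSTATE class, i.e. the first two characters
/// of the code. `None` stands for errors that carry no SQLSTATE at all
/// (assumed here to count as failures).
fn classify_sqlstate(sqlstate: Option<&str>) -> QueryOutcome {
    match sqlstate {
        // Class 42 covers e.g. 42703 (undefined_column),
        // 42883 (undefined_function) and 42P01 (undefined_table).
        Some(code) if code.starts_with("42") => QueryOutcome::Broken,
        _ => QueryOutcome::Failed,
    }
}
```

Under this sketch a dropped column (`42703`) reports as broken, whereas a cancelled query (`57014`) or a connection failure still fails the check.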

Other changes

  • warning goes on the wire as its own value rather than collapsing into healthy: false
  • checks that returned pass-with-a-skipped-flag (fhir_jobs, sync_sessions, tamanu_service) now return proper skips
  • overall_from_payload tiers on the per-check results: any failed → FAILING; any warning/broken → DEGRADED
  • the CLI renders broken checks as BRKN and counts them in the summary line
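
The tiering rule for the overall result can be illustrated as follows. This is a sketch under stated assumptions rather than the body of `overall_from_payload`: the type names are hypothetical, and `Healthy` is an assumed name for the tier that applies when nothing failed, warned or broke.

```rust
/// Hypothetical per-check result, mirroring the wire values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CheckResult {
    Passed,
    Warning,
    Failed,
    Broken,
    Skipped,
}

/// Overall tiers. FAILING and DEGRADED come from the description above;
/// `Healthy` is an assumed name for the remaining tier.
#[derive(Debug, PartialEq, Eq)]
enum Overall {
    Healthy,
    Degraded,
    Failing,
}

/// Any failed check wins; otherwise any warning or broken check degrades;
/// passed and skipped checks leave the overall result untouched.
fn overall(results: &[CheckResult]) -> Overall {
    if results.iter().any(|r| *r == CheckResult::Failed) {
        Overall::Failing
    } else if results
        .iter()
        .any(|r| matches!(r, CheckResult::Warning | CheckResult::Broken))
    {
        Overall::Degraded
    } else {
        Overall::Healthy
    }
}
```

Checking for `Failed` first means a broken healthcheck can never mask a genuine failure: `[Broken, Failed]` is still failing, while `[Passed, Broken]` is only degraded.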

Note

The live-spec contract tests now run in a dedicated "Canopy contract" CI job, excluded from the platform test jobs. That job is red on status_request_matches_spec until canopy deploys the matching shape — live canopy's spec still requires per-check healthy.
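
For readers unfamiliar with the mechanism, a contract test can be kept out of a plain `cargo test` run with `#[ignore]` and then run explicitly with `cargo test -- --ignored` in its own job. The sketch below is illustrative only: `SENT_FIELDS`, `missing_fields` and the hard-coded required list are assumptions standing in for the real test, which checks against the live spec.

```rust
/// Hypothetical: the keys the client sends for each health[] entry.
const SENT_FIELDS: [&str; 2] = ["check", "result"];

/// Returns the fields the spec requires that the payload does not send.
fn missing_fields<'a>(required: &[&'a str], sent: &[&str]) -> Vec<&'a str> {
    required
        .iter()
        .copied()
        .filter(|f| !sent.iter().any(|s| s == f))
        .collect()
}

#[test]
#[ignore = "needs the live canopy spec; run with `cargo test -- --ignored`"]
fn status_request_matches_spec() {
    // Assumption: the real test fetches this list from the live spec.
    // While the spec still requires `healthy`, this assertion fails,
    // which is the expected red state described above.
    let required = ["check", "healthy"];
    assert!(missing_fields(&required, &SENT_FIELDS).is_empty());
}
```

The ignored test is skipped by the platform jobs, so a failure here surfaces only in the dedicated job and points at spec drift rather than at the test suite.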

passcod and others added 3 commits June 3, 2026 21:34
…ecks

When a check's SQL no longer matches the schema (dropped or renamed
columns, json/jsonb drift, missing functions — SQLSTATE class 42), the
fault is in the healthcheck, not the deployment. Report those as WARNING
with healthcheckBroken: true in the check details instead of FAIL, so
the server isn't flagged as failing while the broken query stays
visible. Other database errors still FAIL.

Co-authored-by: Claude <noreply@anthropic.com>

Per-check wire entries now carry result: passed | warning | failed |
broken | skipped instead of healthy: bool plus skipped/healthcheckBroken
flags, and the top-level healthy flag is dropped from the status payload.

- broken is a first-class CheckStatus: the check itself errored or is
  misconfigured and says nothing about the system; query_error_check
  produces it for class-42 SQL errors. Degrades the overall result but
  is not fatal.
- warning goes on the wire as its own value rather than collapsing into
  healthy: false.
- checks that returned pass with a skipped flag (fhir_jobs,
  sync_sessions, tamanu_service) now return proper skips.
- overall_from_payload tiers on the per-check results.
- the status_request_matches_spec contract test is red until canopy
  deploys the matching shape.

Co-authored-by: Claude <noreply@anthropic.com>

The contract tests fail honestly when live canopy is behind bestool,
which made the platform test jobs read as broken when the only failure
was spec drift. Mark them #[ignore] so plain cargo test skips them, and
run them in their own required CI job so a red check clearly means
"check the canopy deploy", not "the test suite is failing".

Co-authored-by: Claude <noreply@anthropic.com>
@passcod passcod added this pull request to the merge queue Jun 3, 2026
Merged via the queue into main with commit 8ebadec Jun 3, 2026
15 of 17 checks passed
@passcod passcod deleted the doctor-broken-query-warning branch June 3, 2026 12:00
@beyondessential-bot beyondessential-bot mentioned this pull request Jun 3, 2026