feat!(alertd/doctor): replace per-check healthy bool with result enum#457
Merged
Conversation
…ecks When a check's SQL no longer matches the schema (dropped or renamed columns, json/jsonb drift, missing functions — SQLSTATE class 42), the fault is in the healthcheck, not the deployment. Report those as WARNING with healthcheckBroken: true in the check details instead of FAIL, so the server isn't flagged as failing while the broken query stays visible. Other database errors still FAIL. Co-authored-by: Claude <noreply@anthropic.com>
Per-check wire entries now carry result: passed | warning | failed | broken | skipped instead of healthy: bool plus skipped/healthcheckBroken flags, and the top-level healthy flag is dropped from the status payload. - broken is a first-class CheckStatus: the check itself errored or is misconfigured and says nothing about the system; query_error_check produces it for class-42 SQL errors. Degrades the overall result but is not fatal. - warning goes on the wire as its own value rather than collapsing into healthy: false. - checks that returned pass with a skipped flag (fhir_jobs, sync_sessions, tamanu_service) now return proper skips. - overall_from_payload tiers on the per-check results. - the status_request_matches_spec contract test is red until canopy deploys the matching shape. Co-authored-by: Claude <noreply@anthropic.com>
The contract tests fail honestly when live canopy is behind bestool, which made the platform test jobs read as broken when the only failure was spec drift. Mark them #[ignore] so plain cargo test skips them, and run them in their own required CI job so a red check clearly means "check the canopy deploy", not "the test suite is failing". Co-authored-by: Claude <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
🤖 Replaces the per-check
healthy: bool(plusskipped/ad-hoc flags) in the doctor wire format with a singleresultenum, and distinguishes broken healthchecks from unhealthy deployments.Wire shape
Per-check entries in
health[]now carry:{ "check": "database", "result": "passed" }with
result: "passed" | "warning" | "failed" | "broken" | "skipped"— a proper sum type, exhaustively matchable on both sides:passed— check ran, system OKwarning— check ran, system degraded but not fatallyfailed— check ran, system under test is unhealthybroken— the check itself errored/misconfigured; says nothing about the systemskipped— precondition not met; check didn't runThe top-level
healthyflag is dropped from the status payload entirely.Broken healthchecks
SQL-backed checks previously reported any query error as FAIL, so a stale query (dropped column, json/jsonb drift, missing function) was indistinguishable from a genuine alert condition and blamed the deployment. Query errors with SQLSTATE class 42 ("syntax error or access rule violation") now report as
brokenvia a sharedquery_error_checkhelper applied to all SQL-backed checks; other database errors still fail. Broken degrades the overall result but is not fatal.Other changes
warninggoes on the wire as its own value rather than collapsing intohealthy: falseskipped-flag (fhir_jobs,sync_sessions,tamanu_service) now return proper skipsoverall_from_payloadtiers on the per-check results: anyfailed→ FAILING; anywarning/broken→ DEGRADEDBRKNand counts them in the summary lineNote
The live-spec contract tests now run in a dedicated "Canopy contract" CI job, excluded from the platform test jobs. That job is red on
status_request_matches_specuntil canopy deploys the matching shape — live canopy's spec still requires per-checkhealthy.