Note: the agent is pivoting toward diagnostic modules (universal failure modes) as the core abstraction. Playbooks remain useful as evidence collectors, but new “general” behavior should prefer adding/extending a diagnostic module. See: diagnostic-modules.md.
This repo is designed so you can add investigation depth without breaking the base “on-call trust” contract.
- Docs portal: docs/README.md
- Triage methodology (the quality bar): triage-methodology.md
- Acceptance criteria: README.md
A playbook’s job is to populate Investigation.evidence using external systems:
- Prometheus (PromQL / MetricsQL)
- Kubernetes API (read-only)
- Logs backend (best-effort)
Rules:
- Best-effort: never fail the whole report if one source is down.
- Idempotent: don’t overwrite evidence that already exists unless you mean to.
- Honest errors: record failures in the investigation so the report can state what’s missing and why.
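The three rules above can be sketched as one small collection helper. This is a minimal illustration, not the repo's actual API: the `Investigation` stand-in, its `errors` field, and the `collect_source` helper are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class Investigation:
    """Hypothetical stand-in for the repo's Investigation object."""
    evidence: Dict[str, Any] = field(default_factory=dict)
    errors: Dict[str, str] = field(default_factory=dict)


def collect_source(inv: Investigation, key: str, fetch: Callable[[], Any]) -> None:
    """Populate inv.evidence[key] from one external source (hypothetical helper)."""
    # Idempotent: keep evidence that an earlier step already collected.
    if key in inv.evidence:
        return
    try:
        inv.evidence[key] = fetch()
    except Exception as exc:
        # Best-effort: one failing source must not abort the whole report.
        # Honest errors: record what is missing and why.
        inv.errors[key] = f"{type(exc).__name__}: {exc}"


def failing_logs_fetch() -> Any:
    """Simulates a logs backend that is down."""
    raise TimeoutError("logs backend timed out")


inv = Investigation()
collect_source(inv, "restarts", lambda: 3)
collect_source(inv, "restarts", lambda: 99)  # ignored: evidence already present
collect_source(inv, "logs", failing_logs_fetch)
```

After this runs, `inv.evidence` holds only the first `restarts` value, and `inv.errors` explains why `logs` is missing, so the report can say so.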
An enricher should only transform existing evidence into:
- a compact label (“what’s most likely going on”)
- a short list of why-bullets backed by evidence
- next steps (PromQL-first, plus optional kubectl fallback)
It should not perform external calls. It should never contradict base triage honesty.
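As a rough sketch of that shape, an enricher can be written as a pure function from evidence to a decision. The `Decision` class, the `enrich_crashloop` function, the label strings, and the evidence keys below are hypothetical; the PromQL metric is the standard kube-state-metrics restart counter, which this repo may or may not rely on.

```python
from dataclasses import dataclass, field
from typing import Any, List, Mapping


@dataclass(frozen=True)
class Decision:
    """Hypothetical shape mirroring decision.label / why / next."""
    label: str
    why: List[str] = field(default_factory=list)
    next: List[str] = field(default_factory=list)


def enrich_crashloop(evidence: Mapping[str, Any]) -> Decision:
    """Pure function: reads existing evidence only, makes no external calls."""
    ns = evidence.get("namespace", "<namespace>")
    pod = evidence.get("pod", "<pod>")
    # PromQL first, kubectl as an optional fallback.
    next_steps = [
        f'increase(kube_pod_container_status_restarts_total{{namespace="{ns}",pod="{pod}"}}[30m])',
        f"kubectl -n {ns} describe pod {pod}",
    ]
    restarts = evidence.get("restarts_30m")  # assumed evidence key
    if restarts is None:
        # Unknowns stay explicit instead of being guessed.
        return Decision("unknown", ["restart count unavailable in evidence"], next_steps)
    why = [f"{restarts} restarts in the last 30m"]
    if evidence.get("last_terminated_reason") == "OOMKilled":
        why.append("last termination reason: OOMKilled")
        return Decision("likely OOM kill", why, next_steps)
    return Decision("restarting, cause unknown", why, next_steps)


decision = enrich_crashloop({
    "namespace": "payments",
    "pod": "api-0",
    "restarts_30m": 7,
    "last_terminated_reason": "OOMKilled",
})
blocked = enrich_crashloop({})
```

Because the function only reads its argument, it cannot contradict base triage by fetching something new, and it is trivial to cover with golden-output tests.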
- Routing and playbook registration: agent/playbooks.py
- Base triage contract and scenarios: docs/acceptance/ (specs), and the code that populates analysis.decision
- Common evidence helpers:
  - Kubernetes context: agent/k8s_context.py
  - Prometheus queries: agent/prometheus.py
  - Logs backend: agent/logs_victorialogs.py (and agent/loki.py where applicable)
Start with your highest-cost paging alerts (high frequency or high cognitive load). For each family, define:
- target type (pod/workload/service/node/cluster/unknown)
- the top 2–3 discriminators you want the report to surface
- the “first command” you want on-call to run if the agent is blocked
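One way to capture such a family definition before writing any collection code is a small record type. Everything here is illustrative: the `AlertFamily` class, its field names, and the example values are assumptions, not something the repo defines.

```python
from dataclasses import dataclass
from typing import Tuple

# Target types as listed in this guide.
TARGET_TYPES = ("pod", "workload", "service", "node", "cluster", "unknown")


@dataclass(frozen=True)
class AlertFamily:
    """Hypothetical record describing one alert family."""
    name: str
    alertnames: Tuple[str, ...]
    target_type: str
    discriminators: Tuple[str, ...]  # the top 2-3 signals the report should surface
    first_command: str               # what on-call runs if the agent is blocked

    def __post_init__(self) -> None:
        if self.target_type not in TARGET_TYPES:
            raise ValueError(f"unknown target type: {self.target_type!r}")
        if not 2 <= len(self.discriminators) <= 3:
            raise ValueError("keep discriminators to the top 2-3")


# Example values are made up for illustration.
crashloop = AlertFamily(
    name="crashloop",
    alertnames=("KubePodCrashLooping",),
    target_type="pod",
    discriminators=("last termination reason", "restart rate", "recent image change"),
    first_command="kubectl -n <namespace> describe pod <pod>",
)
```

Writing the definition down first keeps the playbook focused: anything that does not feed one of the listed discriminators is probably not worth collecting.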
Write down what “good” looks like in terms of base triage + enrichment:
- Base triage must still be valid under Scenarios A–D
- Enrichment should add concrete, evidence-backed discriminators
- Next steps should be copy/paste friendly and ordered by signal/effort
The base contract lives here: base-contract.md
In agent/playbooks.py:
- route your alertname(s) to a new playbook
- collect only what you need; prefer shared baselines if/when present
- ensure you record evidence availability status (ok/empty/unavailable) rather than silently skipping
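The routing and status-recording steps above might look roughly like the following. This is a sketch under assumptions: the `playbook` decorator, `route`, `status_of`, and `EvidenceStatus` are hypothetical names, and the real registration mechanism in agent/playbooks.py may differ.

```python
from enum import Enum
from typing import Any, Callable, Dict, Optional


class EvidenceStatus(str, Enum):
    OK = "ok"
    EMPTY = "empty"
    UNAVAILABLE = "unavailable"


def status_of(result: Any, error: Optional[Exception] = None) -> EvidenceStatus:
    """Classify one source's outcome instead of silently skipping it."""
    if error is not None:
        return EvidenceStatus.UNAVAILABLE
    if result is None or (hasattr(result, "__len__") and len(result) == 0):
        return EvidenceStatus.EMPTY
    return EvidenceStatus.OK


Playbook = Callable[[Dict[str, Any]], Dict[str, Any]]
_ROUTES: Dict[str, Playbook] = {}


def playbook(*alertnames: str) -> Callable[[Playbook], Playbook]:
    """Hypothetical decorator that routes alertnames to a playbook."""
    def decorator(fn: Playbook) -> Playbook:
        for name in alertnames:
            _ROUTES[name] = fn
        return fn
    return decorator


def default_playbook(alert: Dict[str, Any]) -> Dict[str, Any]:
    """Fallback for alertnames with no dedicated playbook."""
    return {"status": {}}


def route(alertname: str) -> Playbook:
    return _ROUTES.get(alertname, default_playbook)


@playbook("KubePodCrashLooping")
def crashloop_playbook(alert: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for a real metrics query result carried on the alert dict.
    samples = alert.get("restart_samples")
    return {"status": {"metrics": status_of(samples).value}}
```

The point of the three-way status is that "the query ran and returned nothing" and "the query could not run" lead to different next steps, so the report should never collapse them.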
Interpret the evidence you collected into a small, stable output. Keep it boring and predictable:
- no guessing
- no external calls
- make unknowns explicit
The repo has a strong test suite under tests/.
When adding a new family or changing triage behavior, add tests that lock in:
- deterministic base triage behavior under blocked scenarios
- the key discriminators for your family (golden outputs)
- regression coverage for parsing/label extraction where relevant
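A test in this spirit could look like the sketch below. The `base_triage` function is a hypothetical stand-in for the code under test, and treating only "unavailable" sources as blocked is an assumption of the example; the real scenarios are defined by the acceptance specs.

```python
from typing import Dict, List


def base_triage(statuses: Dict[str, str]) -> Dict[str, object]:
    """Hypothetical stand-in: labels a report by which sources are unavailable."""
    blocked: List[str] = sorted(k for k, v in statuses.items() if v == "unavailable")
    return {
        "label": "blocked" if blocked else "ok",
        "why": [f"{name} unavailable" for name in blocked],
    }


def test_blocked_triage_is_deterministic() -> None:
    statuses = {"metrics": "unavailable", "k8s": "ok", "logs": "unavailable"}
    reordered = dict(reversed(list(statuses.items())))
    # Same inputs in a different order must give the same output.
    assert base_triage(statuses) == base_triage(reordered)


def test_blocked_triage_golden_output() -> None:
    result = base_triage({"metrics": "unavailable", "k8s": "ok", "logs": "empty"})
    # Golden output: locks in the label and the explicit reason.
    assert result == {"label": "blocked", "why": ["metrics unavailable"]}
```

Golden-output assertions like the second test are deliberately strict: any change to the label or the why-bullets has to be made on purpose.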
A new playbook/enricher is “done” when:
- Base triage remains honest: unknowns are explicit; no invented scope/impact/identity
- Scenario coverage: A–D produce sensible decision.label/why/next
- Evidence status is explicit: k8s/metrics/logs show ok vs unavailable vs empty
- PromQL-first next steps are present, with optional kubectl fallback
- No duplicated collection: avoid fetching the same baseline evidence in multiple places
- Tests added/updated to cover the new family and blocked behaviors
If you find yourself duplicating “pod baseline” queries (K8s context + restarts + cpu/memory + logs), consider moving toward the shared playbook approach described here: