Tarka (Sanskrit: logic, reasoning, structured argument)
An on-call-oriented agent that turns Prometheus/Alertmanager alerts into actionable triage reports.
Designed for small teams to reduce "tribal knowledge" during incidents by producing consistent, honest, copy/paste-friendly investigation narratives.
- Quickstart Guide - First investigation in 5 minutes
- One Pager - Leadership overview and motivation
- Full Documentation - Complete documentation hub
Converts Prometheus/Alertmanager alerts into triage reports with:
- Deterministic base triage: Explicit about what's known and unknown (no guessing)
- Alert-specific playbooks: CPU throttling, OOM, HTTP 5xx, pod health, etc.
- Multi-source evidence: Prometheus metrics + Kubernetes context + logs (all best-effort)
- Read-only operations: Safe investigation, no cluster mutations
- Flexible deployment: Run as CLI or in-cluster webhook service
See Triage Methodology for the philosophy.
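For orientation, the input is a standard Alertmanager webhook payload. The sketch below shows a minimal payload and the label fields a triage pass typically starts from; the alert values are illustrative, and `triage_keys` is a hypothetical helper, not code from this project:

```python
# Minimal Alertmanager webhook payload (illustrative values).
payload = {
    "version": "4",
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "KubePodCrashLooping",
                "namespace": "payments",
                "pod": "checkout-7d9f",
                "severity": "warning",
            },
            "annotations": {
                "summary": "Pod payments/checkout-7d9f is crash looping"
            },
            "startsAt": "2024-01-01T00:00:00Z",
        }
    ],
}

def triage_keys(alert: dict) -> dict:
    """Extract the labels that typically decide which playbook applies."""
    labels = alert.get("labels", {})
    return {
        "alertname": labels.get("alertname"),
        "namespace": labels.get("namespace"),
        "severity": labels.get("severity", "unknown"),
    }

for alert in payload["alerts"]:
    print(triage_keys(alert))
```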
- Case Inbox -- All active alerts in one place, scored and classified automatically.
- Leadership Dashboard -- ROI metrics, signal quality, incident trends, and engineer hours saved.
- Triage Report -- Structured evidence, verdict, and copy-paste-ready next steps for every alert.
- Case Chat -- Ask follow-up questions about a specific case. The agent has full context of the investigation.
- Global Chat -- Query across all cases with tool-using AI (PromQL, kubectl, log search, and more).
Full example inputs and reports are available in the examples/ directory:
- examples/reports/pod-crashloop/report.md -- rendered triage report
- examples/reports/pod-crashloop/investigation.json -- structured JSON analysis
```bash
# Install dependencies (base + LLM provider)
poetry install                  # Base installation (no LLM)
poetry install -E vertex        # Base + Vertex AI (Gemini)
poetry install -E anthropic     # Base + Anthropic (Claude)
poetry install -E all-providers # Base + all LLM providers
```
```bash
# List active alerts
poetry run python main.py --list-alerts

# Investigate a specific alert
poetry run python main.py --alert 0

# Investigate with LLM enrichment (optional)
poetry run python main.py --alert 0 --llm
```

LLM Provider Configuration:
- Vertex AI: set LLM_PROVIDER=vertexai, GOOGLE_CLOUD_PROJECT, GOOGLE_CLOUD_LOCATION
- Anthropic: set LLM_PROVIDER=anthropic, ANTHROPIC_API_KEY
- See the Multi-Provider LLM Guide for details
For detailed setup, see Quickstart Guide.
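The provider selection above can be sketched as a small check against the documented environment variables. This is an illustrative sketch, not the project's actual validation code; only the variable names come from the docs:

```python
# Required settings per provider, taken from the documented env vars.
REQUIRED = {
    "vertexai": ["GOOGLE_CLOUD_PROJECT", "GOOGLE_CLOUD_LOCATION"],
    "anthropic": ["ANTHROPIC_API_KEY"],
}

def check_llm_config(env: dict) -> list:
    """Return the variables still missing for the configured provider."""
    provider = env.get("LLM_PROVIDER")
    if provider is None:
        return []  # LLM enrichment is optional; no provider means base triage only
    return [var for var in REQUIRED.get(provider, []) if var not in env]

# Example: Anthropic selected but the key is absent
print(check_llm_config({"LLM_PROVIDER": "anthropic"}))  # ['ANTHROPIC_API_KEY']
```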
Start a complete local development environment with one command:
```bash
# Copy environment template
cp .env.example .env

# Start PostgreSQL, NATS, and mock monitoring services
make dev-up

# Start webhook server (Terminal 2)
make dev-serve

# Start UI dev server (Terminal 3)
make dev-ui
```

Access the UI: http://localhost:5173
- Username: admin
- Password: admin123 (or from .env)
Mock Services: The local environment includes mock Prometheus/Alertmanager/Logs that return empty data, allowing you to test the full pipeline without real infrastructure. To use real services, port-forward and update .env.
Stop services:

```bash
make dev-down
```

See the Local Development Guide for detailed instructions, troubleshooting, and advanced usage.
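With the webhook server running, you can hand-feed it a test alert. The sketch below only builds the request, so it runs offline; the port and the /webhook path are assumptions, so check your server's actual route before sending:

```python
import json
import urllib.request

# A minimal firing alert in Alertmanager webhook form (illustrative values).
payload = {
    "version": "4",
    "status": "firing",
    "alerts": [{"labels": {"alertname": "TestAlert", "severity": "info"}}],
}

# NOTE: http://localhost:8000/webhook is an assumed address, not the documented one.
req = urllib.request.Request(
    "http://localhost:8000/webhook",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Actually sending is left commented out so the sketch is safe to run offline:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
print(req.get_method(), req.full_url)
```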
- Getting Started: Quickstart • Local Development • Authentication • Environment Variables
- Operating: Deployment • Operations • Testing
- Architecture: Overview • Investigation Pipeline • Diagnostics • Playbooks
- Integrations: Slack App Setup • GitHub App Setup
- Extending: Adding Playbooks • Triage Methodology
- Features: Chat • Actions • Memory • Multi-Provider LLM
- Roadmap: Completed • Planned
- Prometheus-compatible API (required) - For metrics and scope calculation
- Kubernetes API (optional but recommended) - For pod context and events
- VictoriaLogs (optional) - For log evidence (agent remains useful without logs)
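As a sanity check that a Prometheus-compatible API is wired up, the standard instant-query endpoint can be exercised. This sketch only builds a query URL and parses a sample response shape, so it runs without a live server; the localhost address is an assumption:

```python
import json
from urllib.parse import urlencode

# Build an instant-query URL against the standard Prometheus HTTP API.
base = "http://localhost:9090"  # assumed address; any Prometheus-compatible API works
query = 'sum(rate(container_cpu_usage_seconds_total{namespace="default"}[5m]))'
url = f"{base}/api/v1/query?{urlencode({'query': query})}"

# Shape of a successful /api/v1/query response (vector result).
sample = json.loads("""
{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {"metric": {"namespace": "default"}, "value": [1700000000, "0.42"]}
    ]
  }
}
""")

def first_value(resp: dict):
    """Pull the first sample value out of a vector response, or None if empty."""
    result = resp.get("data", {}).get("result", [])
    return float(result[0]["value"][1]) if result else None

print(first_value(sample))  # 0.42
```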
We welcome contributions! See CONTRIBUTING.md for development setup, coding standards, and PR guidelines.
Licensed under the Apache License 2.0.
© 2026 Dinesh Auti and the Tarka Contributors.