feat(evals): trigger coverage evals for AGENTS.md skill routing using skill-pi-trigger-eval
Summary
AGENTS.md defines explicit skill-routing triggers (17 rules mapping keyword phrases to skills: "review cycle" -> deft-directive-review-cycle, "swarm" -> deft-directive-swarm, "triage" -> deft-directive-gh-triage, etc.). These routing rules have no automated test coverage. A false positive (wrong skill fires for an unrelated context) or false negative (correct skill does not fire when it should) is a silent correctness failure. skill-eval-harness includes skill-pi-trigger-eval for exactly this: smoke evals that check trigger keyword routing without forcing the skill flag.
The gap
Currently:
- No test covers "does the agent correctly route to deft-directive-review-cycle when the user says run review cycle?"
- No test covers "does the agent NOT trigger deft-directive-swarm when the user says something that contains the word swarm in an unrelated context?"
- An AGENTS.md change that breaks a routing trigger cannot be detected before shipping
What trigger coverage looks like
A trigger-cases.jsonl file with entries per skill:
For deft-directive-review-cycle:
- Positive: "run review cycle on this PR" -> should trigger deft-directive-review-cycle
- Positive: "check reviews" -> should trigger
- Negative: "I want to review this design document" -> should NOT trigger (review without cycle context)
For deft-directive-swarm:
- Positive: "run agents in parallel on these three vBRIEFs" -> should trigger
- Positive: "swarm these stories" -> should trigger
- Negative: "there was a swarm of bees outside" -> should NOT trigger
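The cases above could be captured as JSON Lines roughly as follows. This is a minimal sketch: the field names `skill`, `prompt`, and `should_trigger` are assumptions made for illustration and are not taken from the skill-pi-trigger-eval documentation, so they would need to be adapted to whatever schema the tool actually expects.

```python
import json

# Hypothetical schema: "skill", "prompt" and "should_trigger" are illustrative
# field names, not the documented input format of skill-pi-trigger-eval.
CASES = [
    {"skill": "deft-directive-review-cycle", "prompt": "run review cycle on this PR", "should_trigger": True},
    {"skill": "deft-directive-review-cycle", "prompt": "check reviews", "should_trigger": True},
    {"skill": "deft-directive-review-cycle", "prompt": "I want to review this design document", "should_trigger": False},
    {"skill": "deft-directive-swarm", "prompt": "run agents in parallel on these three vBRIEFs", "should_trigger": True},
    {"skill": "deft-directive-swarm", "prompt": "swarm these stories", "should_trigger": True},
    {"skill": "deft-directive-swarm", "prompt": "there was a swarm of bees outside", "should_trigger": False},
]


def to_jsonl(cases):
    """Serialize cases as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(case) for case in cases) + "\n"


if __name__ == "__main__":
    # Redirect the output to evals/trigger-cases.jsonl to create the file.
    print(to_jsonl(CASES), end="")
```

Keeping the positive and negative cases for a skill next to each other makes it easier to see at a glance which rules still lack a negative case.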
Proposed files
- evals/trigger-cases.jsonl -- trigger test cases per skill in the format expected by skill-pi-trigger-eval
- Taskfile.yml -- new task eval:triggers: runs skill-pi-trigger-eval against AGENTS.md
Acceptance criteria
- evals/trigger-cases.jsonl has at least 2 positive and 1 negative case per trigger rule in AGENTS.md
- task eval:triggers runs skill-pi-trigger-eval and passes
- All positive cases pass (correct skill fires)
- All negative cases pass (no false positives)
- task check or CI runs task eval:triggers on AGENTS.md changes
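The first criterion (at least 2 positive and 1 negative case per rule) could be checked mechanically before the evals run. The sketch below assumes the same hypothetical `skill` / `should_trigger` fields as illustration, and assumes each trigger rule maps to exactly one skill; `coverage_gaps` is a made-up helper name, not part of skill-eval-harness.

```python
import json
from collections import Counter

MIN_POSITIVE = 2  # from the acceptance criteria
MIN_NEGATIVE = 1


def coverage_gaps(jsonl_text, skills):
    """Return {skill: counts} for every skill below the minimum case counts.

    Assumes each JSONL record has hypothetical "skill" and "should_trigger"
    fields; the real skill-pi-trigger-eval schema may differ.
    """
    counts = Counter()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        case = json.loads(line)
        counts[(case["skill"], bool(case["should_trigger"]))] += 1

    gaps = {}
    for skill in skills:
        positive = counts[(skill, True)]
        negative = counts[(skill, False)]
        if positive < MIN_POSITIVE or negative < MIN_NEGATIVE:
            gaps[skill] = {"positive": positive, "negative": negative}
    return gaps
```

An empty result means every listed skill meets the minimum; a non-empty result names the skills that still need cases, which a CI step could treat as a failure.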
Related
Source
https://github.com/adewale/skill-eval-harness: "Trigger checks -- run Pi skill-trigger smoke evals without forcing --skill" feature; skill-pi-trigger-eval CLI.