feat(evals): trigger coverage evals for AGENTS.md skill routing — skill-pi-trigger-eval integration #1586

@visionik

Summary

AGENTS.md defines explicit skill-routing triggers (17 rules mapping keyword phrases to skills: "review cycle" -> deft-directive-review-cycle, "swarm" -> deft-directive-swarm, "triage" -> deft-directive-gh-triage, etc.). These routing rules have no automated test coverage. A false positive (wrong skill fires for an unrelated context) or false negative (correct skill does not fire when it should) is a silent correctness failure. skill-eval-harness includes skill-pi-trigger-eval for exactly this: smoke evals that check trigger keyword routing without forcing the skill flag.

The gap

Currently:

  • No test covers whether the agent routes to deft-directive-review-cycle when the user says "run review cycle".
  • No test covers whether the agent avoids triggering deft-directive-swarm when the user mentions "swarm" in an unrelated context.
  • An AGENTS.md change that breaks a routing trigger cannot be detected before shipping.

What trigger coverage looks like

A trigger-cases.jsonl file with entries per skill:

For deft-directive-review-cycle:

  • Positive: "run review cycle on this PR" -> should trigger deft-directive-review-cycle
  • Positive: "check reviews" -> should trigger
  • Negative: "I want to review this design document" -> should NOT trigger (review without cycle context)

For deft-directive-swarm:

  • Positive: "run agents in parallel on these three vBRIEFs" -> should trigger
  • Positive: "swarm these stories" -> should trigger
  • Negative: "there was a swarm of bees outside" -> should NOT trigger
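
The cases above could be recorded as one JSON object per line. The following is a minimal sketch only: the field names `prompt`, `skill`, and `should_trigger` are assumptions made for illustration, since this issue does not specify the schema that skill-pi-trigger-eval expects. Check the harness documentation before committing a real file.

```python
import json

# Hypothetical schema: these field names are assumptions for illustration,
# not the confirmed skill-pi-trigger-eval input format.
CASES = [
    {"prompt": "run review cycle on this PR",
     "skill": "deft-directive-review-cycle", "should_trigger": True},
    {"prompt": "check reviews",
     "skill": "deft-directive-review-cycle", "should_trigger": True},
    {"prompt": "I want to review this design document",
     "skill": "deft-directive-review-cycle", "should_trigger": False},
    {"prompt": "run agents in parallel on these three vBRIEFs",
     "skill": "deft-directive-swarm", "should_trigger": True},
    {"prompt": "swarm these stories",
     "skill": "deft-directive-swarm", "should_trigger": True},
    {"prompt": "there was a swarm of bees outside",
     "skill": "deft-directive-swarm", "should_trigger": False},
]


def to_jsonl(cases):
    """Serialize cases as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(case, sort_keys=True) for case in cases) + "\n"


if __name__ == "__main__":
    # Print the JSONL text; redirect to evals/trigger-cases.jsonl if desired.
    print(to_jsonl(CASES), end="")
```

For a negative case, `skill` names the skill that must not fire for that prompt.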

Proposed files

  • evals/trigger-cases.jsonl -- trigger test cases per skill in the format expected by skill-pi-trigger-eval
  • Taskfile.yml -- new task eval:triggers that runs skill-pi-trigger-eval against AGENTS.md

Acceptance criteria

  • evals/trigger-cases.jsonl has at least 2 positive and 1 negative case per trigger rule in AGENTS.md
  • task eval:triggers runs skill-pi-trigger-eval and passes
  • All positive cases pass (correct skill fires)
  • All negative cases pass (no false positives)
  • task check or CI runs task eval:triggers on AGENTS.md changes
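
The first criterion (at least 2 positive and 1 negative case per trigger rule) can be checked mechanically before the evals run. The sketch below is an illustration under assumptions: it presumes each JSONL record carries hypothetical `skill` and `should_trigger` fields, which may not match the real skill-pi-trigger-eval schema, and the caller supplies the list of skills extracted from AGENTS.md.

```python
import json
from collections import defaultdict


def coverage_gaps(jsonl_text, expected_skills, min_positive=2, min_negative=1):
    """Return {skill: counts} for every skill lacking the required cases.

    Assumes hypothetical record fields `skill` and `should_trigger`.
    """
    counts = defaultdict(lambda: {"positive": 0, "negative": 0})
    # Seed every expected skill so a skill with no cases at all is reported.
    for skill in expected_skills:
        counts[skill]
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        case = json.loads(line)
        kind = "positive" if case["should_trigger"] else "negative"
        counts[case["skill"]][kind] += 1
    return {
        skill: dict(c)
        for skill, c in counts.items()
        if c["positive"] < min_positive or c["negative"] < min_negative
    }


if __name__ == "__main__":
    sample = "\n".join([
        '{"skill": "deft-directive-swarm", "should_trigger": true}',
        '{"skill": "deft-directive-swarm", "should_trigger": true}',
        '{"skill": "deft-directive-swarm", "should_trigger": false}',
        '{"skill": "deft-directive-gh-triage", "should_trigger": true}',
    ])
    skills = ["deft-directive-swarm", "deft-directive-gh-triage"]
    print(coverage_gaps(sample, skills))
```

An empty result means every listed skill meets the minimum. Skills named in `expected_skills` but missing from the file are reported with zero counts.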

Related

Source

https://github.com/adewale/skill-eval-harness: "Trigger checks -- run Pi skill-trigger smoke evals without forcing --skill" feature; skill-pi-trigger-eval CLI.
