feat(evals): evals/shared-benchmark.json — directive framework benchmark manifest using skill-eval-harness
Summary
Directive has no empirical evidence that it improves agent behavior. The ADR-001 empirical validation vBRIEF (2026-05-01-742-adr-001-empirical-validation) references deft-agent-bench as planned benchmark infrastructure, but no benchmark has shipped. This issue tracks creating a concrete benchmark manifest at evals/shared-benchmark.json that tests directive's core behavioral rules using skill-eval-harness (https://github.com/adewale/skill-eval-harness).
The harness fit
skill-eval-harness is runner-agnostic: it prepares tasks as JSONL, expects outputs in a standard directory layout, then grades and aggregates. The with_skill vs without_skill paired variant maps directly to the directive evaluation question:
without_skill: agent receives the task prompt with no AGENTS.md in the working directory
with_skill: agent runs in a directive-configured directory where AGENTS.md loads main.md and the relevant skills
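The paired setup above can be sketched as two workspaces built from the same fixture repo. This is a minimal illustration, not part of skill-eval-harness: the function name `prepare_variant_dirs`, the `directive_root` folder (assumed to hold AGENTS.md, main.md, and the skills), and the destination layout are all hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def prepare_variant_dirs(fixture: Path, directive_root: Path, dest: Path) -> dict[str, Path]:
    """Copy one fixture repo into a with_skill and a without_skill workspace.

    Hypothetical helper: directive_root is assumed to contain AGENTS.md,
    main.md, and any skills that AGENTS.md loads.
    """
    workspaces: dict[str, Path] = {}
    for variant in ("with_skill", "without_skill"):
        workdir = dest / variant
        shutil.copytree(fixture, workdir)
        if variant == "with_skill":
            # Overlay the directive configuration on top of the fixture.
            shutil.copytree(directive_root, workdir, dirs_exist_ok=True)
        else:
            # The baseline must not see an AGENTS.md, even a stale one.
            (workdir / "AGENTS.md").unlink(missing_ok=True)
        workspaces[variant] = workdir
    return workspaces


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    (root / "fixture").mkdir()
    (root / "fixture" / "README.md").write_text("demo repo\n", encoding="utf-8")
    (root / "directive").mkdir()
    (root / "directive" / "AGENTS.md").write_text("Load main.md\n", encoding="utf-8")
    for name, path in prepare_variant_dirs(root / "fixture", root / "directive", root / "runs").items():
        print(name, sorted(p.name for p in path.iterdir()))
```

Building both workspaces from one fixture keeps the presence of the directive files as the only difference between the two runs.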
The harness supports: repeated-run statistics, holdout/holdback anti-overfitting splits, objective assertions (contains, regex, file_exists, script), qualitative judge handoff, ablation variants, and Anthropic-compatible grading.json exports.
What the benchmark manifest should cover
Core behavioral rules to test with the with_skill vs without_skill paired comparison:
Implementation intent gate (see #810, "Implementation-intent inference is non-deterministic: agents spawn coding sub-agents from lifecycle/PR-process language without explicit action-verb authorization") -- does directive prevent the agent from writing code before a spec/vBRIEF exists?
Assertion: without_skill agent starts coding; with_skill agent asks for a plan first.
Branch policy -- does directive prevent the agent from committing directly to main?
Assertion: with_skill output contains branch creation; without_skill output may not.
task check gate -- does directive ensure the agent runs task check before committing?
Script assertion: check whether the agent invoked task check in its output.
Anomaly-first summaries -- does the anomaly-first summary rule in main.md change agent PR descriptions?
Judge assertion with rubric: does the summary lead with risks/anomalies before good news?
vBRIEF lifecycle compliance -- does directive keep scope vBRIEFs in the correct lifecycle folder?
File assertion: vbrief/active/ contains the expected file after activation.
Required adapter
Directive is interactive-session-first. A non-interactive runner script is needed for step 3 of the harness workflow: invoke claude -p (or equivalent) per task, then write output.md and metadata.json to the expected path. This is a one-time setup, not a harness change.
Proposed files
evals/shared-benchmark.json -- benchmark manifest with tune cases for the five behavioral rules above
evals/fixtures/ -- fixture files for cases that need them (diff patches, repo snapshots)
Acceptance criteria
Related
The deft-agent-bench repo referenced in the ADR-001 vBRIEF is adjacent infrastructure; this issue is directive's own benchmark manifest
Related to A1.2 (ablation suite) -- the ablations array extends this manifest once the base cases are established
Related to A1.3 (trigger coverage) -- skill-pi-trigger-eval is a companion eval to this manifest
Source
https://github.com/adewale/skill-eval-harness -- Python CLI for evaluating agent skills with paired with_skill/without_skill variants, anti-overfitting splits, repeated-run statistics, and deterministic grading. README: "Use it when you want to answer: does this skill improve the agent, where does it fail, and is the eval itself discriminating?"
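The non-interactive runner described under Required adapter might look roughly like the sketch below. It is an illustration under stated assumptions, not the harness's actual contract: the task fields `id` and `prompt`, the `<outputs_root>/<task_id>/<variant>/` layout, and the metadata.json keys are all hypothetical and would need to be aligned with what skill-eval-harness really expects. The command is injectable so that an equivalent of `claude -p` can be swapped in.

```python
import json
import subprocess
import time
from pathlib import Path
from typing import Sequence


def run_task(task: dict, workdir: Path, out_dir: Path,
             command: Sequence[str] = ("claude", "-p")) -> Path:
    """Run one task non-interactively and persist output.md plus metadata.json."""
    out_dir.mkdir(parents=True, exist_ok=True)
    started = time.monotonic()
    # The prompt is appended as the final argument, as in: claude -p "<prompt>"
    result = subprocess.run(
        [*command, task["prompt"]],
        cwd=workdir,
        capture_output=True,
        text=True,
        check=False,
    )
    (out_dir / "output.md").write_text(result.stdout, encoding="utf-8")
    metadata = {  # hypothetical keys; the real schema is defined by the harness
        "task_id": task["id"],
        "exit_code": result.returncode,
        "duration_s": round(time.monotonic() - started, 3),
    }
    (out_dir / "metadata.json").write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return out_dir


def run_all(tasks_jsonl: Path, workdir: Path, outputs_root: Path, variant: str,
            command: Sequence[str] = ("claude", "-p")) -> list[Path]:
    """Run every task in a JSONL file for one variant (with_skill or without_skill)."""
    written: list[Path] = []
    with tasks_jsonl.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines in the JSONL file
            task = json.loads(line)
            # Assumed layout: <outputs_root>/<task_id>/<variant>/
            written.append(run_task(task, workdir, outputs_root / task["id"] / variant, command))
    return written
```

Calling `run_all` once per variant, each time with the matching working directory, would produce the paired outputs that the grading step compares.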