feat(evals): evals/shared-benchmark.json — directive framework benchmark manifest with with_skill vs without_skill paired cases

feat(evals): evals/shared-benchmark.json — directive framework benchmark manifest using skill-eval-harness

## Summary

Directive has no empirical evidence that it improves agent behavior. The ADR-001 empirical validation vBRIEF (2026-05-01-742-adr-001-empirical-validation) references deft-agent-bench as planned benchmark infrastructure, but no benchmark has shipped. This issue tracks creating a concrete benchmark manifest at evals/shared-benchmark.json that tests directive's core behavioral rules using skill-eval-harness (https://github.com/adewale/skill-eval-harness).

## The harness fit

skill-eval-harness is runner-agnostic: it prepares tasks as JSONL, expects outputs in a standard directory layout, then grades and aggregates. The with_skill vs without_skill paired variant maps directly to the directive evaluation question:

- without_skill: agent receives the task prompt with no AGENTS.md in the working directory
- with_skill: agent runs in a directive-configured directory where AGENTS.md loads main.md and the relevant skills
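A minimal sketch of how the two variants could be materialized from one fixture so that AGENTS.md is the only difference between them. The function name `make_paired_workdirs` and the directory layout are assumptions for illustration, not part of skill-eval-harness.

```python
import shutil
from pathlib import Path


def make_paired_workdirs(fixture_dir: Path, agents_md: Path, root: Path) -> dict[str, Path]:
    """Copy one fixture into two working directories that differ only by AGENTS.md.

    Hypothetical helper: names and layout are illustrative assumptions.
    """
    workdirs = {}
    for variant in ("without_skill", "with_skill"):
        workdir = root / variant
        shutil.copytree(fixture_dir, workdir)
        target = workdir / "AGENTS.md"
        if variant == "with_skill":
            # Only the with_skill copy gets the directive entry point.
            shutil.copy(agents_md, target)
        else:
            # Guard against a fixture that already ships an AGENTS.md.
            target.unlink(missing_ok=True)
        workdirs[variant] = workdir
    return workdirs
```

Building both directories from the same fixture keeps the comparison paired: any difference in agent behavior can then be attributed to the presence of AGENTS.md rather than to the task setup.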

The harness supports: repeated-run statistics, holdout/holdback anti-overfitting splits, objective assertions (contains, regex, file_exists, script), qualitative judge handoff, ablation variants, and Anthropic-compatible grading.json exports.

## What the benchmark manifest should cover

Core behavioral rules to test with the with_skill vs without_skill paired comparison:

1. Implementation intent gate (#810) -- does directive prevent the agent from writing code before a spec/vBRIEF exists?
   Assertion: the without_skill agent starts coding immediately; the with_skill agent asks for a spec or plan first.

2. Branch policy -- does directive prevent the agent from committing directly to main?
   Assertion: with_skill output contains a branch-creation step (for example git checkout -b or git switch -c); without_skill output may not.

3. task check gate -- does directive ensure the agent runs task check before committing?
   Script assertion: check whether the agent's output shows that it invoked task check before committing.

4. Anomaly-first summaries -- does the anomaly-first summary rule in main.md change agent PR descriptions?
   Judge assertion with rubric: does the summary lead with risks/anomalies before good news?

5. vBRIEF lifecycle compliance -- does directive keep scope vBRIEFs in the correct lifecycle folder?
   File assertion: vbrief/active/ contains the expected file after activation.
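The objective checks for rules 2, 3, and 5 could be sketched as plain Python predicates like the ones below. This is an illustration under assumptions: the function names are hypothetical, the regexes only recognize the common git checkout -b and git switch -c forms, and how such predicates would be wired into the harness's regex, script, or file_exists assertions is not shown here.

```python
import re
from pathlib import Path

# Assumed patterns; real agent transcripts may need broader matching.
BRANCH_CREATION = re.compile(r"git\s+(checkout\s+-b|switch\s+-c)\s+\S+")
TASK_CHECK = re.compile(r"\btask\s+check\b")
GIT_COMMIT = re.compile(r"\bgit\s+commit\b")


def creates_branch(output: str) -> bool:
    """Rule 2: the transcript contains a branch-creation command."""
    return BRANCH_CREATION.search(output) is not None


def runs_task_check_before_commit(output: str) -> bool:
    """Rule 3: `task check` appears, and appears before the first `git commit`.

    A transcript that runs `task check` and never commits also passes.
    """
    check = TASK_CHECK.search(output)
    if check is None:
        return False
    commit = GIT_COMMIT.search(output)
    return commit is None or check.start() < commit.start()


def vbrief_is_active(workdir: Path, filename: str) -> bool:
    """Rule 5: the expected vBRIEF file exists under vbrief/active/."""
    return (workdir / "vbrief" / "active" / filename).is_file()
```

Rules 1 and 4 are left out on purpose: whether the agent "asks for a plan first" or "leads with anomalies" is hard to pin down with a regex and fits the judge handoff better.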

## Required adapter

Directive is designed for interactive sessions first, so the harness cannot drive it directly. A non-interactive runner script is needed for step 3 of the harness workflow: invoke claude -p (or an equivalent non-interactive command) once per task, then write output.md and metadata.json to the expected path. This is a one-time setup on the directive side, not a change to the harness.
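The adapter could look roughly like the sketch below, written in Python rather than shell for illustration. The task fields (`id`, `prompt`, `workdir`), the per-task output directory, and the metadata keys are assumptions; the real names should be taken from the tasks.jsonl and output layout that skill-eval-harness actually produces and expects.

```python
import json
import subprocess
import time
from pathlib import Path


def run_tasks(tasks_path: Path, out_root: Path, agent_cmd: list[str]) -> int:
    """Run agent_cmd once per JSONL task; write output.md and metadata.json per task.

    Assumed (hypothetical) task fields: "id", "prompt", optional "workdir".
    Returns the number of tasks that were run.
    """
    count = 0
    for line in tasks_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # tolerate blank lines in the JSONL file
        task = json.loads(line)
        out_dir = out_root / task["id"]
        out_dir.mkdir(parents=True, exist_ok=True)
        started = time.time()
        # The prompt is passed as the final argument, e.g. ["claude", "-p", prompt].
        proc = subprocess.run(
            agent_cmd + [task["prompt"]],
            cwd=task.get("workdir"),
            capture_output=True,
            text=True,
        )
        (out_dir / "output.md").write_text(proc.stdout, encoding="utf-8")
        metadata = {
            "task_id": task["id"],
            "exit_code": proc.returncode,
            "duration_seconds": round(time.time() - started, 3),
        }
        (out_dir / "metadata.json").write_text(
            json.dumps(metadata, indent=2), encoding="utf-8"
        )
        count += 1
    return count
```

A call such as `run_tasks(Path("tasks.jsonl"), Path("outputs"), ["claude", "-p"])` would cover the claude -p case; passing the command as a list keeps the runner usable with any other agent CLI that accepts a prompt argument.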

## Proposed files

- evals/shared-benchmark.json -- benchmark manifest with tune cases for the five behavioral rules above
- evals/fixtures/ -- fixture files for cases that need them (diff patches, repo snapshots)
- scripts/run-evals.sh -- thin runner script: reads tasks.jsonl, invokes agent, writes outputs
- Taskfile.yml -- add a new eval:benchmark task that runs the full prepare/run/grade/render pipeline

## Acceptance criteria

- evals/shared-benchmark.json passes skill-benchmark validate
- At least 5 tune cases covering the behavioral rules above
- with_skill shows a measurable lift (a higher pass rate across repeated runs) over without_skill on at least 3 of the 5 cases
- task eval:benchmark runs the full pipeline end to end
- benchmark.json is produced and includes pass rates per case
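The lift criterion could be checked with something like the sketch below. It assumes graded results are available as a flat list of records with `case`, `variant`, and `passed` keys; that record shape is hypothetical and is not the benchmark.json schema, which would need to be mapped into it first.

```python
from collections import defaultdict


def pass_rates(results: list[dict]) -> dict[str, dict[str, float]]:
    """Aggregate repeated runs into a pass rate per case and variant."""
    # counts[case][variant] holds [passed, total]
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for record in results:
        cell = counts[record["case"]][record["variant"]]
        cell[0] += int(record["passed"])
        cell[1] += 1
    return {
        case: {variant: passed / total for variant, (passed, total) in variants.items()}
        for case, variants in counts.items()
    }


def cases_with_lift(rates: dict[str, dict[str, float]], min_delta: float = 0.0) -> list[str]:
    """Cases where with_skill beats without_skill by more than min_delta."""
    return sorted(
        case
        for case, by_variant in rates.items()
        if "with_skill" in by_variant
        and "without_skill" in by_variant
        and by_variant["with_skill"] - by_variant["without_skill"] > min_delta
    )
```

With only a handful of repeated runs per case, a small difference in pass rate can be noise, so a non-zero `min_delta` (or the harness's own repeated-run statistics) is probably a safer basis for the "3 of 5" check than a raw greater-than comparison.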

## Related

- Extends #742 (ADR-001 empirical validation -- this is the implementation path for the comprehension parity gate)
- The deft-agent-bench repo referenced in the ADR-001 vBRIEF is adjacent infrastructure; this issue is directive's own benchmark manifest
- Related to A1.2 (ablation suite) -- the ablations array extends this manifest once the base cases are established
- Related to A1.3 (trigger coverage) -- skill-pi-trigger-eval is a companion eval to this manifest

## Source

https://github.com/adewale/skill-eval-harness -- Python CLI for evaluating agent skills with paired with_skill/without_skill variants, anti-overfitting splits, repeated-run statistics, and deterministic grading. README: "Use it when you want to answer: does this skill improve the agent, where does it fail, and is the eval itself discriminating?"
