feat(evals): evals/shared-benchmark.json — directive framework benchmark manifest with with_skill vs without_skill paired cases #1584

@visionik

feat(evals): evals/shared-benchmark.json — directive framework benchmark manifest using skill-eval-harness

Summary

Directive has no empirical evidence that it improves agent behavior. The ADR-001 empirical validation vBRIEF (2026-05-01-742-adr-001-empirical-validation) references deft-agent-bench as planned benchmark infrastructure, but no benchmark has shipped. This issue tracks creating a concrete benchmark manifest at evals/shared-benchmark.json that tests directive's core behavioral rules using skill-eval-harness (https://github.com/adewale/skill-eval-harness).

The harness fit

skill-eval-harness is runner-agnostic: it prepares tasks as JSONL, expects outputs in a standard directory layout, then grades and aggregates. The with_skill vs without_skill paired variant maps directly to the directive evaluation question:

  • without_skill: agent receives the task prompt with no AGENTS.md in the working directory
  • with_skill: agent runs in a directive-configured directory where AGENTS.md loads main.md and the relevant skills

The harness supports: repeated-run statistics, holdout/holdback anti-overfitting splits, objective assertions (contains, regex, file_exists, script), qualitative judge handoff, ablation variants, and Anthropic-compatible grading.json exports.
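To make the objective assertion types concrete, here is a minimal sketch of how contains, regex, and file_exists checks could be evaluated against one run. It is not the harness's own implementation: the function name `check_assertion` and the dictionary keys (`type`, `value`, `pattern`, `path`) are assumptions made for illustration, and the script type is left out.

```python
import re
from pathlib import Path


def check_assertion(assertion: dict, output_text: str, workdir: Path) -> bool:
    """Evaluate one objective assertion against an agent run.

    The assertion shape (keys "type", "value", "pattern", "path") is a
    hypothetical schema, not the one skill-eval-harness defines.
    """
    kind = assertion["type"]
    if kind == "contains":
        # Plain substring match on the agent's output text.
        return assertion["value"] in output_text
    if kind == "regex":
        # Pattern may match anywhere in the output.
        return re.search(assertion["pattern"], output_text) is not None
    if kind == "file_exists":
        # Path is resolved relative to the run's working directory.
        return (workdir / assertion["path"]).exists()
    raise ValueError(f"unsupported assertion type: {kind}")
```

For example, the branch policy case could use a contains or regex assertion on the output, while the vBRIEF lifecycle case maps to file_exists.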

What the benchmark manifest should cover

Core behavioral rules to test with the with_skill vs without_skill paired comparison:

  1. Implementation intent gate (#810, "Implementation-intent inference is non-deterministic: agents spawn coding sub-agents from lifecycle/PR-process language without explicit action-verb authorization") -- does directive prevent the agent from writing code before a spec/vBRIEF exists?
    Assertion: without_skill agent starts coding; with_skill agent asks for a plan first.

  2. Branch policy -- does directive prevent the agent from committing directly to main?
    Assertion: with_skill output contains branch creation; without_skill output may not.

  3. task check gate -- does directive ensure the agent runs task check before committing?
    Script assertion: check whether the agent invoked task check in its output.

  4. Anomaly-first summaries -- does the anomaly-first summary rule in main.md change agent PR descriptions?
    Judge assertion with rubric: does the summary lead with risks/anomalies before good news?

  5. vBRIEF lifecycle compliance -- does directive keep scope vBRIEFs in the correct lifecycle folder?
    File assertion: vbrief/active/ contains the expected file after activation.
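As an illustration of the script assertion for rule 3, the sketch below checks whether `task check` shows up in the agent's output before the first `git commit`. The function name and the pass/fail policy (a run with a commit but no preceding `task check` fails; a run with no `task check` at all fails) are assumptions, not something the harness or directive prescribes.

```python
import re

# Match the commands as whole words, tolerating extra whitespace.
TASK_CHECK = re.compile(r"\btask\s+check\b")
GIT_COMMIT = re.compile(r"\bgit\s+commit\b")


def ran_task_check_before_commit(output_text: str) -> bool:
    """Return True if `task check` appears before the first `git commit`.

    Hypothetical policy: the output must mention `task check` at least
    once, and that mention must precede any commit.
    """
    check = TASK_CHECK.search(output_text)
    if check is None:
        return False
    commit = GIT_COMMIT.search(output_text)
    if commit is None:
        # No commit in the transcript, so the gate was not bypassed.
        return True
    return check.start() < commit.start()
```

A text match like this only sees what the agent reported, so it may be worth pairing it with a check on the actual command log if the runner records one.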

Required adapter

Directive is interactive-session-first. A non-interactive runner script is needed for step 3 of the harness workflow: invoke claude -p (or equivalent) per task, write output.md and metadata.json to the expected path. This is a one-time setup, not a harness change.
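The adapter could look roughly like the sketch below, written in Python rather than shell for illustration. It reads tasks from a JSONL file, invokes `claude -p` per task, and writes output.md and metadata.json. The task fields (`id`, `variant`, `prompt`, `workdir`), the output directory layout, and the metadata keys are all assumptions; the real names should come from what the harness's prepare step emits and what its grade step expects.

```python
import json
import subprocess
import time
from pathlib import Path


def run_agent(prompt: str, workdir: Path) -> str:
    """Invoke the agent non-interactively in workdir and return its stdout."""
    result = subprocess.run(
        ["claude", "-p", prompt],
        cwd=workdir,
        capture_output=True,
        text=True,
        check=False,  # A failed run is still recorded for grading.
    )
    return result.stdout


def run_tasks(tasks_path: Path, out_root: Path, agent=run_agent) -> int:
    """Run every task in a JSONL file; return the number of tasks run."""
    count = 0
    with tasks_path.open() as handle:
        for line in handle:
            if not line.strip():
                continue
            task = json.loads(line)
            # Hypothetical layout: <out_root>/<task id>/<variant>/
            out_dir = out_root / task["id"] / task["variant"]
            out_dir.mkdir(parents=True, exist_ok=True)
            started = time.time()
            output = agent(task["prompt"], Path(task["workdir"]))
            (out_dir / "output.md").write_text(output)
            # Hypothetical metadata keys.
            metadata = {
                "task_id": task["id"],
                "variant": task["variant"],
                "duration_seconds": round(time.time() - started, 3),
            }
            (out_dir / "metadata.json").write_text(json.dumps(metadata, indent=2))
            count += 1
    return count
```

Passing the agent as a parameter keeps the loop testable with a stub, and lets an equivalent CLI replace `claude -p` without touching the file handling.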

Proposed files

  • evals/shared-benchmark.json -- benchmark manifest with tune cases for the five behavioral rules above
  • evals/fixtures/ -- fixture files for cases that need them (diff patches, repo snapshots)
  • scripts/run-evals.sh -- thin runner script: reads tasks.jsonl, invokes agent, writes outputs
  • Taskfile.yml -- new eval:benchmark task that runs the full prepare/run/grade/render pipeline

Acceptance criteria

  • evals/shared-benchmark.json passes skill-benchmark validate
  • At least 5 tune cases covering the behavioral rules above
  • with_skill shows measurable lift over without_skill on at least 3 of the 5 cases
  • task eval:benchmark runs the full pipeline end to end
  • benchmark.json is produced and includes pass rates per case
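The lift criterion can be checked mechanically once per-case pass rates exist. The sketch below assumes a simple mapping from case id to pass rates per variant; that shape is hypothetical and is not the actual benchmark.json schema. The `min_lift` threshold is also an assumption, since the criterion above does not define what counts as measurable.

```python
def count_cases_with_lift(pass_rates: dict, min_lift: float = 0.0) -> int:
    """Count cases where with_skill beats without_skill by more than min_lift.

    pass_rates maps a case id to {"with_skill": float, "without_skill": float};
    this structure is an illustrative assumption.
    """
    lifted = 0
    for rates in pass_rates.values():
        lift = rates["with_skill"] - rates["without_skill"]
        if lift > min_lift:
            lifted += 1
    return lifted


# Made-up numbers, purely to show the call shape.
example = {
    "intent-gate": {"with_skill": 0.8, "without_skill": 0.2},
    "branch-policy": {"with_skill": 0.9, "without_skill": 0.5},
    "task-check": {"with_skill": 0.6, "without_skill": 0.1},
    "anomaly-first": {"with_skill": 0.5, "without_skill": 0.5},
    "vbrief-lifecycle": {"with_skill": 0.3, "without_skill": 0.4},
}
meets_criterion = count_cases_with_lift(example) >= 3
```

With repeated runs, a raw difference in pass rates can be noise, so a nonzero `min_lift` or a check against the harness's repeated-run statistics is probably the safer reading of "measurable".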

Related

Source

https://github.com/adewale/skill-eval-harness -- Python CLI for evaluating agent skills with paired with_skill/without_skill variants, anti-overfitting splits, repeated-run statistics, and deterministic grading. README: "Use it when you want to answer: does this skill improve the agent, where does it fail, and is the eval itself discriminating?"
