docs(patterns): evals/shared-benchmark.json as a recommended artifact for directive-created projects with agent behavior
Summary
Projects that directive helps create and that involve agent behavior (skills, AGENTS.md customizations, prompt engineering, LLM-driven workflows) have no standard pattern for testing whether those behaviors actually work. The skill-eval-harness manifest format (evals/shared-benchmark.json) is the missing artifact: it gives any project a concrete, versioned way to test agent behavior using paired with_skill/without_skill comparisons, anti-overfitting splits, and deterministic grading.
Directive should recommend this as a standard project artifact for any project with agent-facing behavior, analogous to how tests/ is standard for any project with code.
The pattern
Just as directive recommends:
- vbrief/ for planning and scope tracking
- tests/ with >=85% coverage for code correctness
- task check as the pre-commit gate
It should also recommend:
- evals/shared-benchmark.json for agent behavior correctness
The ownership model: each project owns its own manifest. The harness is external tooling installed with uv tool install, not a dependency to vendor into the project.
Where this belongs in directive
patterns/ai-product-authority.md (#1397) already establishes the six-component audit for AI products. The Evaluation component of that audit is: "how output gets scored against business rules (not benchmarks)." evals/shared-benchmark.json is the concrete artifact that fills the Evaluation component for agent-behavior-heavy projects.
A new section in patterns/ai-product-authority.md or a companion patterns/agent-evals.md should cover:
- When to own a benchmark manifest (any project where agent behavior is a product feature)
- The with_skill/without_skill comparison as the minimum viable eval
- The three-split pattern (tune/holdout/holdback) and why holdback matters
- The assertion types and when to use judge vs. objective assertions
- The task eval:benchmark entry point pattern
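To make the first three bullets concrete, here is a minimal Python sketch of a paired with_skill/without_skill comparison over deterministically assigned splits. It is an illustration under stated assumptions, not the skill-eval-harness API or manifest schema: the names Case, assign_split, pass_rate, and paired_delta are hypothetical, the 60/20/20 split ratio is arbitrary, and the reading of holdback as a split that is scored only rarely (so it stays uncontaminated by tuning) is an assumption about intent.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Sequence

SPLITS = ("tune", "holdout", "holdback")


def assign_split(case_id: str) -> str:
    """Map a case id to a split deterministically (hypothetical 60/20/20).

    Hashing the id means a case never moves between splits across runs,
    which keeps holdback cases out of the tuning loop.
    """
    bucket = int(hashlib.sha256(case_id.encode("utf-8")).hexdigest(), 16) % 10
    if bucket < 6:
        return "tune"
    if bucket < 8:
        return "holdout"
    return "holdback"


@dataclass(frozen=True)
class Case:
    case_id: str
    prompt: str
    must_contain: str  # objective assertion: plain substring check


Agent = Callable[[str], str]  # prompt in, agent output out


def pass_rate(cases: Sequence[Case], agent: Agent) -> float:
    """Fraction of cases whose objective assertion holds for this agent."""
    if not cases:
        return 0.0
    return sum(c.must_contain in agent(c.prompt) for c in cases) / len(cases)


def paired_delta(cases: Sequence[Case], with_skill: Agent, without_skill: Agent) -> float:
    """Minimum viable eval: same cases, two configurations, one difference."""
    return pass_rate(cases, with_skill) - pass_rate(cases, without_skill)


# Stub agents stand in for real runs; a real harness would invoke the agent.
def without_skill(prompt: str) -> str:
    return "Here is some text."


def with_skill(prompt: str) -> str:
    return "## Summary\n- item"


cases = [
    Case("c1", "Summarize the changelog", "## Summary"),
    Case("c2", "List the breaking changes", "- "),
]

delta = paired_delta(cases, with_skill, without_skill)  # 1.0 for these stubs
by_split = {s: [c for c in cases if assign_split(c.case_id) == s] for s in SPLITS}
```

The substring check stands in for an objective assertion; a judge assertion would replace it with a model-scored rubric, trading determinism for coverage of behavior that cannot be checked mechanically.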
Affected files
Acceptance criteria
- patterns/ai-product-authority.md references evals/shared-benchmark.json as the Evaluation component artifact
- patterns/agent-evals.md or equivalent exists with the manifest pattern documented
- A project using directive to build agent-facing features has a clear recommended path to eval coverage
- The documentation references skill-eval-harness as the recommended external harness
Related
Source
https://github.com/adewale/skill-eval-harness: "Each skill repo owns an evals/shared-benchmark.json manifest" -- the ownership model where each project maintains its own benchmark.