docs(patterns): evals/shared-benchmark.json as a recommended artifact for directive-created projects with agent behavior #1587

## Summary

Projects that directive helps create and that involve agent behavior (skills, AGENTS.md customizations, prompt engineering, LLM-driven workflows) have no standard pattern for testing whether those behaviors actually work. The skill-eval-harness manifest format (evals/shared-benchmark.json) is the missing artifact: it gives any project a concrete, versioned way to test agent behavior with paired with_skill/without_skill comparisons, anti-overfitting splits, and deterministic grading.

Directive should recommend this as a standard project artifact for any project with agent-facing behavior, analogous to how tests/ is standard for any project with code.
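
To make the shape of such a manifest concrete, here is a minimal sketch in Python. The field names (`cases`, `split`, `assertions`, `type`, `value`) and the `validate_manifest` helper are illustrative assumptions, not the actual skill-eval-harness schema; the harness repository is the authority on the real format.

```python
import json

# Hypothetical manifest shape. The real schema is defined by skill-eval-harness;
# every field name below is an assumption made for illustration.
ALLOWED_SPLITS = {"tune", "holdout", "holdback"}

manifest = {
    "version": 1,
    "cases": [
        {
            "id": "commit-message-basic",
            "split": "tune",
            "prompt": "Write a commit message for a README typo fix.",
            "assertions": [{"type": "contains", "value": "docs"}],
        },
        {
            "id": "commit-message-breaking",
            "split": "holdback",
            "prompt": "Write a commit message for a breaking API change.",
            "assertions": [{"type": "contains", "value": "BREAKING"}],
        },
    ],
}


def validate_manifest(data: dict) -> list[str]:
    """Return a list of problems; an empty list means no problems were found."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(data.get("cases", [])):
        case_id = case.get("id", f"<case {index}>")
        if case_id in seen_ids:
            problems.append(f"{case_id}: duplicate id")
        seen_ids.add(case_id)
        if case.get("split") not in ALLOWED_SPLITS:
            problems.append(f"{case_id}: unknown split {case.get('split')!r}")
        if not case.get("assertions"):
            problems.append(f"{case_id}: no assertions")
    return problems


if __name__ == "__main__":
    print(json.dumps(manifest, indent=2))
    print(validate_manifest(manifest))
```

Because the manifest is plain JSON checked into the project, it can be reviewed and versioned like any other test fixture.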

## The pattern

Just as directive recommends:
- vbrief/ for planning and scope tracking
- tests/ with >=85% coverage for code correctness
- task check as the pre-commit gate

It should also recommend:
- evals/shared-benchmark.json for agent behavior correctness

Ownership model: each project owns its own manifest. The harness is external tooling, installed with uv tool install, not a dependency to vendor into the project.

## Where this belongs in directive

patterns/ai-product-authority.md (#1397) already establishes the six-component audit for AI products. The Evaluation component of that audit is: "how output gets scored against business rules (not benchmarks)." evals/shared-benchmark.json is the concrete artifact that fills the Evaluation component for agent-behavior-heavy projects.

A new section in patterns/ai-product-authority.md or a companion patterns/agent-evals.md should cover:
- When to own a benchmark manifest (any project where agent behavior is a product feature)
- The with_skill/without_skill comparison as the minimum viable eval
- The three-split pattern (tune/holdout/holdback) and why holdback matters
- The assertion types and when to use judge vs. objective assertions
- The task eval:benchmark entry point pattern
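
The paired comparison and the split policy can be sketched as follows. This is not the harness's implementation: the `contains`/`not_contains` assertion types, the `check_objective` and `skill_lift_by_split` helpers, and the sample outputs are all hypothetical. The sketch also assumes that tune is the split you iterate against and that holdback is kept out of day-to-day iteration so it can reveal overfitting. Judge-based assertions are out of scope here; only objective checks that grade deterministically are shown.

```python
from collections import defaultdict

ARMS = ("with_skill", "without_skill")


def check_objective(assertion: dict, output: str) -> bool:
    """Grade one objective assertion deterministically (hypothetical types)."""
    kind = assertion["type"]
    if kind == "contains":
        return assertion["value"] in output
    if kind == "not_contains":
        return assertion["value"] not in output
    raise ValueError(f"unsupported assertion type: {kind}")


def skill_lift_by_split(cases: list[dict], outputs: dict) -> dict[str, float]:
    """Per split, return pass rate with the skill minus pass rate without it.

    `outputs` maps (case_id, arm) to the text the agent produced for that arm.
    """
    passed = defaultdict(lambda: dict.fromkeys(ARMS, 0))
    counts = defaultdict(int)
    for case in cases:
        counts[case["split"]] += 1
        for arm in ARMS:
            text = outputs[(case["id"], arm)]
            if all(check_objective(a, text) for a in case["assertions"]):
                passed[case["split"]][arm] += 1
    return {
        split: (passed[split]["with_skill"] - passed[split]["without_skill"])
        / counts[split]
        for split in counts
    }


# Made-up example data, for illustration only.
cases = [
    {"id": "a", "split": "tune",
     "assertions": [{"type": "contains", "value": "feat:"}]},
    {"id": "b", "split": "tune",
     "assertions": [{"type": "contains", "value": "fix:"}]},
    {"id": "c", "split": "holdback",
     "assertions": [{"type": "not_contains", "value": "WIP"}]},
]
outputs = {
    ("a", "with_skill"): "feat: add export command",
    ("a", "without_skill"): "feat: added export",
    ("b", "with_skill"): "fix: handle empty input",
    ("b", "without_skill"): "Fixed a bug",
    ("c", "with_skill"): "docs: update install guide",
    ("c", "without_skill"): "WIP docs",
}

if __name__ == "__main__":
    print(skill_lift_by_split(cases, outputs))
```

A lift that shows up on tune but vanishes on holdback would suggest the skill was fitted to the tuning cases rather than improving behavior in general, which is the reason for keeping a separate holdback split.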

## Affected files

- patterns/ai-product-authority.md (#1397) -- add evals/shared-benchmark.json as the Evaluation component artifact
- New: patterns/agent-evals.md -- full pattern documentation with manifest format, split policy, assertion guidance, and harness install instructions
- docs/CONCEPTS.md -- reference agent evals as a testing practice alongside unit/integration tests

## Acceptance criteria

- patterns/ai-product-authority.md references evals/shared-benchmark.json as the Evaluation component artifact
- patterns/agent-evals.md or equivalent exists with the manifest pattern documented
- A project using directive to build agent-facing features has a clear recommended path to eval coverage
- The documentation references skill-eval-harness as the recommended external harness
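
The file-level criteria above could be checked mechanically. The following is a rough sketch only: the `check_docs` helper is hypothetical, it assumes the companion file is named patterns/agent-evals.md rather than an equivalent, and it treats a plain substring match as a "reference". The third criterion (a clear recommended path to eval coverage) needs human review and is not covered.

```python
from pathlib import Path

ARTIFACT = "evals/shared-benchmark.json"
HARNESS = "skill-eval-harness"


def check_docs(repo_root: Path) -> dict[str, bool]:
    """Check the file-level acceptance criteria against a repository checkout."""
    authority = repo_root / "patterns" / "ai-product-authority.md"
    agent_evals = repo_root / "patterns" / "agent-evals.md"  # assumed file name

    def mentions(path: Path, needle: str) -> bool:
        # A missing file counts as "does not mention".
        return path.is_file() and needle in path.read_text(encoding="utf-8")

    return {
        "authority_references_artifact": mentions(authority, ARTIFACT),
        "agent_evals_exists": agent_evals.is_file(),
        "harness_referenced": mentions(agent_evals, HARNESS),
    }


if __name__ == "__main__":
    for criterion, ok in check_docs(Path(".")).items():
        print(f"{'PASS' if ok else 'FAIL'}  {criterion}")
```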

## Related

- Depends on #1584 (directive's own benchmark manifest -- directive dogfoods this pattern itself)
- Extends #1397 (AI product authority -- this closes the Evaluation component gap)
- Related to #1395 (vbrief:verify-ac -- AC verification is spec-level; agent evals are behavior-level; both are needed)

## Source

https://github.com/adewale/skill-eval-harness: "Each skill repo owns an evals/shared-benchmark.json manifest" -- the ownership model where each project maintains its own benchmark.
