
docs(patterns): evals/shared-benchmark.json as recommended artifact for directive-created projects with agent behavior #1587

@visionik

Summary

Projects that directive helps create and that involve agent behavior (skills, AGENTS.md customizations, prompt engineering, LLM-driven workflows) have no standard pattern for testing whether those behaviors actually work. The skill-eval-harness manifest format (evals/shared-benchmark.json) is the missing artifact: it gives any project a concrete, versioned way to test agent behavior with paired with_skill/without_skill comparisons, anti-overfitting splits, and deterministic grading.

Directive should recommend this as a standard project artifact for any project with agent-facing behavior, analogous to how tests/ is standard for any project with code.

The pattern

Just as directive recommends:

  • vbrief/ for planning and scope tracking
  • tests/ with >=85% coverage for code correctness
  • task check as the pre-commit gate

It should also recommend:

  • evals/shared-benchmark.json for agent behavior correctness

The ownership model is that each project owns its own manifest. The harness is external tooling (installed via uv tool install), not a dependency to vendor.

Where this belongs in directive

patterns/ai-product-authority.md (#1397) already establishes the six-component audit for AI products. The Evaluation component of that audit is: "how output gets scored against business rules (not benchmarks)." evals/shared-benchmark.json is the concrete artifact that fills the Evaluation component for agent-behavior-heavy projects.

A new section in patterns/ai-product-authority.md or a companion patterns/agent-evals.md should cover:

  • When to own a benchmark manifest (any project where agent behavior is a product feature)
  • The with_skill/without_skill comparison as the minimum viable eval
  • The three-split pattern (tune/holdout/holdback) and why holdback matters
  • The assertion types and when to use judge vs. objective assertions
  • The task eval:benchmark entry point pattern
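
To make the first three bullets concrete, here is a minimal Python sketch of a paired with_skill/without_skill comparison reported per split. It is an illustration under stated assumptions, not the skill-eval-harness implementation or its manifest schema: `Case`, `must_contain`, `compare`, and the two stub agents are all hypothetical names, and a single substring check stands in for the harness's objective assertions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Hypothetical case shape for illustration only; the real manifest
# schema in evals/shared-benchmark.json may look quite different.
@dataclass(frozen=True)
class Case:
    case_id: str
    split: str          # "tune", "holdout", or "holdback"
    prompt: str
    must_contain: str   # objective assertion: substring required in the output

Agent = Callable[[str], str]

def passes(case: Case, output: str) -> bool:
    """Deterministic grading: the same output always gets the same verdict."""
    return case.must_contain in output

def pass_rate(cases: List[Case], agent: Agent) -> float:
    """Fraction of cases whose output satisfies the assertion."""
    if not cases:
        return 0.0
    return sum(passes(c, agent(c.prompt)) for c in cases) / len(cases)

def compare(
    cases: List[Case],
    with_skill: Agent,
    without_skill: Agent,
    splits: Sequence[str] = ("tune", "holdout"),  # holdback is opt-in
) -> Dict[str, Dict[str, float]]:
    """Run both agent variants on each requested split and report the delta."""
    report: Dict[str, Dict[str, float]] = {}
    for split in splits:
        subset = [c for c in cases if c.split == split]
        w = pass_rate(subset, with_skill)
        wo = pass_rate(subset, without_skill)
        report[split] = {"with_skill": w, "without_skill": wo, "delta": w - wo}
    return report

# Stub agents standing in for real model calls (assumption: the skill
# under test makes the agent cite its sources).
def without_skill(prompt: str) -> str:
    return "Here is a summary."

def with_skill(prompt: str) -> str:
    return "Here is a summary.\nSources: internal-wiki"

CASES = [
    Case("c1", "tune", "Summarize the design doc", "Sources:"),
    Case("c2", "holdout", "Summarize the memo", "Sources:"),
    Case("c3", "holdback", "Summarize the report", "Sources:"),
]

report = compare(CASES, with_skill, without_skill)
for split, row in report.items():
    print(split, row)
```

In this sketch the holdback split is excluded unless requested explicitly (`splits=("holdback",)`), which is one way to keep those cases untouched while iterating on a skill so they remain a fair final check against overfitting. Whether the harness enforces the same default is not something this issue establishes.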

Affected files

Acceptance criteria

  • patterns/ai-product-authority.md references evals/shared-benchmark.json as the Evaluation component artifact
  • patterns/agent-evals.md or equivalent exists with the manifest pattern documented
  • A project using directive to build agent-facing features has a clear recommended path to eval coverage
  • The documentation references skill-eval-harness as the recommended external harness

Related

Source

https://github.com/adewale/skill-eval-harness: "Each skill repo owns an evals/shared-benchmark.json manifest" -- the ownership model where each project maintains its own benchmark.
