Phase 4: Benchmark evidence and regression gates #518

Description

@justn-hyeok

Objective

Make quality claims evidence-based instead of anecdotal.

Work

  • Maintain a labeled benchmark corpus with recall fixtures and false-positive regression fixtures.
  • Include security bugs, logic bugs, API-contract bugs, clean diffs, generated/noisy files, large diffs, and multi-file interaction cases.
  • Track TP, FP, FN, precision, recall, F1, recall@3/@5/@10, FP clean-rate, latency, and cost when available.
  • Keep ambiguous-case research separate from production pass/fail gates.
  • Require benchmark deltas before changing severity semantics, L2/L3 behavior, model pools, or large-diff strategy.
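
As a rough illustration of the metrics listed above, the sketch below scores one fixture by comparing the finding IDs a reviewer run reported against the labeled expected IDs. Every name here (`Metrics`, `score`, `recall_at_k`, string finding IDs) is hypothetical and not taken from the repository. Returning 0.0 when a denominator is zero is an assumed convention, not an agreed one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """Confusion counts for one fixture or an aggregate (hypothetical type)."""
    tp: int
    fp: int
    fn: int

    @property
    def precision(self) -> float:
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0  # assumed convention for 0/0

    @property
    def recall(self) -> float:
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0  # assumed convention for 0/0

    @property
    def f1(self) -> float:
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if (p + r) else 0.0


def score(expected: set[str], reported: list[str]) -> Metrics:
    """Count TP/FP/FN by matching reported finding IDs against labels."""
    reported_set = set(reported)
    return Metrics(
        tp=len(expected & reported_set),
        fp=len(reported_set - expected),
        fn=len(expected - reported_set),
    )


def recall_at_k(expected: set[str], ranked: list[str], k: int) -> float:
    """Fraction of expected findings that appear in the top-k ranked output."""
    if not expected:
        return 0.0  # undefined for clean fixtures; score those by FP count
    return len(expected & set(ranked[:k])) / len(expected)


if __name__ == "__main__":
    expected = {"sqli-1", "off-by-one-2", "contract-3"}
    ranked = ["sqli-1", "style-nit-9", "off-by-one-2"]
    m = score(expected, ranked)
    print(m, m.precision, m.recall, m.f1)
    print(recall_at_k(expected, ranked, 3))
```

In this sketch a clean-diff fixture is one with an empty `expected` set, so any reported finding counts as a false positive. How findings are actually matched to labels (by ID, file and line overlap, or something else) is left open by the issue.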

Acceptance Gate

  • Benchmark methodology and latest production candidate results are published in docs.
  • High-severity seeded findings meet the agreed recall threshold before release.
  • FP regression fixtures remain clean within the agreed tolerance.
  • No accepted production candidate fabricates files, line ranges, or code quotes in benchmark output.
  • CI blocks benchmark schema regressions and flags material quality regressions.
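
One possible shape for such a gate is sketched below: it checks high-severity recall, false positives on clean fixtures, and whether each finding points at a real file, an in-bounds line range, and a quote that actually occurs in that range. The `Finding` type, the function names, and the `min_recall` and `max_clean_fps` parameters are all assumptions made for illustration. The issue does not state the agreed thresholds, so callers must supply them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One reviewer finding (hypothetical shape, not the project's schema)."""
    path: str
    start_line: int  # 1-indexed, inclusive
    end_line: int    # 1-indexed, inclusive
    quote: str       # code the finding claims to quote; may be empty


def fabrication_errors(finding: Finding, files: dict[str, list[str]]) -> list[str]:
    """Return reasons the finding cites something absent from the fixture."""
    lines = files.get(finding.path)
    if lines is None:
        return [f"unknown file: {finding.path}"]
    where = f"{finding.path}:{finding.start_line}-{finding.end_line}"
    if not (1 <= finding.start_line <= finding.end_line <= len(lines)):
        return [f"line range out of bounds: {where}"]
    span = "\n".join(lines[finding.start_line - 1:finding.end_line])
    if finding.quote and finding.quote not in span:
        return [f"quote not found: {where}"]
    return []


def gate_failures(
    high_sev_recall: float,
    clean_fixture_fps: int,
    findings: list[Finding],
    files: dict[str, list[str]],
    *,
    min_recall: float,   # agreed threshold, supplied by the caller
    max_clean_fps: int,  # agreed tolerance, supplied by the caller
) -> list[str]:
    """Collect every gate violation; an empty list means the candidate passes."""
    failures: list[str] = []
    if high_sev_recall < min_recall:
        failures.append(f"high-severity recall {high_sev_recall:.2f} < {min_recall:.2f}")
    if clean_fixture_fps > max_clean_fps:
        failures.append(f"clean-fixture FPs {clean_fixture_fps} > {max_clean_fps}")
    for finding in findings:
        failures.extend(fabrication_errors(finding, files))
    return failures
```

A CI job could fail the build whenever `gate_failures` returns a non-empty list. Returning all violations rather than stopping at the first keeps the report useful when a candidate regresses in more than one way.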

Source: docs/PRODUCTION_READINESS_ROADMAP.md Phase 4.
