Objective
Make quality claims evidence-based instead of anecdotal.
Work
- Maintain a labeled benchmark corpus with recall fixtures and false-positive regression fixtures.
- Include security bugs, logic bugs, API-contract bugs, clean diffs, generated/noisy files, large diffs, and multi-file interaction cases.
- Track TP, FP, FN, precision, recall, F1, recall@3/@5/@10, FP clean-rate, latency, and cost when available.
- Keep ambiguous-case research separate from production pass/fail gates.
- Require benchmark deltas before changing severity semantics, L2/L3 behavior, model pools, or large-diff strategy.
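
The tracked metrics could be computed from per-fixture results roughly as follows. This is a minimal sketch, not the project's actual harness: `FixtureResult`, its fields, and the function names are hypothetical, and the definition of FP clean-rate used here (share of clean fixtures that produce zero findings) is an assumption.

```python
from dataclasses import dataclass
from typing import FrozenSet, Sequence, Tuple


@dataclass(frozen=True)
class FixtureResult:
    """Hypothetical record for one benchmark fixture."""
    expected: FrozenSet[str]   # labeled finding IDs (empty for clean diffs)
    reported: Tuple[str, ...]  # reported finding IDs, best-ranked first


def confusion_counts(results: Sequence[FixtureResult]) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) summed over all fixtures."""
    tp = fp = fn = 0
    for r in results:
        reported = set(r.reported)
        tp += len(reported & r.expected)
        fp += len(reported - r.expected)
        fn += len(r.expected - reported)
    return tp, fp, fn


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard definitions; each value is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def recall_at_k(results: Sequence[FixtureResult], k: int) -> float:
    """Fraction of labeled findings that appear in each fixture's top-k reports."""
    found = total = 0
    for r in results:
        found += len(set(r.reported[:k]) & r.expected)
        total += len(r.expected)
    return found / total if total else 0.0


def fp_clean_rate(results: Sequence[FixtureResult]) -> float:
    """Assumed definition: share of clean fixtures with zero reported findings."""
    clean = [r for r in results if not r.expected]
    if not clean:
        return 1.0  # no clean fixtures, so nothing could be falsely flagged
    return sum(1 for r in clean if not r.reported) / len(clean)
```

Keeping the counts and the derived ratios in separate functions lets the same TP/FP/FN totals be reported raw and reused for any later metric without re-scanning the corpus.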
Acceptance Gate
- Benchmark methodology and latest production candidate results are published in docs.
- High-severity seeded findings meet the agreed recall threshold before release.
- FP regression fixtures remain clean within the agreed tolerance.
- No accepted production candidate fabricates files, line ranges, or code quotes in benchmark output.
- CI blocks benchmark schema regressions and flags material quality regressions.
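
One way the recall, clean-rate, and no-fabrication gates above might be checked in CI is sketched below. Everything here is illustrative: `Finding`, `is_grounded`, and `gate_passes` are hypothetical names, and the thresholds are required parameters because the agreed values are not stated in this document.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Finding:
    """Hypothetical shape of one reported finding."""
    path: str
    start_line: int  # 1-indexed, inclusive
    end_line: int    # 1-indexed, inclusive
    quote: str       # code the finding claims to quote


def is_grounded(finding: Finding, files: Dict[str, List[str]]) -> bool:
    """True if the file exists, the line range is valid, and the quote
    appears within the cited lines. An empty quote has nothing to verify."""
    lines = files.get(finding.path)
    if lines is None:
        return False  # fabricated file
    if not (1 <= finding.start_line <= finding.end_line <= len(lines)):
        return False  # fabricated or out-of-bounds line range
    cited = "\n".join(lines[finding.start_line - 1:finding.end_line])
    return finding.quote.strip() in cited


def gate_passes(
    high_severity_recall: float,
    clean_rate: float,
    findings: Iterable[Finding],
    files: Dict[str, List[str]],
    *,
    min_recall: float,      # agreed threshold, supplied by the caller
    min_clean_rate: float,  # agreed tolerance, supplied by the caller
) -> bool:
    """Combine the recall, clean-rate, and grounding checks into one verdict."""
    return (
        high_severity_recall >= min_recall
        and clean_rate >= min_clean_rate
        and all(is_grounded(f, files) for f in findings)
    )
```

Here `files` maps each path in the fixture to its list of lines, so the grounding check runs against the fixture contents rather than trusting anything in the model output.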
Source: docs/PRODUCTION_READINESS_ROADMAP.md Phase 4.