Phase 4: Benchmark evidence and regression gates #518

## Objective
Make quality claims evidence-based instead of anecdotal.

## Work
- Maintain a labeled benchmark corpus with recall fixtures and false-positive regression fixtures.
- Include security bugs, logic bugs, API-contract bugs, clean diffs, generated/noisy files, large diffs, and multi-file interaction cases.
- Track true positives (TP), false positives (FP), false negatives (FN), precision, recall, F1, recall@3/@5/@10, FP clean-rate, latency, and cost when available.
- Keep ambiguous-case research separate from production pass/fail gates.
- Require benchmark deltas before changing severity semantics, L2/L3 behavior, model pools, or large-diff strategy.
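
A minimal sketch of how the tracked metrics could be computed from labeled fixtures. All names here (`Counts`, `recall_at_k`, `fp_clean_rate`) are hypothetical and not taken from the codebase. The reading of "FP clean-rate" as the share of clean fixtures that produce zero findings is an assumption, as is treating an empty expectation set as perfect recall.

```python
from dataclasses import dataclass
from typing import Sequence, Set


@dataclass(frozen=True)
class Counts:
    """Confusion counts for one benchmark run (hypothetical structure)."""
    tp: int
    fp: int
    fn: int


def precision(c: Counts) -> float:
    # Share of reported findings that match a labeled bug.
    reported = c.tp + c.fp
    return c.tp / reported if reported else 0.0


def recall(c: Counts) -> float:
    # Share of labeled bugs that were reported.
    labeled = c.tp + c.fn
    return c.tp / labeled if labeled else 0.0


def f1(c: Counts) -> float:
    # Harmonic mean of precision and recall.
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def recall_at_k(ranked: Sequence[str], expected: Set[str], k: int) -> float:
    # Fraction of expected finding IDs present in the top-k ranked findings.
    # Assumption: a fixture with no expected findings counts as 1.0.
    if not expected:
        return 1.0
    return len(set(ranked[:k]) & expected) / len(expected)


def fp_clean_rate(findings_per_clean_fixture: Sequence[int]) -> float:
    # Assumed definition: share of clean fixtures that yield zero findings.
    if not findings_per_clean_fixture:
        return 1.0
    clean = sum(1 for n in findings_per_clean_fixture if n == 0)
    return clean / len(findings_per_clean_fixture)
```

Computing recall@3, @5, and @10 from the same ranked list keeps the three values comparable across runs, since only `k` changes.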

## Acceptance Gate
- Benchmark methodology and latest production candidate results are published in docs.
- High-severity seeded findings meet the agreed recall threshold before release.
- FP regression fixtures remain clean within the agreed tolerance.
- No accepted production candidate fabricates files, line ranges, or code quotes in benchmark output.
- CI blocks benchmark schema regressions and flags material quality regressions.
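
One way the gate above might be evaluated in CI, sketched under stated assumptions. The record fields, the `GateConfig` thresholds, and the split between blocking failures and non-blocking flags are all hypothetical. The real thresholds are whatever the team agrees on, and this issue does not state them.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical required keys for one benchmark result record.
REQUIRED_FIELDS = ("fixture_id", "tp", "fp", "fn")


@dataclass(frozen=True)
class GateConfig:
    min_high_severity_recall: float  # agreed recall threshold (placeholder)
    max_fp_fixture_failures: int     # agreed tolerance on FP fixtures (placeholder)
    max_f1_drop: float               # what counts as a "material" regression (placeholder)


@dataclass(frozen=True)
class RunSummary:
    high_severity_recall: float
    fp_fixture_failures: int
    fabricated_references: int  # fabricated files, line ranges, or code quotes
    f1: float


def schema_errors(record: Dict[str, object]) -> List[str]:
    # Schema regressions are treated as blocking: report every missing key.
    return [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]


def evaluate_gate(
    candidate: RunSummary, baseline: RunSummary, cfg: GateConfig
) -> Tuple[List[str], List[str]]:
    """Return (blocking, flagged) messages for a production candidate."""
    blocking: List[str] = []
    flagged: List[str] = []
    if candidate.high_severity_recall < cfg.min_high_severity_recall:
        blocking.append("high-severity recall below threshold")
    if candidate.fp_fixture_failures > cfg.max_fp_fixture_failures:
        blocking.append("FP regression fixtures exceed tolerance")
    if candidate.fabricated_references > 0:
        blocking.append("fabricated file, line range, or code quote")
    # Quality regressions are flagged for review rather than blocked outright.
    if baseline.f1 - candidate.f1 > cfg.max_f1_drop:
        flagged.append("material F1 regression versus baseline")
    return blocking, flagged
```

A CI job could fail the build when `blocking` or `schema_errors` is non-empty and post `flagged` as a review comment, which mirrors the block-versus-flag distinction in the last gate item.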

Source: `docs/PRODUCTION_READINESS_ROADMAP.md` Phase 4.
