br/pkg/stream/crr/internal/checkpoint: stabilize flaky TestCheckpointCalculatorRandomizedCRRSimulation by flaky-claw · Pull Request #67826 · pingcap/tidb

flaky-claw · 2026-04-16T12:19:11Z

What problem does this PR solve?

Issue Number: close #67699

Problem Summary:
The test TestCheckpointCalculatorRandomizedCRRSimulation in br/pkg/stream/crr/internal/checkpoint fails intermittently in CI. This PR stabilizes it.

What changed and how does it work?

Root Cause

TestCheckpointCalculatorRandomizedCRRSimulation was flaky because unconditional per-round info logging made the randomized test exceed CI slow-test thresholds under scheduler pressure; this was a test-side timing issue, not a product bug.

Fix

The patch removes only the non-essential round trace from normal runs by making it opt-in through TIDB_RANDOMIZED_CRR_SIM_ROUND_LOG, which cuts timing variance without weakening any assertions.

Verification

Spec:

target: br/pkg/stream/crr/internal/checkpoint :: TestCheckpointCalculatorRandomizedCRRSimulation
strategy: tidb.go_flaky.default
plan mode: BASELINE_ONLY
requirements: required case must execute; no skip; repeat count = 1
baseline gates: required_flaky_gate, build_safety_gate, intent_guard_gate

Observed result:

status: passed
required case executed: yes
submission decision: ALLOWED
scope debt present: yes

Gate checklist:

Required flaky gate: PASS
Build safety gate: PASS
Intent guard gate: PASS
Repo-wide advisory gate: SKIPPED
Feedback specific gate: SKIPPED

Commands:

go test -json ./br/pkg/stream/crr/internal/checkpoint -run '^TestCheckpointCalculatorRandomizedCRRSimulation$' -count=1
go test -json ./br/pkg/stream/crr/internal/checkpoint -count=1
make build

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No need to test
- I checked and no code files have been changed.

Side effects

Performance regression: Consumes more CPU
Performance regression: Consumes more Memory
Breaking backward compatibility

Documentation

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Fixes #67699

Summary by CodeRabbit

Tests
- Added environment variable control for per-round test logging, allowing verbose output to be toggled to reduce log noise during test runs.
- Reduced randomized simulation iterations (from 1000 to 300) to shorten test execution time and make runs faster and more practical for routine testing.

pantheon-ai · 2026-04-16T12:19:17Z

@flaky-claw I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.


tiprow · 2026-04-16T12:19:29Z

Hi @flaky-claw. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

coderabbitai · 2026-04-16T12:19:35Z

📝 Walkthrough

Walkthrough

Replaced unconditional per-round logging with an environment-variable guard (TIDB_RANDOMIZED_CRR_SIM_ROUND_LOG) and reduced randomizedCRRSimulationConfig.Iterations from 1000 to 300 in the randomized CRR simulation test file (imports updated to include os).

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Randomized CRR simulation test `br/pkg/stream/crr/internal/checkpoint/randomized_integration_test.go` | Added `os` import; wrapped per-round `s.log(...)` with a check of `TIDB_RANDOMIZED_CRR_SIM_ROUND_LOG`; reduced `randomizedCRRSimulationConfig.Iterations` from `1000` to `300`. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

size/S, ok-to-test, lgtm

Suggested reviewers

YuJuncen
Leavrth

Poem

🐰 I bounded through tests with a curious hop,

Logs muffled by an env var—no nonstop—
Rounds trimmed to three hundred, breezy and light,
Quiet as clover, yet ready to write. 🥕✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately describes the main change: stabilizing a flaky test in the specified package by addressing timing issues. |
| Description check | ✅ Passed | The PR description follows the template with issue reference, problem summary, detailed explanation of changes, and completed checklist. |
| Linked Issues check | ✅ Passed | The PR directly addresses issue `#67699` by implementing stabilization measures: making per-round logging opt-in via environment variable and reducing iteration count. |
| Out of Scope Changes check | ✅ Passed | All changes are within scope: logging modification in randomizedCRRSimulation.runRound, os import addition, and iteration count reduction directly address test flakiness. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

Command failed

codecov · 2026-04-16T12:37:17Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 78.5801%. Comparing base (65d9fb6) to head (7b2c600).
⚠️ Report is 83 commits behind head on master.

Additional details and impacted files

@@               Coverage Diff                @@
##             master     #67826        +/-   ##
================================================
+ Coverage   77.5964%   78.5801%   +0.9837%     
================================================
  Files          1982       1993        +11     
  Lines        548885     562697     +13812     
================================================
+ Hits         425915     442168     +16253     
+ Misses       122165     118887      -3278     
- Partials        805       1642       +837

| Flag | Coverage Δ | |
| --- | --- | --- |
| integration | `45.4037% <ø> (+11.0637%)` | ⬆️ |
| unit | `76.6583% <ø> (+0.3178%)` | ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Components | Coverage Δ | |
| --- | --- | --- |
| dumpling | `60.4888% <ø> (-0.9277%)` | ⬇️ |
| parser | `∅ <ø> (∅)` | |
| br | `66.0316% <ø> (+5.5073%)` | ⬆️ |

YuJuncen

Better to also reduce test rounds.

pingcap-cla-assistant · 2026-04-22T10:53:35Z

All committers have signed the CLA.

yinsustart · 2026-04-22T10:57:44Z

@YuJuncen, I reduced Iterations to 300 because this test still needs enough rounds to exercise at least one full catch-up cycle. With the current config, CatchUpEvery is 299, so lowering iterations to 100 can skip catch-up entirely and leave SyncedTS at 0, causing the randomized simulation to fail for some seeds. 300 is the minimum value assuming no other changes to the logic.
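
The arithmetic behind that lower bound can be sketched as below. The trigger rule is an assumption chosen to be consistent with the comment (rounds numbered from 0, with a catch-up on every round r > 0 where r % CatchUpEvery == 0); the actual condition in the test may differ, and `catchUpCycles` is a hypothetical helper, not a function from the repository.

```go
package main

import "fmt"

// catchUpCycles counts the rounds that would trigger a catch-up.
// Assumption: rounds are numbered from 0 and a catch-up fires on every
// round r > 0 with r % catchUpEvery == 0. The real trigger may differ.
func catchUpCycles(iterations, catchUpEvery int) int {
	if iterations <= 0 || catchUpEvery <= 0 {
		return 0
	}
	// Rounds run from 0 to iterations-1, so the largest multiple of
	// catchUpEvery that can be reached is bounded by iterations-1.
	return (iterations - 1) / catchUpEvery
}

func main() {
	for _, iterations := range []int{100, 299, 300, 1000} {
		fmt.Printf("Iterations=%d -> %d catch-up cycle(s)\n",
			iterations, catchUpCycles(iterations, 299))
	}
}
```

Under that assumption, 100 and 299 iterations never reach a catch-up, 300 reaches exactly one, and the original 1000 reaches three.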

coderabbitai

🧹 Nitpick comments (1)

br/pkg/stream/crr/internal/checkpoint/randomized_integration_test.go (1)
37-38: Avoid weakening the randomized coverage unless the logging fix is insufficient.

With CatchUpEvery: 299, Iterations: 300 exercises the catch-up block only once. Since the reported root cause is per-round logging, consider keeping the original iteration count so this PR stays focused on removing the timing overhead without reducing simulation depth.
Proposed adjustment
 	cfg := randomizedCRRSimulationConfig{
-		Iterations:                        300,
+		Iterations:                        1000,
As per coding guidelines, keep test changes minimal and deterministic; avoid broad test churn unless required.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@br/pkg/stream/crr/internal/checkpoint/randomized_integration_test.go` around
lines 37 - 38, The test weakens randomized coverage by setting CatchUpEvery: 299
while keeping Iterations: 300 so the catch-up path runs only once; update the
randomizedCRRSimulationConfig so Iterations is restored to its original/higher
value (or at least substantially > CatchUpEvery, e.g., >= CatchUpEvery*3) to
ensure multiple catch-up rounds are exercised, keeping the rest of the
logging/timing change minimal and deterministic.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: cb609bdd-0095-4666-8a72-da0fed8b2ead

📥 Commits

Reviewing files that changed from the base of the PR and between 743fe5b and 7b2c600.

📒 Files selected for processing (1)

br/pkg/stream/crr/internal/checkpoint/randomized_integration_test.go

yinsustart · 2026-04-28T04:33:49Z

/check-issue-triage-complete

ti-chi-bot · 2026-04-28T05:00:48Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Leavrth, YuJuncen

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~br/OWNERS~~ [Leavrth,YuJuncen]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot · 2026-04-28T05:00:51Z

[LGTM Timeline notifier]

Timeline:

2026-04-17 09:08:48.920856352 +0000 UTC m=+1724934.126216409: ☑️ agreed by YuJuncen.
2026-04-28 05:00:49.831256768 +0000 UTC m=+2660455.036616815: ☑️ agreed by Leavrth.

yinsustart · 2026-05-06T01:46:00Z

/retest

tiprow · 2026-05-06T01:58:53Z

@yinsustart: PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test.

Details

In response to this:

/retest


hawkingrei · 2026-05-06T02:02:37Z

/retest

hawkingrei · 2026-05-06T02:14:38Z

/retest

hawkingrei · 2026-05-06T03:20:37Z

/retest

hawkingrei · 2026-05-06T03:44:38Z

/retest

hawkingrei · 2026-05-06T03:56:36Z

/retest

hawkingrei · 2026-05-06T04:08:46Z

/retest

hawkingrei · 2026-05-06T04:20:38Z

/retest

hawkingrei · 2026-05-06T04:32:40Z

/retest

tiprow · 2026-05-06T04:34:23Z

@flaky-claw: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| fast_test_tiprow | `7b2c600` | link | true | `/test fast_test_tiprow` |

Full PR test history. Your PR dashboard.

fix: stabilize flaky issue pingcap#67699

743fe5b

ti-chi-bot Bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/needs-triage-completed labels Apr 16, 2026

ti-chi-bot Bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 16, 2026

YuJuncen approved these changes Apr 17, 2026

View reviewed changes

ti-chi-bot Bot added approved needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Apr 17, 2026

br/pkg/stream/crr/internal/checkpoint: reduce randomized test rounds

7b2c600

coderabbitai Bot reviewed Apr 22, 2026

View reviewed changes

ti-chi-bot Bot removed the do-not-merge/needs-triage-completed label Apr 28, 2026

Leavrth approved these changes Apr 28, 2026

View reviewed changes

ti-chi-bot Bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Apr 28, 2026

ti-chi-bot Bot merged commit 7cc8b15 into pingcap:master May 6, 2026
39 of 40 checks passed
