executor: reduce TestDistSQLSharedKVRequestRace iterations to fix CI timeout #67675

Open
joechenrh wants to merge 3 commits into pingcap:master from joechenrh:fix-distsql-race-test-timeout

Contributor

@joechenrh joechenrh commented Apr 10, 2026

What problem does this PR solve?

Issue Number: ref xxx

Problem Summary:
TestDistSQLSharedKVRequestRace frequently times out in CI (pull_unit_test_next_gen). With the race detector enabled, the test runs 5 replica-read modes × 20 iterations × 2 queries = 200 queries on a partitioned table. That takes ~278s on CI, which leaves only about 22s of headroom under the 5-minute (300s) Bazel "moderate" timeout, so slower runs time out intermittently (example).

What changed and how does it work?

Reduce the inner loop iterations from 20 to 5 (total queries: 200 → 50). The race detector catches data races deterministically on first occurrence, and the RequestBuilder.used safety check (added in #61376) catches any builder-reuse regression even with a single iteration. 5 iterations is more than sufficient for confidence.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

Summary by CodeRabbit

  • Tests
    • Reduced iteration count in a distributed SQL test to improve test run performance while preserving the same SQL checks and validation logic.

…timeout

The test runs 5 replica-read modes × 20 iterations × 2 queries = 200
queries with the race detector enabled. On CI this takes ~278s, barely
under the 5-minute Bazel "moderate" timeout, causing flaky timeouts.

Reduce iterations from 20 to 5. The race detector catches data races
deterministically on first occurrence, and the RequestBuilder.used
safety check catches any reuse regression even with a single iteration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@ti-chi-bot ti-chi-bot bot added the release-note-none Denotes a PR that doesn't merit a release note. label Apr 10, 2026

pantheon-ai bot commented Apr 10, 2026

Review Complete

Findings: 0 issues
Posted: 0
Duplicates/Skipped: 0

ℹ️ Learn more details on Pantheon AI.

@ti-chi-bot ti-chi-bot bot added needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. needs-cherry-pick-release-7.1 Should cherry pick this PR to release-7.1 branch. labels Apr 10, 2026

ti-chi-bot bot commented Apr 10, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign dveeden for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Apr 10, 2026

coderabbitai bot commented Apr 10, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 6914c4df-f924-4439-b86b-0e456ddf72ef

📥 Commits

Reviewing files that changed from the base of the PR and between e70180b and 39c3ec6.

📒 Files selected for processing (1)
  • pkg/executor/test/distsqltest/distsql_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/executor/test/distsqltest/distsql_test.go

📝 Walkthrough

Reduced the iteration count in TestDistSQLSharedKVRequestRace from 20 to 5; test SQL statements and assertions remain unchanged.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Test Optimization: pkg/executor/test/distsqltest/distsql_test.go | Lowered inner loop iterations in TestDistSQLSharedKVRequestRace from 20 to 5, reducing total query executions across replica read modes while keeping queries and checks identical. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Suggested labels

ok-to-test, approved, lgtm

Suggested reviewers

  • solotzg
  • gengliqi
  • wjhuang2016

Poem

🐰 Five hops now where twenty sprang,
Quiet paws and nimble wing,
The queries still bloom, assertions stay,
Faster runs through fields of May,
A rabbit cheers—small change, big zing! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and concisely describes the main change: reducing test iterations to fix a CI timeout issue. |
| Description check | ✅ Passed | The description comprehensively addresses the template requirements with clear problem statement, detailed solution explanation, and all checklist items properly addressed. |

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

Command failed




tiprow bot commented Apr 10, 2026

Hi @joechenrh. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


@pantheon-ai pantheon-ai bot left a comment

✅ Code looks good. No issues found.


codecov bot commented Apr 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.5996%. Comparing base (c2c1342) to head (39c3ec6).
⚠️ Report is 86 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #67675        +/-   ##
================================================
- Coverage   77.8173%   77.5996%   -0.2178%     
================================================
  Files          2023       1965        -58     
  Lines        556183     557000       +817     
================================================
- Hits         432807     432230       -577     
- Misses       121632     124739      +3107     
+ Partials       1744         31      -1713     
Flag Coverage Δ
integration 40.9370% <ø> (-7.1898%) ⬇️
unit 76.6496% <ø> (+0.2799%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 61.5065% <ø> (ø)
parser ∅ <ø> (∅)
br 50.0915% <ø> (-10.7722%) ⬇️

@ingress-bot

🔍 Starting code review for this PR...


@ingress-bot ingress-bot left a comment

This review was generated by AI and should be verified by a human reviewer.
Manual follow-up is recommended before merge.

Summary

  • Total findings: 3
  • Inline comments: 3
  • Summary-only findings (no inline anchor): 0
Findings (highest risk first)

🟡 [Minor] (3)

  1. Race regression test lost replay coverage after loop-count reduction (pkg/executor/test/distsqltest/distsql_test.go:138)
  2. Race reproducer coverage is reduced by cutting loop repetitions (pkg/executor/test/distsqltest/distsql_test.go:138)
  3. Race-check loop bound changed without intent documentation (pkg/executor/test/distsqltest/distsql_test.go:138)

  for _, mode := range replicaReadModes {
      tk.MustExec(fmt.Sprintf("set session tidb_replica_read = '%s'", mode))
-     for i := 0; i < 20; i++ {
+     for i := 0; i < 5; i++ {

🟡 [Minor] Race regression test lost replay coverage after loop-count reduction

Impact
TestDistSQLSharedKVRequestRace now runs each replica-read mode 5 times instead of 20, reducing stress executions from 200 to 50 across the two query paths.
This reduces replay/retry sampling for a schedule-sensitive race regression, so repeated runs no longer provide the prior detection confidence.

Scope

  • pkg/executor/test/distsqltest/distsql_test.go:138 (TestDistSQLSharedKVRequestRace)

Evidence
The changed loop bound is for i := 0; i < 5; i++, replacing the previous 20-iteration stress loop in TestDistSQLSharedKVRequestRace.
That loop wraps both force index(ic) and index-merge query checks for every tidb_replica_read mode, so each mode now executes only one quarter of the previous repetition count.

Change request
Restore the stress loop count to the previous level, or introduce an explicit deterministic stress knob with documented rationale for lower coverage.
Keep per-mode repeated executions high enough that race detection remains stable across reruns and scheduler variance.

  for _, mode := range replicaReadModes {
      tk.MustExec(fmt.Sprintf("set session tidb_replica_read = '%s'", mode))
-     for i := 0; i < 20; i++ {
+     for i := 0; i < 5; i++ {

🟡 [Minor] Race reproducer coverage is reduced by cutting loop repetitions

Impact
TestDistSQLSharedKVRequestRace is the regression guard for shared kv.Request race behavior from issue 60175, and this patch cuts repeated executions from 20 to 5 per replica-read mode.
The reduced repetition lowers scheduler interleaving coverage, allowing low-frequency race regressions to pass this guard.

Scope

  • pkg/executor/test/distsqltest/distsql_test.go:138 (TestDistSQLSharedKVRequestRace)

Evidence
In TestDistSQLSharedKVRequestRace, the inner loop now runs 5 iterations instead of 20 while executing the same two query paths each round.
The function comment marks this test as the regression check for https://github.com/pingcap/tidb/issues/60175, so reducing only the repetition count removes stress coverage without adding a deterministic trigger.

Change request
Restore the previous repetition budget or replace it with a deterministic concurrency trigger that guarantees the race window is exercised on every run.
If test time is the concern, keep a fast-path count here only with an additional stress variant that preserves equivalent race-detection strength in CI.

  for _, mode := range replicaReadModes {
      tk.MustExec(fmt.Sprintf("set session tidb_replica_read = '%s'", mode))
-     for i := 0; i < 20; i++ {
+     for i := 0; i < 5; i++ {

🟡 [Minor] Race-check loop bound changed without intent documentation

Impact
TestDistSQLSharedKVRequestRace is explicitly tied to issue 60175, but the new 5 iteration bound is an unexplained magic number in a stress-style test.
Without rationale for why this bound is sufficient, later edits can keep shrinking or reshaping the probe while the test name still implies strong race-coverage intent.

Scope

  • pkg/executor/test/distsqltest/distsql_test.go:138 (TestDistSQLSharedKVRequestRace)

Evidence
The diff changes the inner loop from for i := 0; i < 20; i++ to for i := 0; i < 5; i++ at line 138.
The nearby comments only label query forms (index lookup and index merge) and do not document the expected repetition invariant or the tradeoff behind the new bound.

Change request
Introduce an intent-revealing constant name for this loop bound and add a short comment explaining why the chosen count is sufficient for the 60175 regression guard.
Document the accepted coverage/performance tradeoff at this line so future maintainers can adjust it without guesswork.

@joechenrh joechenrh removed needs-cherry-pick-release-7.1 Should cherry pick this PR to release-7.1 branch. needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. labels Apr 10, 2026
joechenrh and others added 2 commits April 10, 2026 12:44
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ti-chi-bot bot commented Apr 10, 2026

[FORMAT CHECKER NOTIFICATION]

Notice: To remove the do-not-merge/needs-linked-issue label, please provide the linked issue number on one line in the PR body, for example: Issue Number: close #123 or Issue Number: ref #456.

📖 For more info, you can check the "Contribute Code" section in the development guide.


ti-chi-bot bot commented Apr 10, 2026

@joechenrh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-integration-realcluster-test-next-gen | 39c3ec6 | link | true | /test pull-integration-realcluster-test-next-gen |

Full PR test history. Your PR dashboard.



Labels

do-not-merge/needs-linked-issue release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
