Skip to content

br: pre-lease log-backup storage and readLock leak fixes#67850

Open
RidRisR wants to merge 3 commits intopingcap:masterfrom
RidRisR:bugfix/merge-migrations-preserves-ingested-sst-paths
Open

br: pre-lease log-backup storage and readLock leak fixes#67850
RidRisR wants to merge 3 commits intopingcap:masterfrom
RidRisR:bugfix/merge-migrations-preserves-ingested-sst-paths

Conversation

@RidRisR
Copy link
Copy Markdown
Contributor

@RidRisR RidRisR commented Apr 17, 2026

What problem does this PR solve?

Issue Number: close #67819

Problem Summary:

This PR bundles two storage-leak fixes in the log-backup subsystem that must land before the lease-based lock expiration work can proceed. Both surface naturally during the same audit of what goes wrong when a process holding a log-backup lock fails to tear down cleanly.

Fix 1 — MergeMigrations drops layer IngestedSstPaths (br/pkg/stream/stream_metas.go)

MergeMigrations() appends m1.GetIngestedSstPaths() but silently drops m2.GetIngestedSstPaths(). The other repeated fields in the same function (Compactions, DestructPrefix) concat both sides, so this is an asymmetric copy-paste omission.

  • MergeAndMigrateTo (called during br log truncate) invokes MergeMigrations to fold each layer into the new BASE. Because m2's IngestedSstPaths are dropped, processExtFullBackup never sees layer-originated paths and cannot evaluate them against the Finished && GroupTS < TruncatedTo cleanup condition.
  • Every layer's corresponding v1/ext_backups/ directory becomes a permanent orphan on the log-backup storage — a steady-state storage leak.
  • PiTR's read path reads IngestedSstPaths per-layer via ListAll() (see br/pkg/restore/log_client/migration.go), so data consumption is unaffected today.
  • Under the forthcoming lease-based lock expiration mechanism, if a read-lock is auto-reclaimed before PiTR consumes these SSTs, a subsequent truncate's merge will silently drop the references — escalating this from "storage leak" to "PiTR cannot find required SSTs" (correctness issue).

Fix 2 — GetLockedMigrations leaks readLock on Load() error (br/pkg/restore/log_client/client.go)

GetLockedMigrations() acquires a read lock on v1/LOCK via GetReadLock, then calls ext.Load(ctx) to enumerate migrations. When Load() fails — for instance because a .mgrt file is corrupted, the migrations directory is transiently unreadable, or migIdOf cannot parse a name — the function returns the error without releasing the lock it just acquired, orphaning v1/LOCK.READ.* on remote storage until it is manually removed.

Once lease-based auto-reclamation lands, this path is worse: a caller that never receives the RemoteLock cannot register a renewal, so the orphaned lock never refreshes its ExpireAt — and the CLI escape hatch becomes the only recourse. Closing the leak now keeps cleanup honest even under today's no-expiration regime.

What changed and how does it work?

Fix 1: one-line append of m2.GetIngestedSstPaths() in MergeMigrations, symmetric with Compactions / DestructPrefix handling.

Behavior change note: after this fix, the first br log truncate run against a storage that previously suffered the drop will start correctly evaluating layer-originated IngestedSstPaths against processExtFullBackup's existing conditions (unchanged logic). Directories satisfying Finished=true && GroupTS < TruncatedTo get cleaned; others get carried into the new BASE. This does not retroactively recover orphan directories whose references were already lost in a historical merge — a separate scan tool is a follow-up independent task.

Other call sites of MergeMigrations audited:

  • MergeAndMigrateTo (stream_metas.go ~L940): the target path — now cleans up properly.
  • MergeToBy (stream_metas.go ~L820): only reachable through MergeTo which has no production callers today.
  • doTruncateLogs (stream_metas.go ~L1479): both r.NewBase and aOut.NewBase have empty IngestedSstPaths at this site — fix has no effect here.

Fix 2: add readLock.UnlockOnCleanUp(ctx) on the error path. Happy path unchanged.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

New tests:

  • TestMergeMigrationsPreservesIngestedSstPaths — direct primitive-level test; verified to FAIL on unfixed code, PASS after fix.
  • TestMergeAndMigrateToBoundsIngestedSstPathsOverTruncates — 10-round integration test proving that after the MergeMigrations fix, processExtFullBackup's pruning pipeline is reconnected: BASE's IngestedSstPaths stays bounded (≤1) and stale ext_backups/ directories are physically deleted. Also failed against unfixed code, passed after fix.
  • TestGetLockedMigrationsReleasesReadLockOnLoadError — injects a malformed migration file to force Load() failure, asserts no v1/LOCK.READ.* remains on storage. Failed against unfixed code (lingering v1/LOCK.READ.{random}), passed after fix.

Full br/pkg/stream and br/pkg/restore/log_client package sweeps (go test -tags=intest with failpoints enabled) stay green.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

br: fix two storage leaks in log backup — (1) `MergeMigrations` silently dropped `IngestedSstPaths` from layers, leaking `v1/ext_backups/` directories across `br log truncate` runs; (2) `GetLockedMigrations` orphaned the `v1/LOCK.READ.*` read lock when the migration `Load()` step failed. Both paths now clean up correctly.

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Fixed migration merging to correctly preserve all ingested SST path entries from source migrations; previously some paths were being lost during the merge process.
  • Tests

    • Added comprehensive test coverage for ingested SST path handling during migration operations and cleanup procedures.
    • Added test coverage for proper error handling and resource recovery in backup/restore operations.

@ti-chi-bot ti-chi-bot Bot added do-not-merge/needs-triage-completed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Apr 17, 2026
@pantheon-ai
Copy link
Copy Markdown

pantheon-ai Bot commented Apr 17, 2026

@RidRisR I've received your pull request and will start the review. I'll conduct a thorough review covering code quality, potential issues, and implementation details.

⏳ This process typically takes 10-30 minutes depending on the complexity of the changes.

ℹ️ Learn more details on Pantheon AI.

@ti-chi-bot ti-chi-bot Bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Apr 17, 2026
@tiprow
Copy link
Copy Markdown

tiprow Bot commented Apr 17, 2026

Hi @RidRisR. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 17, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 11874803-a6c7-4bdf-a0fe-f1cc8649b483

📥 Commits

Reviewing files that changed from the base of the PR and between 3007523 and 72d2b7f.

📒 Files selected for processing (3)
  • br/pkg/restore/log_client/client.go
  • br/pkg/restore/log_client/client_test.go
  • br/pkg/restore/log_client/export_test.go

📝 Walkthrough

Walkthrough

Fix MergeMigrations to preserve IngestedSstPaths from both input migrations; add tests validating SST-path preservation and cleanup across truncation rounds; update LogClient.GetLockedMigrations to use named returns and release read lock on load errors; add a storage-only test helper and a lock-release test.

Changes

Cohort / File(s) Summary
Stream merge fix
br/pkg/stream/stream_metas.go
Preserve IngestedSstPaths from both m1 and m2 when merging migrations (append m2's paths as well).
Stream tests
br/pkg/stream/stream_metas_test.go
Added tests: TestMergeMigrationsPreservesIngestedSstPaths and TestMergeAndMigrateToBoundsIngestedSstPathsOverTruncates to validate SST-path preservation and cleanup behavior.
Log client (implementation)
br/pkg/restore/log_client/client.go
Changed GetLockedMigrations signature to named returns and added deferred unlock that releases read lock on error paths; simplified return construction.
Log client tests & helpers
br/pkg/restore/log_client/client_test.go, br/pkg/restore/log_client/export_test.go
Added TestGetLockedMigrationsReleasesReadLockOnLoadError that verifies locks are cleaned up on load failure; added TEST_NewLogClientWithStorage test helper to construct a storage-only LogClient.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

component/br, component/test, size/M, ok-to-test

Suggested reviewers

  • YuJuncen
  • D3Hunter

Poem

🐇 I hopped through migrations, one and two,
I stitched their paths so none are lost, woohoo!
Locks now let go when errors appear,
Tests keep watch, and cleanup's clear.
A rabbit's cheer for fixes true! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'br: pre-lease log-backup storage and readLock leak fixes' is clear and specific, accurately describing the main changes: two storage leak fixes in log-backup subsystem.
Description check ✅ Passed The description fully complies with the template, providing issue number, detailed problem summary for both fixes, implementation details, comprehensive test coverage explanation, checklist completion, and a well-formed release note.
Linked Issues check ✅ Passed The PR directly addresses issue #67819 by implementing both required fixes: (1) appending m2.IngestedSstPaths in MergeMigrations to prevent storage leaks of ext_backups directories, and (2) releasing readLock on Load() errors via defer to prevent orphaned LOCK.READ.* entries.
Out of Scope Changes check ✅ Passed All code changes align with the stated objectives: MergeMigrations and readLock cleanup fixes in stream_metas.go and log_client/client.go; supporting tests; and helper function for test infrastructure. No unrelated modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.11.4)

Command failed


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

MergeMigrations only appended m1's IngestedSstPaths, silently dropping
m2's. During truncate (MergeAndMigrateTo -> processExtFullBackup), every
merged layer's ext_backups/ directory became invisible to the cleanup
logic and stayed on storage forever. Today this is a storage leak; once
lease-based lock expiration lands, an auto-reclaimed read lock followed
by a truncate can strand SSTs that PiTR still needs.

ref pingcap#67819
@RidRisR RidRisR force-pushed the bugfix/merge-migrations-preserves-ingested-sst-paths branch from ea6cd51 to 302c968 Compare April 17, 2026 08:51
New TestMergeAndMigrateToBoundsIngestedSstPathsOverTruncates simulates
N rounds of AppendMigration followed by MergeAndMigrateTo with ascending
TruncatedTo. Each round, the previous round's Finished group becomes
eligible for cleanup (GroupTS < TruncatedTo) while the current one stays.
Asserts BASE.IngestedSstPaths remains <= 1 and prior ext_backups
directories are physically deleted. This demonstrates the prior commit's
fix reconnects processExtFullBackup's pruning pipeline and that merge
does not accumulate paths unboundedly across truncate cycles.
@ti-chi-bot ti-chi-bot Bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 17, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 17, 2026

Codecov Report

❌ Patch coverage is 87.50000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 78.2898%. Comparing base (801b1d3) to head (72d2b7f).
⚠️ Report is 75 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #67850        +/-   ##
================================================
+ Coverage   78.0839%   78.2898%   +0.2059%     
================================================
  Files          1959       1983        +24     
  Lines        543377     558131     +14754     
================================================
+ Hits         424290     436960     +12670     
- Misses       118084     120132      +2048     
- Partials       1003       1039        +36     
Flag Coverage Δ
integration 44.4329% <62.5000%> (-2.1212%) ⬇️
unit 76.3018% <87.5000%> (-0.0698%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 61.4164% <ø> (ø)
parser ∅ <ø> (∅)
br 65.7482% <87.5000%> (-0.1038%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@ti-chi-bot ti-chi-bot Bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 17, 2026
@RidRisR RidRisR changed the title br/pkg/stream: preserve m2's IngestedSstPaths in MergeMigrations br: pre-lease log-backup storage and readLock leak fixes Apr 17, 2026
GetLockedMigrations acquires a read lock on v1/LOCK via GetReadLock
then calls Load() to enumerate migrations. When Load() fails (network
glitch, corrupted migration file, permission hiccup, etc.), the error
was returned directly without releasing the already-acquired lock,
orphaning v1/LOCK.READ.* on remote storage until it is manually
removed. This is the prerequisite flagged in the lease-based lock
expiration design: once auto-reclamation is in place, a read lock
that the holder never had a chance to register a renewal for will
stay orphaned until CLI cleanup, so closing this leak first keeps
failure handling honest even under the current no-expiration regime.

Use a named return + defer so any failure path after GetReadLock
cleans up the lock, not just the Load() branch. Happy path is
unchanged.

ref pingcap#67819
@RidRisR RidRisR force-pushed the bugfix/merge-migrations-preserves-ingested-sst-paths branch from 09acec9 to 72d2b7f Compare April 17, 2026 09:53
@RidRisR RidRisR requested a review from Leavrth April 17, 2026 09:57
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented Apr 17, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Leavrth

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot Bot added approved needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Apr 17, 2026
@ti-chi-bot
Copy link
Copy Markdown

ti-chi-bot Bot commented Apr 17, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-04-17 10:02:42.97693247 +0000 UTC m=+1728168.182292527: ☑️ agreed by Leavrth.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved do-not-merge/needs-triage-completed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

br: MergeMigrations drops IngestedSstPaths from layers, causing storage leak of ext_backups

2 participants