br: pre-lease log-backup storage and readLock leak fixes #67850
RidRisR wants to merge 3 commits into pingcap:master
MergeMigrations only appended m1's IngestedSstPaths, silently dropping m2's. During truncate (MergeAndMigrateTo -> processExtFullBackup), every merged layer's ext_backups/ directory became invisible to the cleanup logic and stayed on storage forever. Today this is a storage leak; once lease-based lock expiration lands, an auto-reclaimed read lock followed by a truncate can strand SSTs that PiTR still needs. ref pingcap#67819
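The omission described above can be illustrated with a minimal sketch. The `Migration` struct and `mergeMigrations` function below are hypothetical, simplified stand-ins and not the real types in br/pkg/stream; only the three repeated fields named in this PR are modelled, and the directory names are made up for illustration.

```go
package main

import "fmt"

// Migration is a hypothetical, simplified stand-in for the real migration
// message. Only the repeated fields named in the PR are modelled.
type Migration struct {
	Compactions      []string
	DestructPrefix   []string
	IngestedSstPaths []string
}

// mergeMigrations concatenates every repeated field from both inputs.
// The bug described above corresponds to leaving out the m2 line for
// IngestedSstPaths while keeping it for the other two fields.
func mergeMigrations(m1, m2 *Migration) *Migration {
	out := &Migration{}
	out.Compactions = append(out.Compactions, m1.Compactions...)
	out.Compactions = append(out.Compactions, m2.Compactions...)
	out.DestructPrefix = append(out.DestructPrefix, m1.DestructPrefix...)
	out.DestructPrefix = append(out.DestructPrefix, m2.DestructPrefix...)
	out.IngestedSstPaths = append(out.IngestedSstPaths, m1.IngestedSstPaths...)
	out.IngestedSstPaths = append(out.IngestedSstPaths, m2.IngestedSstPaths...) // the line the fix adds
	return out
}

func main() {
	// Directory names are illustrative only.
	base := &Migration{IngestedSstPaths: []string{"v1/ext_backups/a"}}
	layer := &Migration{IngestedSstPaths: []string{"v1/ext_backups/b"}}
	merged := mergeMigrations(base, layer)
	fmt.Println(merged.IngestedSstPaths) // prints [v1/ext_backups/a v1/ext_backups/b]
}
```

With the m2 line removed, the merged result would contain only the base's path, which is how a layer's directory drops out of sight of the cleanup logic.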
New TestMergeAndMigrateToBoundsIngestedSstPathsOverTruncates simulates N rounds of AppendMigration followed by MergeAndMigrateTo with ascending TruncatedTo. In each round, the previous round's Finished group becomes eligible for cleanup (GroupTS < TruncatedTo) while the current one stays. The test asserts that BASE.IngestedSstPaths never holds more than one entry and that prior ext_backups directories are physically deleted. This demonstrates that the prior commit's fix reconnects processExtFullBackup's pruning pipeline and that merging does not accumulate paths without bound across truncate cycles.
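The bounded-growth property the test checks can be sketched as a small simulation. Everything here is an assumption made for illustration: `ingestedGroup`, `prune`, and `simulate` are hypothetical names, the timestamps are arbitrary, and the real test exercises AppendMigration and MergeAndMigrateTo against storage rather than an in-memory slice.

```go
package main

import "fmt"

// ingestedGroup is a hypothetical stand-in for one IngestedSstPaths entry.
type ingestedGroup struct {
	Path     string
	GroupTS  uint64
	Finished bool
}

// prune applies the cleanup condition named in the PR:
// a group is removed when Finished && GroupTS < truncatedTo.
func prune(groups []ingestedGroup, truncatedTo uint64) (kept, removed []ingestedGroup) {
	for _, g := range groups {
		if g.Finished && g.GroupTS < truncatedTo {
			removed = append(removed, g)
		} else {
			kept = append(kept, g)
		}
	}
	return kept, removed
}

// simulate runs the given number of rounds. Each round appends one finished
// group and then truncates up to that group's own timestamp, so the current
// group stays while older ones become eligible for cleanup.
func simulate(rounds int) (maxKept, deleted int) {
	var base []ingestedGroup
	for i := 1; i <= rounds; i++ {
		ts := uint64(i * 10) // arbitrary ascending timestamps
		base = append(base, ingestedGroup{
			Path:     fmt.Sprintf("v1/ext_backups/round-%d", i), // illustrative name
			GroupTS:  ts,
			Finished: true,
		})
		var removed []ingestedGroup
		base, removed = prune(base, ts)
		deleted += len(removed)
		if len(base) > maxKept {
			maxKept = len(base)
		}
	}
	return maxKept, deleted
}

func main() {
	maxKept, deleted := simulate(10)
	fmt.Println(maxKept, deleted) // prints 1 9
}
```

If the merge step dropped each new group before `prune` could see it, nothing would ever be counted as deleted, which mirrors the leak the test guards against.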
GetLockedMigrations acquires a read lock on v1/LOCK via GetReadLock, then calls Load() to enumerate migrations. When Load() fails (network glitch, corrupted migration file, permission hiccup, etc.), the error was returned directly without releasing the already-acquired lock, orphaning v1/LOCK.READ.* on remote storage until it is manually removed. This is the prerequisite flagged in the lease-based lock expiration design: once auto-reclamation is in place, a read lock whose holder never got the chance to register a renewal stays orphaned until CLI cleanup, so closing this leak first keeps failure handling honest even under the current no-expiration regime. Use a named return plus defer so that any failure path after GetReadLock releases the lock, not just the Load() branch. The happy path is unchanged. ref pingcap#67819
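The named-return-plus-defer pattern mentioned above can be sketched as follows. This is not the actual change: `store`, `getReadLock`, `getLockedMigrations`, and the lock file name are hypothetical stand-ins, and remote storage is modelled as an in-memory map.

```go
package main

import (
	"errors"
	"fmt"
)

// errCorruptedMigration is an illustrative error used to force a load failure.
var errCorruptedMigration = errors.New("corrupted .mgrt file")

// store is a hypothetical in-memory stand-in for remote storage that only
// tracks which lock files currently exist.
type store struct{ locks map[string]bool }

// getReadLock creates a lock file and returns a function that removes it.
func (s *store) getReadLock(name string) func() {
	s.locks[name] = true
	return func() { delete(s.locks, name) }
}

// getLockedMigrations acquires the lock, then loads migrations. Because err
// is a named return, the deferred function sees the final error value and
// releases the lock on every failure path after acquisition.
func getLockedMigrations(s *store, load func() ([]string, error)) (migs []string, unlock func(), err error) {
	release := s.getReadLock("v1/LOCK.READ.example") // illustrative lock name
	defer func() {
		if err != nil {
			release()
		}
	}()

	migs, err = load()
	if err != nil {
		return nil, nil, fmt.Errorf("load migrations: %w", err)
	}
	return migs, release, nil
}

func main() {
	s := &store{locks: map[string]bool{}}
	_, _, err := getLockedMigrations(s, func() ([]string, error) {
		return nil, errCorruptedMigration
	})
	fmt.Println(err != nil, len(s.locks)) // prints true 0
}
```

Keeping the release function in a local variable, instead of reading it back from the named `unlock` result, matters here: the error return sets `unlock` to nil before the deferred function runs.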
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: Leavrth
What problem does this PR solve?
Issue Number: close #67819
Problem Summary:
This PR bundles two storage-leak fixes in the log-backup subsystem that must land before the lease-based lock expiration work can proceed. Both surface naturally during the same audit of what goes wrong when a process holding a log-backup lock fails to tear down cleanly.
Fix 1: MergeMigrations drops layer IngestedSstPaths (br/pkg/stream/stream_metas.go)
- MergeMigrations() appends m1.GetIngestedSstPaths() but silently drops m2.GetIngestedSstPaths(). The other repeated fields in the same function (Compactions, DestructPrefix) concatenate both sides, so this is an asymmetric copy-paste omission.
- MergeAndMigrateTo (called during br log truncate) invokes MergeMigrations to fold each layer into the new BASE. Because m2's IngestedSstPaths are dropped, processExtFullBackup never sees layer-originated paths and cannot evaluate them against the Finished && GroupTS < TruncatedTo cleanup condition.
- Each merged layer's v1/ext_backups/ directory becomes a permanent orphan on the log-backup storage, which is a steady-state storage leak.
- IngestedSstPaths are read per layer via ListAll() (see br/pkg/restore/log_client/migration.go), so data consumption is unaffected today.

Fix 2: GetLockedMigrations leaks readLock on Load() error (br/pkg/restore/log_client/client.go)
GetLockedMigrations() acquires a read lock on v1/LOCK via GetReadLock, then calls ext.Load(ctx) to enumerate migrations. When Load() fails (for instance because a .mgrt file is corrupted, the migrations directory is transiently unreadable, or migIdOf cannot parse a name), the function returns the error without releasing the lock it just acquired, orphaning v1/LOCK.READ.* on remote storage until it is manually removed.
Once lease-based auto-reclamation lands, this path is worse: a caller that never receives the RemoteLock cannot register a renewal, so the orphaned lock never refreshes its ExpireAt, and the CLI escape hatch becomes the only recourse. Closing the leak now keeps cleanup honest even under today's no-expiration regime.

What changed and how does it work?
Fix 1: one-line append of m2.GetIngestedSstPaths() in MergeMigrations, symmetric with the Compactions / DestructPrefix handling.
Behavior change note: after this fix, the first br log truncate run against a storage that previously suffered the drop will start correctly evaluating layer-originated IngestedSstPaths against processExtFullBackup's existing conditions (unchanged logic). Directories satisfying Finished=true && GroupTS < TruncatedTo get cleaned; others get carried into the new BASE. This does not retroactively recover orphan directories whose references were already lost in a historical merge; a separate scan tool is an independent follow-up task.
Other call sites of MergeMigrations audited:
- MergeAndMigrateTo (stream_metas.go ~L940): the target path; it now cleans up properly.
- MergeToBy (stream_metas.go ~L820): only reachable through MergeTo, which has no production callers today.
- doTruncateLogs (stream_metas.go ~L1479): both r.NewBase and aOut.NewBase have empty IngestedSstPaths at this site, so the fix has no effect here.
Fix 2: add readLock.UnlockOnCleanUp(ctx) on the error path. Happy path unchanged.

Check List
Tests
New tests:
- TestMergeMigrationsPreservesIngestedSstPaths: direct primitive-level test; verified to fail on unfixed code and pass after the fix.
- TestMergeAndMigrateToBoundsIngestedSstPathsOverTruncates: 10-round integration test proving that after the MergeMigrations fix, processExtFullBackup's pruning pipeline is reconnected: BASE's IngestedSstPaths stays bounded (≤1) and stale ext_backups/ directories are physically deleted. Also failed against unfixed code and passed after the fix.
- TestGetLockedMigrationsReleasesReadLockOnLoadError: injects a malformed migration file to force a Load() failure and asserts that no v1/LOCK.READ.* remains on storage. Failed against unfixed code (lingering v1/LOCK.READ.{random}) and passed after the fix.
Full br/pkg/stream and br/pkg/restore/log_client package sweeps (go test -tags=intest with failpoints enabled) stay green.
Side effects
Documentation
Release note