refactor(sandbox-manager): migrate cache layer to controller-runtime architecture#287
Conversation
2e4ca01 to
2d56f8c
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #287 +/- ##
==========================================
+ Coverage 74.03% 74.58% +0.55%
==========================================
Files 128 141 +13
Lines 9488 9835 +347
==========================================
+ Hits 7024 7335 +311
- Misses 2139 2189 +50
+ Partials 325 311 -14
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
e0edb8e to
1e9d2a2
Compare
1e9d2a2 to
c81f7ba
Compare
|
@furykerry DON'T PANIC, a lot of Apache license headers have just been added |
|
@codex review |
911ccf4 to
6ae53a1
Compare
furykerry
left a comment
There was a problem hiding this comment.
Code Review: PR #287 — Migrate sandbox-manager cache layer to controller-runtime architecture
Review session: 2026-04-27T10:30
Scope: 218 files, +11,162 / -4,289 lines
Critical Issues (Must Fix)
1. DefaultUnsafeDisableDeepCopy: ptr.To(true) — latent data race in all cache read paths
- Location:
pkg/cache/cache.go(NewControllerManager, around L135 in the patch) - Issue: This disables deep-copy for ALL cached objects globally. Every
GetClaimedSandbox,GetCheckpoint,PickSandboxSet,ListSandboxesInPoolreturns pointers directly into the informer store. Combined withsingleflight.Group, concurrent callers receive the same pointer without deep-copy guarantees. - Why: Any mutation of returned objects (even well-intentioned) corrupts the cache for all other consumers. The
retryUpdatefunction insandbox.godoescopied := sbx.DeepCopy()but this is not consistently enforced across all callers. This is a production data-corruption-landmine. - Fix: Either:
- (a) Remove
DefaultUnsafeDisableDeepCopyentirely and rely on controller-runtime's default deep-copy behavior, OR - (b) Keep it but ensure every read method performs an explicit deep-copy before returning, and document that callers must treat returned objects as read-only.
- (a) Remove
2. Cache.Run swallows manager start errors — silent failure path
- Location:
pkg/cache/cache.go(Run method, around L190 in the patch) - Issue:
mgr.Start()is launched in a goroutine and its error is only logged, never returned to the caller. If the manager fails to start (beforeWaitForCacheSyncchecks), the caller seeserr == niland proceeds as if everything is operational. - Why: A partially failed cache will cause cascading failures downstream with confusing errors, or worse, silently return stale/empty data from an unpopulated informer store.
- Fix: Use an errgroup or a channel to propagate
mgr.Start()errors synchronously, or check that the manager actually started before returning nil:
startErrCh := make(chan error, 1)
go func() {
startErrCh <- c.mgr.Start(mgrCtx)
}()
cache := c.mgr.GetCache()
if cache != nil && !cache.WaitForCacheSync(ctx) {
return fmt.Errorf("timed out waiting for caches to sync")
}
select {
case err := <-startErrCh:
if err != nil {
return fmt.Errorf("controller manager failed to start: %w", err)
}
default:
}
return nil3. Wait hook reuse silently ignores second caller's satisfiedFunc
- Location:
pkg/cache/utils/wait.go(WaitForObjectSatisfied, around L116-L138 in the patch) - Issue: When
LoadOrStorefinds an existing entry with the sameaction, the code reuses it. But the existing entry has a differentsatisfiedFuncfrom the current caller. The second caller's condition may never be checked, causing the second wait to either never complete or complete on the wrong condition. - Why: Two concurrent waits on the same sandbox with the same action (e.g., both
WaitReady) but different conditions (e.g., one checksPhase == Running, another checksPodIP != "") — the second caller gets the first caller's checker. This is a correctness bug. - Fix: Compare the
satisfiedFunc(or a hash) when reusing an entry, or return an error when an existing entry has a different checker:
if evaluateIfDifferent(entry.checker, satisfiedFunc) {
return fmt.Errorf("another wait entry exists with different checker")
}Warnings (Should Fix)
4. AddIndexesToCache uses context.Background() — should be parameterized
- Location:
pkg/cache/index.go(AddIndexesToCache, around L150 in the patch) - Issue: Index field registration uses hardcoded
context.Background(). - Recommendation: Accept a
ctx context.Contextparameter so callers can control cancellation/timeout:
func AddIndexesToCache(ctx context.Context, c ctrlcache.Cache) error {5. newSbx == nil — dead code in InplaceRefresh
- Location:
pkg/sandbox-manager/infra/sandboxcr/sandbox.go(InplaceRefresh method) - Issue:
newSbxis always&agentsv1alpha1.Sandbox{}so the nil check is unreachable. If the API reader path fails,newSbxis still non-nil but empty — the code should checkerr != nilinstead. - Recommendation: Remove the dead nil check and handle the error from the API reader fallback explicitly.
6. singleflight.Group in cache read methods — unintended side effects with UnsafeDisableDeepCopy
- Location:
pkg/cache/cache.go(GetClaimedSandbox, GetCheckpoint, PickSandboxSet, ListSandboxesInPool) - Issue:
singleflight.Groupreturns a shared result to concurrent callers. Without deep-copy, multiple callers hold the same pointer into the store. If one caller mutates it, all others see the mutation. - Recommendation: If
UnsafeDisableDeepCopyis kept, add explicit deep-copies before returning from cache read methods.
7. License header year inconsistency across new files
- Location: Multiple new files in
pkg/cache/ - Issue: Some files use
Copyright 2025(e.g.,cache.go,index.go,controllers/cache_controllers.go), others useCopyright 2026(e.g.,interface.go,utils/wait.go). Changes to existing files correctly addedCopyright 2026. - Recommendation: Standardize all new-file license headers to
Copyright 2026for consistency.
8. ResourceVersionInterceptorFuncs — unnecessary DeepCopyObject() before Get
- Location:
pkg/cache/utils/client.go(ResourceVersionInterceptorFuncs, around L40 in the patch) - Issue:
latest := obj.DeepCopyObject().(ctrlclient.Object)allocates a full copy that is immediately overwritten byclient.Get(). - Recommendation: Remove the deep-copy before Get:
Update: func(ctx context.Context, client ctrlclient.WithWatch, obj ctrlclient.Object, opts ...ctrlclient.UpdateOption) error {
// Create an empty object to receive the latest version
latest := obj.DeepCopyObject().(ctrlclient.Object)
// Set empty resourceVersion so Get can fill it in
latest.SetResourceVersion("")
if err := client.Get(ctx, types.NamespacedName{Name: obj.GetName(), Namespace: obj.GetNamespace()}, latest); err == nil {
obj.SetResourceVersion(latest.GetResourceVersion())
}
return client.Update(ctx, obj, opts...)
},Suggestions (Consider Improving)
getTemplateFromSandboxduplication: [pkg/cache/index.go] duplicates the logic from [pkg/sandbox-manager/infra/sandboxcr/infra.go]. The comment acknowledges the circular import avoidance, but consider extracting topkg/utils/sandboxutilsas a shared function.- Hardcoded 30s post-resume timeout: In [
pkg/sandbox-manager/infra/sandboxcr/sandbox.go] Resume method, thecontext.WithTimeout(context.Background(), 30*time.Second)is arbitrary. Consider making this a named constant or config option. - 570-line
wait_test.go: [pkg/cache/utils/wait_test.go] could be split by concern (WaitHookKey tests, WaitEntry tests, WaitForObjectSatisfied tests) for better readability. HasTemplatechanged from lock-freesync.Mapto informer read: Old implementation tracked SandboxSet existence via async.Mapupdated by event handlers. New implementation callsPickSandboxSet(which goes through singleflight) on everyHasTemplatecall. If this is a hot path, consider caching the result.- Generated files check:
client/has been modified (generic_client.go,registry.go) — verify these are intentionally edited or whether they should be regenerated viamake generate.
Positive Findings
- Clean interface extraction:
cache.Providerin [pkg/cache/interface.go] is well-documented with clear method contracts. Removing the oldinfra.CacheProviderinterface is the right abstraction. - Builder pattern:
SandboxManagerBuilderandInfraBuildermake dependency injection explicit and improve testability.WithCustomInfrais a clean extension point. - Generic reconcilers:
CustomReconciler[T]andWaitReconciler[T]in [pkg/cache/controllers/cache_controllers.go] eliminate boilerplate across resource types. - Single source of truth for indexes:
GetIndexFuncs()shared between production and test prevents drift. - MockManager: Comprehensive mock implementation with
WithFailOnNthAddand wait simulation enables rigorous error-path and async testing. - WaitEntry type-safety: Generic
WaitEntry[T]withCheckFunc[T]provides compile-time type safety for async waits. - Test helpers:
cachetest.NewTestCacheandResourceVersionInterceptorFuncsprovide solid, reusable test infrastructure. - All controller tests are table-driven with descriptive names and
expectError stringpattern, matching project conventions. - Migration path clean: Legacy
Cachedeleted;CacheV2already renamed toCache— no transitional naming cruft remains.
Summary
| Category | Count | Details |
|---|---|---|
| Critical Issues | 3 | UnsafeDisableDeepCopy, Run error propagation, Wait hook reuse |
| Warnings | 5 | Background context, Dead code, Singleflight+deepcopy, License, unnecessary allocation |
| Suggestions | 5 | Duplication, timeout constant, test size, template lookup perf, generated files |
| Positive | 9 | Interface design, patterns, tests, test infrastructure |
The architecture and design direction are excellent. The three critical issues are correctness concerns that should be resolved before merge — particularly #1 (UnsafeDisableDeepCopy) which is the highest risk item in the entire diff.
🤖 Generated with Qoder
f6baf81 to
4fcc4d0
Compare
589bd32 to
ac374c1
Compare
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: furykerry The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Summary
This PR is a work-in-progress refactoring to migrate the sandbox-manager cache layer from the legacy informer-based implementation to a controller-runtime based architecture. The new implementation is temporarily named
CacheV2during the transition period, and will be renamed toCacheonce the migration is complete and the legacy implementation is removed.Background
The current cache implementation uses custom informers with manual lifecycle management. This refactoring aims to:
controller-runtime's native cache and client capabilitiesCurrent Progress
New Architecture (
pkg/sandbox-manager/infra/sandboxcr/cache/)Core Components:
CacheV2: Controller-runtime based cache (will be renamed toCacheafter migration)Cache Controllers (
cache/controllers/):CustomReconciler[T]for extensible event handlingWaitReconciler[T]for async wait operationsMockManagerfor testingUtilities (
cache/utils/):WaitEntry[T]Migration Status
Testing
NewTestCacheV2Notes
CacheV2naming is temporary to allow parallel existence during migration