
refactor(sandbox-manager): migrate cache layer to controller-runtime architecture #287

Merged
furykerry merged 2 commits into openkruise:master from AiRanthem:feature/controller-runtime-for-manager-260409
Apr 29, 2026


Member

@AiRanthem AiRanthem commented Apr 16, 2026

Summary

This PR is a work-in-progress refactoring to migrate the sandbox-manager cache layer from the legacy informer-based implementation to a controller-runtime based architecture. The new implementation is temporarily named CacheV2 during the transition period, and will be renamed to Cache once the migration is complete and the legacy implementation is removed.

Background

The current cache implementation uses custom informers with manual lifecycle management. This refactoring aims to:

  1. Leverage controller-runtime's native cache and client capabilities
  2. Improve testability with fake client support
  3. Align with standard Kubernetes operator patterns
  4. Simplify the codebase by removing duplicate informer management logic

Current Progress

New Architecture (pkg/sandbox-manager/infra/sandboxcr/cache/)

Core Components:

  • CacheV2: Controller-runtime based cache (will be renamed to Cache after migration)
  • Field index system for efficient queries
  • Configurable informer filtering (namespace + label selectors)
  • Proper lifecycle management with context

Cache Controllers (cache/controllers/):

  • Generic CustomReconciler[T] for extensible event handling
  • Generic WaitReconciler[T] for async wait operations
  • Resource-specific controllers for Sandbox and SandboxSet
  • MockManager for testing

Utilities (cache/utils/):

  • Type-safe wait hooks with WaitEntry[T]
  • Resource version conflict handling for tests

Migration Status

Component | Status
CacheV2 core implementation | ✅ Done
Cache controllers | ✅ Done
Field indexes | ✅ Done
Unit tests | ✅ Done
Integration tests | ✅ Done
Legacy Cache removal | ✅ Done
Rename CacheV2 → Cache | ✅ Done

Testing

  • Unit tests updated to use NewTestCacheV2
  • New tests for all cache controllers
  • Mock manager with wait simulation for async testing

Notes

  • This is a breaking internal change but the external interface remains stable
  • The CacheV2 naming is temporary to allow parallel existence during migration
  • Once fully validated, a follow-up PR will perform the rename and cleanup

@AiRanthem AiRanthem force-pushed the feature/controller-runtime-for-manager-260409 branch 2 times, most recently from 2e4ca01 to 2d56f8c Compare April 20, 2026 12:53

codecov Bot commented Apr 20, 2026

Codecov Report

❌ Patch coverage is 74.53581% with 288 lines in your changes missing coverage. Please review.
✅ Project coverage is 74.58%. Comparing base (ff7daba) to head (ac374c1).
⚠️ Report is 1 commit behind head on master.

Files with missing lines | Patch % | Lines
pkg/cache/cache.go | 57.71% | 62 Missing and 12 partials ⚠️
pkg/cache/cachetest/cachetest.go | 0.00% | 32 Missing ⚠️
pkg/peers/memberlist.go | 54.00% | 19 Missing and 4 partials ⚠️
pkg/servers/e2b/core.go | 33.33% | 20 Missing and 2 partials ⚠️
pkg/cache/index.go | 61.81% | 15 Missing and 6 partials ⚠️
pkg/cache/controllers/test_helpers.go | 85.59% | 16 Missing and 1 partial ⚠️
pkg/sandbox-manager/core.go | 75.75% | 15 Missing and 1 partial ⚠️
pkg/sandbox-manager/infra/sandboxcr/sandbox.go | 69.76% | 8 Missing and 5 partials ⚠️
pkg/cache/controllers/cache_controllers.go | 86.41% | 7 Missing and 4 partials ⚠️
pkg/cache/tasks.go | 81.81% | 6 Missing and 4 partials ⚠️
... and 11 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #287      +/-   ##
==========================================
+ Coverage   74.03%   74.58%   +0.55%     
==========================================
  Files         128      141      +13     
  Lines        9488     9835     +347     
==========================================
+ Hits         7024     7335     +311     
- Misses       2139     2189      +50     
+ Partials      325      311      -14     
Flag | Coverage Δ
unittests | 74.58% <74.53%> (+0.55%) ⬆️


@AiRanthem AiRanthem force-pushed the feature/controller-runtime-for-manager-260409 branch from e0edb8e to 1e9d2a2 Compare April 21, 2026 08:03
Comment thread pkg/cache/cache.go
Comment thread pkg/sandbox-manager/infra/sandboxcr/cache/cache.go Outdated
@AiRanthem AiRanthem force-pushed the feature/controller-runtime-for-manager-260409 branch from 1e9d2a2 to c81f7ba Compare April 22, 2026 09:40
@AiRanthem
Member Author

@furykerry DON'T PANIC, a lot of Apache license headers have just been added

@AiRanthem
Member Author

@codex review

@AiRanthem AiRanthem changed the title from "[WIP] refactor(sandbox-manager): migrate cache layer to controller-runtime architecture" to "refactor(sandbox-manager): migrate cache layer to controller-runtime architecture" Apr 22, 2026
Comment thread pkg/cache/cachetest/cachetest.go
Comment thread pkg/cache/controllers/test_helpers_test.go Outdated
Comment thread pkg/cache/utils/wait.go Outdated
Comment thread pkg/cache/utils/wait.go Outdated
Comment thread pkg/cache/utils/wait.go Outdated
Comment thread pkg/sandbox-manager/consts/consts.go Outdated
Comment thread proto/envd/process/processconnect/process.connect.go Outdated
Comment thread test/e2e/utils/utils.go Outdated
Comment thread pkg/utils/sandbox-manager/expectationutils/utils.go
Comment thread pkg/servers/e2b/templates_test.go Outdated
Comment thread pkg/cache/index.go Outdated
Comment thread pkg/peers/memberlist.go Outdated
Comment thread pkg/utils/csiutils/storages_provider.go Outdated
Comment thread pkg/sandbox-manager/infra/sandboxcr/infra.go Outdated
Comment thread pkg/sandbox-manager/infra/sandboxcr/infra.go Outdated
Comment thread pkg/cache/cache_test.go Outdated
@AiRanthem AiRanthem force-pushed the feature/controller-runtime-for-manager-260409 branch 2 times, most recently from 911ccf4 to 6ae53a1 Compare April 27, 2026 03:30
Member

@furykerry furykerry left a comment


Code Review: PR #287 — Migrate sandbox-manager cache layer to controller-runtime architecture

Review session: 2026-04-27T10:30
Scope: 218 files, +11,162 / -4,289 lines


Critical Issues (Must Fix)

1. DefaultUnsafeDisableDeepCopy: ptr.To(true) — latent data race in all cache read paths

  • Location: pkg/cache/cache.go (NewControllerManager, around L135 in the patch)
  • Issue: This disables deep-copy for ALL cached objects globally. Every GetClaimedSandbox, GetCheckpoint, PickSandboxSet, ListSandboxesInPool returns pointers directly into the informer store. Combined with singleflight.Group, concurrent callers receive the same pointer without deep-copy guarantees.
  • Why: Any mutation of a returned object (even a well-intentioned one) corrupts the cache for all other consumers. The retryUpdate function in sandbox.go does copied := sbx.DeepCopy(), but this is not consistently enforced across all callers. This is a data-corruption landmine in production.
  • Fix: Either:
    • (a) Remove DefaultUnsafeDisableDeepCopy entirely and rely on controller-runtime's default deep-copy behavior, OR
    • (b) Keep it but ensure every read method performs an explicit deep-copy before returning, and document that callers must treat returned objects as read-only.
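
To make the hazard in issue 1 concrete, here is a minimal standard-library sketch of option (b). It does not use controller-runtime: Store, GetUnsafe, GetCopy and the simplified Sandbox type are hypothetical stand-ins for the informer store and the cache read methods, chosen only to show why a read path that hands out the stored pointer lets one caller change what every other caller sees.

```go
package main

import (
	"fmt"
	"sync"
)

// Sandbox is a hypothetical, simplified stand-in for a cached API object.
type Sandbox struct {
	Name   string
	Labels map[string]string
}

// DeepCopy returns a copy that shares no memory with the receiver.
func (s *Sandbox) DeepCopy() *Sandbox {
	out := &Sandbox{Name: s.Name, Labels: make(map[string]string, len(s.Labels))}
	for k, v := range s.Labels {
		out.Labels[k] = v
	}
	return out
}

// Store is a toy replacement for the informer store (an assumption for
// illustration, not the real cache type).
type Store struct {
	mu    sync.RWMutex
	items map[string]*Sandbox
}

func NewStore() *Store { return &Store{items: map[string]*Sandbox{}} }

func (s *Store) Put(sbx *Sandbox) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.items[sbx.Name] = sbx
}

// GetUnsafe returns the stored pointer itself, which mirrors what a read
// looks like when deep-copy is disabled.
func (s *Store) GetUnsafe(name string) *Sandbox {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.items[name]
}

// GetCopy deep-copies before returning, so callers may mutate the result.
func (s *Store) GetCopy(name string) *Sandbox {
	s.mu.RLock()
	defer s.mu.RUnlock()
	if sbx, ok := s.items[name]; ok {
		return sbx.DeepCopy()
	}
	return nil
}

func main() {
	store := NewStore()
	store.Put(&Sandbox{Name: "sbx-1", Labels: map[string]string{"state": "idle"}})

	// Mutating a copy leaves the stored object untouched.
	store.GetCopy("sbx-1").Labels["state"] = "claimed"
	fmt.Println(store.GetUnsafe("sbx-1").Labels["state"])

	// Mutating the shared pointer changes the stored object for everyone.
	store.GetUnsafe("sbx-1").Labels["state"] = "claimed"
	fmt.Println(store.GetUnsafe("sbx-1").Labels["state"])
}
```

The trade-off is the usual one: copying on every read costs allocations, which is presumably why deep-copy was disabled in the first place, so a read-only contract plus copies only on write paths is the cheaper but harder-to-enforce alternative.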

2. Cache.Run swallows manager start errors — silent failure path

  • Location: pkg/cache/cache.go (Run method, around L190 in the patch)
  • Issue: mgr.Start() is launched in a goroutine and its error is only logged, never returned to the caller. If the manager fails to start (before WaitForCacheSync checks), the caller sees err == nil and proceeds as if everything is operational.
  • Why: A partially failed cache will cause cascading failures downstream with confusing errors, or worse, silently return stale/empty data from an unpopulated informer store.
  • Fix: Use an errgroup or a channel to propagate mgr.Start() errors synchronously, or check that the manager actually started before returning nil:
// Buffered so the goroutine never blocks if nobody reads the result.
startErrCh := make(chan error, 1)
go func() {
    startErrCh <- c.mgr.Start(mgrCtx)
}()
cache := c.mgr.GetCache()
if cache != nil && !cache.WaitForCacheSync(ctx) {
    return fmt.Errorf("timed out waiting for caches to sync")
}
// Non-blocking check: surfaces a start error that has already arrived.
select {
case err := <-startErrCh:
    if err != nil {
        return fmt.Errorf("controller manager failed to start: %w", err)
    }
default:
}
return nil

3. Wait hook reuse silently ignores second caller's satisfiedFunc

  • Location: pkg/cache/utils/wait.go (WaitForObjectSatisfied, around L116-L138 in the patch)
  • Issue: When LoadOrStore finds an existing entry with the same action, the code reuses it. But the existing entry has a different satisfiedFunc from the current caller. The second caller's condition may never be checked, causing the second wait to either never complete or complete on the wrong condition.
  • Why: If two concurrent waits target the same sandbox with the same action (e.g., both WaitReady) but different conditions (e.g., one checks Phase == Running, another checks PodIP != ""), the second caller gets the first caller's checker. This is a correctness bug.
  • Fix: Compare the satisfiedFunc (or a hash) when reusing an entry, or return an error when an existing entry has a different checker:
if evaluateIfDifferent(entry.checker, satisfiedFunc) {
    return fmt.Errorf("another wait entry exists with different checker")
}

Warnings (Should Fix)

4. AddIndexesToCache uses context.Background() — should be parameterized

  • Location: pkg/cache/index.go (AddIndexesToCache, around L150 in the patch)
  • Issue: Index field registration uses hardcoded context.Background().
  • Recommendation: Accept a ctx context.Context parameter so callers can control cancellation/timeout:
func AddIndexesToCache(ctx context.Context, c ctrlcache.Cache) error

5. newSbx == nil — dead code in InplaceRefresh

  • Location: pkg/sandbox-manager/infra/sandboxcr/sandbox.go (InplaceRefresh method)
  • Issue: newSbx is always &agentsv1alpha1.Sandbox{} so the nil check is unreachable. If the API reader path fails, newSbx is still non-nil but empty — the code should check err != nil instead.
  • Recommendation: Remove the dead nil check and handle the error from the API reader fallback explicitly.

6. singleflight.Group in cache read methods — unintended side effects with UnsafeDisableDeepCopy

  • Location: pkg/cache/cache.go (GetClaimedSandbox, GetCheckpoint, PickSandboxSet, ListSandboxesInPool)
  • Issue: singleflight.Group returns a shared result to concurrent callers. Without deep-copy, multiple callers hold the same pointer into the store. If one caller mutates it, all others see the mutation.
  • Recommendation: If UnsafeDisableDeepCopy is kept, add explicit deep-copies before returning from cache read methods.

7. License header year inconsistency across new files

  • Location: Multiple new files in pkg/cache/
  • Issue: Some files use Copyright 2025 (e.g., cache.go, index.go, controllers/cache_controllers.go), others use Copyright 2026 (e.g., interface.go, utils/wait.go). Changes to existing files correctly added Copyright 2026.
  • Recommendation: Standardize all new-file license headers to Copyright 2026 for consistency.

8. ResourceVersionInterceptorFuncs — unnecessary DeepCopyObject() before Get

  • Location: pkg/cache/utils/client.go (ResourceVersionInterceptorFuncs, around L40 in the patch)
  • Issue: latest := obj.DeepCopyObject().(ctrlclient.Object) allocates a full copy that is immediately overwritten by client.Get().
  • Recommendation: Remove the deep-copy before Get. The current code, for reference:
Update: func(ctx context.Context, client ctrlclient.WithWatch, obj ctrlclient.Object, opts ...ctrlclient.UpdateOption) error {
    // Create an empty object to receive the latest version
    latest := obj.DeepCopyObject().(ctrlclient.Object)
    // Set empty resourceVersion so Get can fill it in
    latest.SetResourceVersion("")
    if err := client.Get(ctx, types.NamespacedName{Name: obj.GetName(), Namespace: obj.GetNamespace()}, latest); err == nil {
        obj.SetResourceVersion(latest.GetResourceVersion())
    }
    return client.Update(ctx, obj, opts...)
},

Suggestions (Consider Improving)

  • getTemplateFromSandbox duplication: [pkg/cache/index.go] duplicates the logic from [pkg/sandbox-manager/infra/sandboxcr/infra.go]. The comment acknowledges the circular import avoidance, but consider extracting to pkg/utils/sandboxutils as a shared function.
  • Hardcoded 30s post-resume timeout: In [pkg/sandbox-manager/infra/sandboxcr/sandbox.go] Resume method, the context.WithTimeout(context.Background(), 30*time.Second) is arbitrary. Consider making this a named constant or config option.
  • 570-line wait_test.go: [pkg/cache/utils/wait_test.go] could be split by concern (WaitHookKey tests, WaitEntry tests, WaitForObjectSatisfied tests) for better readability.
  • HasTemplate changed from lock-free sync.Map to informer read: Old implementation tracked SandboxSet existence via a sync.Map updated by event handlers. New implementation calls PickSandboxSet (which goes through singleflight) on every HasTemplate call. If this is a hot path, consider caching the result.
  • Generated files check: client/ has been modified (generic_client.go, registry.go) — verify these are intentionally edited or whether they should be regenerated via make generate.

Positive Findings

  • Clean interface extraction: cache.Provider in [pkg/cache/interface.go] is well-documented with clear method contracts. Removing the old infra.CacheProvider interface is the right abstraction.
  • Builder pattern: SandboxManagerBuilder and InfraBuilder make dependency injection explicit and improve testability. WithCustomInfra is a clean extension point.
  • Generic reconcilers: CustomReconciler[T] and WaitReconciler[T] in [pkg/cache/controllers/cache_controllers.go] eliminate boilerplate across resource types.
  • Single source of truth for indexes: GetIndexFuncs() shared between production and test prevents drift.
  • MockManager: Comprehensive mock implementation with WithFailOnNthAdd and wait simulation enables rigorous error-path and async testing.
  • WaitEntry type-safety: Generic WaitEntry[T] with CheckFunc[T] provides compile-time type safety for async waits.
  • Test helpers: cachetest.NewTestCache and ResourceVersionInterceptorFuncs provide solid, reusable test infrastructure.
  • All controller tests are table-driven with descriptive names and expectError string pattern, matching project conventions.
  • Migration path clean: Legacy Cache deleted; CacheV2 already renamed to Cache — no transitional naming cruft remains.

Summary

Category | Count | Details
Critical Issues | 3 | UnsafeDisableDeepCopy, Run error propagation, Wait hook reuse
Warnings | 5 | Background context, Dead code, Singleflight+deepcopy, License, unnecessary allocation
Suggestions | 5 | Duplication, timeout constant, test size, template lookup perf, generated files
Positive | 9 | Interface design, patterns, tests, test infrastructure

The architecture and design direction are excellent. The three critical issues are correctness concerns that should be resolved before merge — particularly #1 (UnsafeDisableDeepCopy) which is the highest risk item in the entire diff.


🤖 Generated with Qoder

@AiRanthem AiRanthem force-pushed the feature/controller-runtime-for-manager-260409 branch from f6baf81 to 4fcc4d0 Compare April 28, 2026 09:48
Comment thread pkg/sandbox-manager/infra/sandboxcr/clone.go
Comment thread pkg/cache/tasks.go Outdated
Comment thread pkg/cache/tasks.go Outdated
@AiRanthem AiRanthem force-pushed the feature/controller-runtime-for-manager-260409 branch from 589bd32 to ac374c1 Compare April 29, 2026 11:15
Member

@furykerry furykerry left a comment


/lgtm
/approve

@kruise-bot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: furykerry


@furykerry furykerry merged commit f2be466 into openkruise:master Apr 29, 2026
17 of 18 checks passed