KEP: introduce Federated Stateful Rollout (Coordinated Blue-Green Migration)#7367
liwang0513 wants to merge 2 commits into `karmada-io:master`
Conversation
**Summary of Changes** (Gemini Code Assist): This pull request introduces the design proposal for a new Federated Stateful Rollout feature within the Karmada ecosystem. The primary objective is to enable coordinated, stateful migrations of workloads across clusters, specifically targeting streaming applications such as Apache Flink. By automating the transition lifecycle, the proposal aims to ensure zero data loss and eliminate reprocessing lag during planned maintenance or upgrades.
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver for each of the affected files.
Code Review
This pull request proposes a Federated Stateful Rollout mechanism for coordinated blue-green migrations of stateful workloads, specifically targeting Flink applications. The feedback suggests standardizing the controller name, providing detailed examples for state handoff, and utilizing the ResourceInterpreter framework for operator-agnosticism. Additionally, the reviewer recommends using idiomatic scheduling suspension instead of custom finalizers and requests clarification on the replica count logic in ResourceBinding patches to prevent unintended scaling.
> | **3. Expansion** | Controller patches `ResourceBinding` (replicas: 2) and adds a Finalizer. | Transparent |
> | **4. Hold** | Controller applies `ClusterOverridePolicy` to Target (state: `suspended`). | Transparent |
> | **5. Capture** | Controller patches Source to `suspended`, triggers Savepoint. | Transparent |
> | **6. Handoff** | Controller injects Savepoint URL into Target Override and flips to `running`. | Transparent |
Phase 6 of the state machine mentions injecting the Savepoint URL into the Target Override. It would be beneficial to include a concrete example of how this injection is structured in the ClusterOverridePolicy within the 'Design Details' section, especially since stateful workloads like Flink typically require specific fields (e.g., initialSavepointPath) to be set for resuming from a state.
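A sketch of what such an injection could look like. This assumes the `FlinkDeployment` CRD from the Flink Kubernetes Operator (fields `spec.job.initialSavepointPath` and `spec.job.state`) and Karmada's plaintext overrider; the policy name, workload name, and savepoint URL are illustrative, not part of the proposal:

```yaml
# Hypothetical ClusterOverridePolicy the controller could generate for the
# Target cluster during Handoff. Paths assume the FlinkDeployment CRD; all
# names and the savepoint URL are placeholders.
apiVersion: policy.karmada.io/v1alpha1
kind: ClusterOverridePolicy
metadata:
  name: stateful-rollout-handoff
spec:
  resourceSelectors:
    - apiVersion: flink.apache.org/v1beta1
      kind: FlinkDeployment
      name: my-flink-job
  overrideRules:
    - targetCluster:
        clusterNames:
          - target-cluster
      overriders:
        plaintext:
          - path: /spec/job/initialSavepointPath
            operator: add
            value: "s3://savepoints/my-flink-job/sp-1234"
          - path: /spec/job/state
            operator: replace
            value: "running"
```

Ordering matters here: the `initialSavepointPath` override must be in place before the `state` flip is applied, so the target never starts from an empty state.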
> ## Design Details
>
> ### The "Hold" Pattern via ClusterOverridePolicy
The proposal states that the controller is operator-agnostic and leverages the ResourceInterpreter framework. However, the 'Hold' pattern example (lines 55-73) uses Flink-specific paths (/spec/job/state) and values (suspended). To truly achieve operator-agnosticism, the controller should ideally obtain these 'suspend/resume' patches from the ResourceInterpreter or a configurable template, rather than having them hardcoded or specifically tailored for Flink in the core logic.
> ```yaml
> value: "suspended"
> ```
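One way to realize the reviewer's suggestion is to source suspend/resume semantics from per-workload-type configuration instead of the core controller. The following is purely a sketch; this ConfigMap-based template mechanism is not an existing Karmada API, and all keys and paths are illustrative:

```yaml
# Hypothetical per-workload-type configuration supplying the suspend/resume
# patches, keeping the core controller operator-agnostic. Not an existing
# Karmada resource; shown only to illustrate the review suggestion.
apiVersion: v1
kind: ConfigMap
metadata:
  name: stateful-rollout-templates
  namespace: karmada-system
data:
  flinkdeployment.flink.apache.org: |
    suspendPatch:
      path: /spec/job/state
      value: "suspended"
    resumePatch:
      path: /spec/job/state
      value: "running"
    savepointStatusPath: .status.jobStatus.savepointInfo.lastSavepoint.location
```

With such a template, supporting another operator would mean adding a new data key rather than changing controller code.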
> ### ResourceBinding Manipulation
>
> In environments with `maxGroups: 1`, the controller must manually expand the `ResourceBinding` to allow coexistence during the handoff. A Migration Finalizer is added to prevent the Scheduler from reverting the expansion during the transition window.
Using a 'Migration Finalizer' to prevent the scheduler from reverting ResourceBinding changes is a non-standard use of finalizers, which are typically for blocking deletion. A more idiomatic approach in Karmada would be to use the spec.suspension.scheduling field in the ResourceBinding to pause scheduler reconciliation during the migration window. This avoids the need for custom scheduler logic to respect a specific finalizer.
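For reference, the idiomatic alternative could look like the following, assuming the `spec.suspension` field available on `ResourceBinding` in recent Karmada releases (field availability depends on version; names of clusters are placeholders):

```yaml
# Pausing scheduler reconciliation during the migration window instead of
# relying on a custom finalizer. Assumes ResourceBinding.spec.suspension
# is available in the Karmada version in use.
apiVersion: work.karmada.io/v1alpha2
kind: ResourceBinding
metadata:
  name: my-flink-job-flinkdeployment
spec:
  suspension:
    scheduling: true   # scheduler stops revising placement while set
  clusters:
    - name: source-cluster
      replicas: 1
    - name: target-cluster
      replicas: 1
```

The controller would set `scheduling: true` before expanding `spec.clusters` and clear it once the migration reaches a terminal state.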
Pull request overview
This pull request introduces a Kubernetes Enhancement Proposal (KEP) for "Federated Stateful Rollout," a feature that enables coordinated, zero-downtime migrations of stateful workloads (specifically Apache Flink) across multiple Karmada clusters. The proposal describes a 7-phase state machine that automates the "Suspend-Capture-Resume" lifecycle, leveraging synchronous Savepoints to preserve state during migrations while maintaining a "hot standby" on the source cluster until the target is confirmed healthy.
Changes:
- Introduces a new proposal document outlining the Federated Stateful Rollout feature design
- Describes a state machine-based orchestration approach using ResourceBinding manipulation and ClusterOverridePolicy
- Specifies migration trigger mechanisms (cluster taints and annotations) and state transition phases
> ## Design Details
The document is missing important sections commonly found in KEP proposals. Specifically: 1) An API Design section showing concrete changes to CRDs or new resource types, 2) Implementation Details explaining how the ResourceInterpreter will be leveraged for state discovery, 3) An explicit feature gate definition for StatefulMigration. While the high-level design is clear, these details are important for implementation and review.
> A developer updates the `FlinkDeployment` image. The controller orchestrates a Blue-Green move. If the new image fails to initialize in the target cluster, the controller aborts and resumes the original job in the source cluster, providing an automated safety net.
>
> ## Risks and Mitigations
>
> - Risk: Split-Brain. Multiple clusters writing to the same sink.
> - Mitigation: Strict "Suspend-before-Resume" sequence confirmed via `ResourceInterpreter` status aggregation.
In Story 2 (Atomic Image Upgrade), the document states the controller will "abort and resumes the original job in the source cluster" if the new image fails. However, the 7-phase state machine in the Transition State Machine table does not include an abort/rollback phase. How is the rollback sequence implemented if the target cluster fails after the Hold override is applied?
Suggested change (replacing the quoted passage above):

> A developer updates the `FlinkDeployment` image. The controller orchestrates a Blue-Green move. If the new image fails to initialize in the target cluster after the source cluster has already been placed in `suspended` state, the controller performs an explicit abort sequence rather than advancing to cutover: (1) keep the source job suspended, (2) mark the target startup/restore as failed, (3) delete or scale down the target workload so no writer remains active there, (4) remove the temporary Hold/Suspend override from the source (or restore the pre-migration source spec), and then (5) resume the original job in the source cluster. This rollback is an error transition from target bring-up/validation, not a separate steady-state phase, and it preserves the same single-writer invariant as the forward path.
>
> ## Risks and Mitigations
>
> - Risk: Split-Brain. Multiple clusters writing to the same sink.
> - Mitigation: Strict "Suspend-before-Resume" sequence confirmed via `ResourceInterpreter` status aggregation.
> - Mitigation: Rollback follows the inverse ordering: "Target-stopped-before-Source-resume". The controller must observe the target as failed and inactive before removing the source suspend override.
> In environments with `maxGroups: 1`, the controller must manually expand the `ResourceBinding` to allow coexistence during the handoff. A Migration Finalizer is added to prevent the Scheduler from reverting the expansion during the transition window.
>
> ```yaml
> # Internal ResourceBinding Patch
> spec:
>   clusters:
>     - name: spaas-kaas-tt-dev02 # Source
>       replicas: 1
>     - name: spaas-kaas-pw-dev02 # Target
> ```
The ResourceBinding Manipulation section mentions that the controller manually expands the ResourceBinding and adds a "Migration Finalizer". However, the concrete finalizer name and its cleanup logic are not defined. What is the exact finalizer name and how does it interact with the Scheduler's scheduling decisions?
Suggested change:

> In environments with `maxGroups: 1`, the controller must manually expand the `ResourceBinding` to allow coexistence during the handoff. To make this behavior explicit and deterministic, the controller adds the finalizer `rollout.karmada.io/migration-protect` to the `ResourceBinding` before patching `spec.clusters`.
>
> While `rollout.karmada.io/migration-protect` is present, the rollout controller treats the manually expanded binding as authoritative for the duration of the migration and MUST ignore Scheduler decisions that would collapse the binding back to the normal single-cluster placement implied by `maxGroups: 1`. In other words, the Scheduler may continue to compute its preferred placement, but the controller MUST NOT apply a scheduling update that removes either the source or target cluster until the migration has reached a terminal state.
>
> Cleanup logic:
>
> - Success path: remove `rollout.karmada.io/migration-protect` only after the source workload is confirmed suspended or terminated, the target workload is confirmed healthy/running, and the binding has been reconciled back to the desired steady-state single-cluster placement.
> - Rollback path: if target startup fails, restore the original single-cluster placement on the source cluster first, then remove `rollout.karmada.io/migration-protect`.
> - While the finalizer is present, any Scheduler recommendation that conflicts with the temporary dual-cluster expansion is deferred rather than applied.
>
> ```yaml
> # Internal ResourceBinding Patch
> metadata:
>   finalizers:
>     - rollout.karmada.io/migration-protect
> spec:
>   clusters:
>     - name: spaas-kaas-tt-dev02 # Source
>       replicas: 1
>     - name: spaas-kaas-pw-dev02 # Target
> ```
> ### Transition State Machine
>
> | Phase | Action | Visibility |
> | :--- | :--- | :--- |
> | **1. Trigger** | User taints a cluster or adds a migration annotation. | User-Visible |
> | **2. Discovery** | Controller identifies active cluster via `ResourceBinding` status. | Transparent |
> | **3. Expansion** | Controller patches `ResourceBinding` (replicas: 2) and adds a Finalizer. | Transparent |
> | **4. Hold** | Controller applies `ClusterOverridePolicy` to Target (state: `suspended`). | Transparent |
> | **5. Capture** | Controller patches Source to `suspended`, triggers Savepoint. | Transparent |
> | **6. Handoff** | Controller injects Savepoint URL into Target Override and flips to `running`. | Transparent |
> | **7. Cleanup** | Controller removes Source from `ResourceBinding` and deletes Overrides. | Transparent |
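The seven phases in the table could be encoded as an explicit state machine in the controller. A minimal Go sketch, with a terminal `Failed` state added for the abort path discussed later in review (all identifiers are illustrative, not from the proposal):

```go
package main

import "fmt"

// Phase enumerates the migration states from the table above, plus a
// terminal Failed state for the abort/rollback path.
type Phase int

const (
	Trigger Phase = iota
	Discovery
	Expansion
	Hold
	Capture
	Handoff
	Cleanup
	Failed
)

// next defines the forward transitions; every non-terminal phase may also
// transition to Failed, which triggers the rollback sequence.
var next = map[Phase]Phase{
	Trigger:   Discovery,
	Discovery: Expansion,
	Expansion: Hold,
	Hold:      Capture,
	Capture:   Handoff,
	Handoff:   Cleanup,
}

// Advance returns the next phase when the current step succeeded (ok),
// or Failed otherwise. Cleanup and Failed are terminal.
func Advance(p Phase, ok bool) Phase {
	if !ok {
		return Failed
	}
	if n, exists := next[p]; exists {
		return n
	}
	return p
}

func main() {
	p := Trigger
	for p != Cleanup {
		p = Advance(p, true)
	}
	fmt.Println(p == Cleanup) // true
}
```

Making the `Failed` transitions explicit forces every phase to define its own rollback behavior, which directly addresses the missing-abort-phase question raised in review.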
In the Transition State Machine table, Phase 2 "Discovery" identifies the active cluster via "ResourceBinding status," but Phase 3 "Expansion" modifies the ResourceBinding to include both source and target clusters (replicas: 2). How does the controller distinguish between the current/source cluster (running the job) and the target cluster (where to migrate) during the Discovery phase if both are listed in ResourceBinding.spec.clusters?
> # Federated Stateful Rollout: Coordinated Blue-Green Migration for Flink
>
> ## Summary
>
> The **Federated Stateful Rollout** feature introduces a proactive orchestration mechanism for stateful workloads across multiple clusters. While the existing `StatefulFailover` handles unplanned outages (reactive), this feature manages planned operations such as regional rebalancing, cluster maintenance, and safe image upgrades. By coordinating a "Suspend-Capture-Resume" lifecycle, it ensures **Zero Data Loss** and eliminates **Reprocessing Lag** by utilizing synchronous Savepoints.
There's an apparent contradiction regarding operator scope. The PR description claims the controller is "operator-agnostic" and "leverages the ResourceInterpreter framework to 'read' savepoint metadata from Flink status fields without hardcoding Flink logic." However, the entire proposal document (title, examples, user stories) is exclusively focused on Flink/FlinkDeployment. If the feature is meant to be operator-agnostic, the proposal should include generic abstractions and examples for other streaming frameworks (e.g., Spark).
> - Mitigation: Strict "Suspend-before-Resume" sequence confirmed via `ResourceInterpreter` status aggregation.
The "Risks and Mitigations" section identifies a critical risk: "Split-Brain - Multiple clusters writing to the same sink." However, the proposed mitigation ("Strict 'Suspend-before-Resume' sequence confirmed via ResourceInterpreter status aggregation") lacks concrete detail. How exactly does the ResourceInterpreter verify that the source is fully suspended before the target resumes? What happens if there's a race condition or network delay during the transition?
Suggested change:

> - Mitigation: Strict "Suspend-before-Resume" sequence confirmed via `ResourceInterpreter` status aggregation. The handoff is fail-closed:
>   1. The controller first applies the source override (`/spec/job/state: suspended`) and records the source object's `metadata.generation`.
>   2. `ResourceInterpreter` reports the source as safe-to-handoff only when the source-cluster status has reconciled that exact intent, i.e. `status.observedGeneration >= metadata.generation` and the workload-specific status indicates the job is fully suspended/quiesced (for Flink, no running job remains for that deployment and no further writes are expected from the source).
>   3. Only after that fresh source acknowledgement is aggregated does the controller remove the suspend on the target cluster.
>   4. If the aggregated status is missing, stale, contradictory, or does not arrive before timeout, the controller keeps the target suspended and retries or aborts the migration; it must never resume the target on assumption alone.
> - Mitigation (race conditions / network delay): status is evaluated against generation, not wall-clock ordering. A delayed "suspended" report for an older generation is ignored, and loss of connectivity to the source cluster is treated as "not yet safe to resume target". This guarantees that delayed reconciliation can only prolong the handoff, never produce dual writers.
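The generation-gated check in step 2 can be expressed as a small predicate. A sketch under the assumption that the aggregated status exposes an observed generation, a suspended flag, and a freshness signal (all field and type names are illustrative):

```go
package main

import "fmt"

// SourceStatus is an illustrative aggregation of the source cluster's
// workload status as collected via the ResourceInterpreter.
type SourceStatus struct {
	ObservedGeneration int64 // generation the member cluster has reconciled
	Suspended          bool  // workload-specific "fully quiesced" signal
	Fresh              bool  // false if the status report is stale or missing
}

// SafeToHandoff returns true only when the source cluster has acknowledged
// the exact suspend intent (by generation) with a fresh status report.
// Missing, stale, or older-generation reports keep the target suspended.
func SafeToHandoff(intentGeneration int64, s *SourceStatus) bool {
	if s == nil || !s.Fresh {
		return false // lost connectivity => not safe to resume target
	}
	if s.ObservedGeneration < intentGeneration {
		return false // delayed report for an older generation is ignored
	}
	return s.Suspended
}

func main() {
	// A delayed "suspended" report for generation 4 must not unblock a
	// handoff whose suspend intent was recorded at generation 5.
	stale := &SourceStatus{ObservedGeneration: 4, Suspended: true, Fresh: true}
	fresh := &SourceStatus{ObservedGeneration: 5, Suspended: true, Fresh: true}
	fmt.Println(SafeToHandoff(5, stale), SafeToHandoff(5, fresh)) // false true
}
```

Because every failure mode collapses to `false`, the predicate is fail-closed by construction: the only way to resume the target is a fresh, generation-matching, suspended acknowledgement.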
> - Manual Scripting: Rejected as error-prone and unsafe for Exactly-Once requirements.
> - New Federated CRD: Rejected to avoid API sprawl. Using standard `FlinkDeployment` + Karmada Overrides is more sustainable.
The proposal document is missing a "Test Plan" section, which is typically included in KEP proposals to outline the testing strategy. While it's noted as "Not required until targeted at a release," including at least a basic test plan would strengthen the proposal and clarify how the feature will be validated.
Suggested addition:

> ## Test Plan
>
> - Unit Tests:
>   - Validate the controller state machine for Suspend-before-Resume sequencing.
>   - Verify `ResourceBinding` expansion and finalizer handling during migration.
>   - Verify rollback behavior when target cluster initialization or restore fails.
> - Integration Tests:
>   - Simulate a planned migration between source and target clusters using a Savepoint and verify successful resume in the target cluster.
>   - Verify that only one active workload writes to the sink at any time to prevent split-brain.
>   - Validate cleanup of temporary `ResourceBinding` changes after migration completion or abort.
> - Manual / End-to-End Validation:
>   - Perform a cluster maintenance scenario and confirm the workload is resumed in the target cluster with expected state continuity.
>   - Perform an image upgrade scenario and confirm automatic rollback when the target deployment fails to become healthy.
>   - Confirm status conditions and events clearly reflect migration progress, success, and failure states.
**Codecov Report** ✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##           master    #7367      +/-   ##
==========================================
+ Coverage   42.04%   42.07%   +0.02%
==========================================
  Files         874      874
  Lines       53544    53545       +1
==========================================
+ Hits        22515    22529      +14
+ Misses      29341    29320      -21
- Partials     1688     1696       +8
```
Commits:
- "…ration)" — Signed-off-by: liwang0513 <liwang0513@gmail.com> (force-pushed 1b4f89e → b71da42)
- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>, Signed-off-by: liwang0513 <liwang0513@gmail.com> (force-pushed 73a2a05 → 8c7c45f)
Keywords which can automatically close issues and at(@) or hashtag(#) mentions are not allowed in commit messages. The list of commits with invalid commit messages:
What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR introduces the `StatefulMigrationController` within the `karmada-controller-manager` to enable coordinated, stateful migrations for stateful applications (Apache Flink as an example).

Why we need it:
- Planned migrations currently require manual, error-prone coordination of `ResourceBindings` and `Overrides`.
- The source cluster is kept as a hot standby until the target job is `RUNNING`, providing a safety net for critical stateful jobs.

Which issue(s) this PR fixes:

Fixes #7291

Special notes for your reviewer:
- Operator-agnostic core: the controller leverages the `ResourceInterpreter` framework to "read" savepoint metadata from Flink status fields without hardcoding Flink logic into the core.
- To satisfy `maxGroups: 1` constraints during migration, the controller temporarily expands the `ResourceBinding.spec.clusters` list. A Migration Finalizer is implemented to prevent the Scheduler from reverting this expansion until the handoff is complete.
- A `ClusterOverridePolicy` keeps the target instance `suspended` until the captured Savepoint URL is injected.

Does this PR introduce a user-facing change?:

Introduced the `Federated Stateful Rollout` feature (protected by the `StatefulMigration` feature gate). This allows users to perform coordinated, zero-backlog migrations of Flink workloads across clusters by using the `karmada.io/migration-target` annotation or cluster taints.
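For illustration, triggering a migration via the annotation named in the release note might look like the following; the workload name and the annotation value format (a bare cluster name) are assumptions for this sketch:

```yaml
# Hypothetical trigger: annotating the federated workload with the desired
# target cluster. The annotation key comes from the release note; the value
# format and resource names are assumed for illustration.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: my-flink-job
  annotations:
    karmada.io/migration-target: target-cluster
spec:
  job:
    state: running
    upgradeMode: savepoint
```

Alternatively, per the same release note, tainting the source cluster would trigger the controller without touching the workload object itself.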