KEP: introduce Federated Stateful Rollout (Coordinated Blue-Green Migration)#7367
liwang0513 wants to merge 2 commits into `karmada-io:master`
Conversation
**Summary of Changes** (Gemini Code Assist): This pull request introduces the design proposal for a new Federated Stateful Rollout feature within the Karmada ecosystem. The primary objective is to enable coordinated, stateful migrations of workloads across clusters, specifically targeting streaming applications such as Apache Flink. By automating the transition lifecycle, the proposal aims to ensure zero data loss and eliminate reprocessing lag during planned maintenance or upgrades.
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver for each of the affected files.
Code Review
This pull request proposes a Federated Stateful Rollout mechanism for coordinated blue-green migrations of stateful workloads, specifically targeting Flink applications. The feedback suggests standardizing the controller name, providing detailed examples for state handoff, and utilizing the ResourceInterpreter framework for operator-agnosticism. Additionally, the reviewer recommends using idiomatic scheduling suspension instead of custom finalizers and requests clarification on the replica count logic in ResourceBinding patches to prevent unintended scaling.
> | **3. Expansion** | Controller patches `ResourceBinding` (replicas: 2) and adds a Finalizer. | Transparent |
> | **4. Hold** | Controller applies `ClusterOverridePolicy` to Target (state: `suspended`). | Transparent |
> | **5. Capture** | Controller patches Source to `suspended`, triggers Savepoint. | Transparent |
> | **6. Handoff** | Controller injects Savepoint URL into Target Override and flips to `running`. | Transparent |
Phase 6 of the state machine mentions injecting the Savepoint URL into the Target Override. It would be beneficial to include a concrete example of how this injection is structured in the ClusterOverridePolicy within the 'Design Details' section, especially since stateful workloads like Flink typically require specific fields (e.g., initialSavepointPath) to be set for resuming from a state.
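A sketch of what such an injection could look like. This assumes the `FlinkDeployment` CRD from the Flink Kubernetes Operator (fields `spec.job.initialSavepointPath` and `spec.job.state`) and Karmada's plaintext overrider; the policy name, workload name, and savepoint URL are illustrative, not part of the proposal:

```yaml
# Hypothetical ClusterOverridePolicy the controller could generate for the
# Target cluster during Handoff. Paths assume the FlinkDeployment CRD; all
# names and the savepoint URL are placeholders.
apiVersion: policy.karmada.io/v1alpha1
kind: ClusterOverridePolicy
metadata:
  name: stateful-rollout-handoff
spec:
  resourceSelectors:
    - apiVersion: flink.apache.org/v1beta1
      kind: FlinkDeployment
      name: my-flink-job
  overrideRules:
    - targetCluster:
        clusterNames:
          - target-cluster
      overriders:
        plaintext:
          - path: /spec/job/initialSavepointPath
            operator: add
            value: "s3://savepoints/my-flink-job/sp-1234"
          - path: /spec/job/state
            operator: replace
            value: "running"
```

Ordering matters here: the `initialSavepointPath` override must be in place before the `state` flip is applied, so the target never starts from an empty state.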
> ## Design Details
>
> ### The "Hold" Pattern via ClusterOverridePolicy
The proposal states that the controller is operator-agnostic and leverages the ResourceInterpreter framework. However, the 'Hold' pattern example (lines 55-73) uses Flink-specific paths (/spec/job/state) and values (suspended). To truly achieve operator-agnosticism, the controller should ideally obtain these 'suspend/resume' patches from the ResourceInterpreter or a configurable template, rather than having them hardcoded or specifically tailored for Flink in the core logic.
> ```yaml
> value: "suspended"
> ```
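One way to realize the reviewer's suggestion is to source suspend/resume semantics from per-workload-type configuration instead of the core controller. The following is purely a sketch; this ConfigMap-based template mechanism is not an existing Karmada API, and all keys and paths are illustrative:

```yaml
# Hypothetical per-workload-type configuration supplying the suspend/resume
# patches, keeping the core controller operator-agnostic. Not an existing
# Karmada resource; shown only to illustrate the review suggestion.
apiVersion: v1
kind: ConfigMap
metadata:
  name: stateful-rollout-templates
  namespace: karmada-system
data:
  flinkdeployment.flink.apache.org: |
    suspendPatch:
      path: /spec/job/state
      value: "suspended"
    resumePatch:
      path: /spec/job/state
      value: "running"
    savepointStatusPath: .status.jobStatus.savepointInfo.lastSavepoint.location
```

With such a template, supporting another operator would mean adding a new data key rather than changing controller code.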
> ### ResourceBinding Manipulation
>
> In environments with `maxGroups: 1`, the controller must manually expand the `ResourceBinding` to allow coexistence during the handoff. A Migration Finalizer is added to prevent the Scheduler from reverting the expansion during the transition window.
Using a 'Migration Finalizer' to prevent the scheduler from reverting ResourceBinding changes is a non-standard use of finalizers, which are typically for blocking deletion. A more idiomatic approach in Karmada would be to use the spec.suspension.scheduling field in the ResourceBinding to pause scheduler reconciliation during the migration window. This avoids the need for custom scheduler logic to respect a specific finalizer.
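For reference, the idiomatic alternative could look like the following, assuming the `spec.suspension` field available on `ResourceBinding` in recent Karmada releases (field availability depends on version; names of clusters are placeholders):

```yaml
# Pausing scheduler reconciliation during the migration window instead of
# relying on a custom finalizer. Assumes ResourceBinding.spec.suspension
# is available in the Karmada version in use.
apiVersion: work.karmada.io/v1alpha2
kind: ResourceBinding
metadata:
  name: my-flink-job-flinkdeployment
spec:
  suspension:
    scheduling: true   # scheduler stops revising placement while set
  clusters:
    - name: source-cluster
      replicas: 1
    - name: target-cluster
      replicas: 1
```

The controller would set `scheduling: true` before expanding `spec.clusters` and clear it once the migration reaches a terminal state.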
Pull request overview
This pull request introduces a Kubernetes Enhancement Proposal (KEP) for "Federated Stateful Rollout," a feature that enables coordinated, zero-downtime migrations of stateful workloads (specifically Apache Flink) across multiple Karmada clusters. The proposal describes a 7-phase state machine that automates the "Suspend-Capture-Resume" lifecycle, leveraging synchronous Savepoints to preserve state during migrations while maintaining a "hot standby" on the source cluster until the target is confirmed healthy.
Changes:
- Introduces a new proposal document outlining the Federated Stateful Rollout feature design
- Describes a state machine-based orchestration approach using ResourceBinding manipulation and ClusterOverridePolicy
- Specifies migration trigger mechanisms (cluster taints and annotations) and state transition phases
> ## Design Details
The document is missing important sections commonly found in KEP proposals. Specifically: 1) An API Design section showing concrete changes to CRDs or new resource types, 2) Implementation Details explaining how the ResourceInterpreter will be leveraged for state discovery, 3) An explicit feature gate definition for StatefulMigration. While the high-level design is clear, these details are important for implementation and review.
> A developer updates the `FlinkDeployment` image. The controller orchestrates a Blue-Green move. If the new image fails to initialize in the target cluster, the controller aborts and resumes the original job in the source cluster, providing an automated safety net.
>
> ## Risks and Mitigations
>
> - Risk: Split-Brain. Multiple clusters writing to the same sink.
> - Mitigation: Strict "Suspend-before-Resume" sequence confirmed via `ResourceInterpreter` status aggregation.
In Story 2 (Atomic Image Upgrade), the document states the controller will "abort and resumes the original job in the source cluster" if the new image fails. However, the 7-phase state machine in the Transition State Machine table does not include an abort/rollback phase. How is the rollback sequence implemented if the target cluster fails after the Hold override is applied?
Suggested change (replacing the quoted passage above):

> A developer updates the `FlinkDeployment` image. The controller orchestrates a Blue-Green move. If the new image fails to initialize in the target cluster after the source cluster has already been placed in `suspended` state, the controller performs an explicit abort sequence rather than advancing to cutover: (1) keep the source job suspended, (2) mark the target startup/restore as failed, (3) delete or scale down the target workload so no writer remains active there, (4) remove the temporary Hold/Suspend override from the source (or restore the pre-migration source spec), and then (5) resume the original job in the source cluster. This rollback is an error transition from target bring-up/validation, not a separate steady-state phase, and it preserves the same single-writer invariant as the forward path.
>
> ## Risks and Mitigations
>
> - Risk: Split-Brain. Multiple clusters writing to the same sink.
> - Mitigation: Strict "Suspend-before-Resume" sequence confirmed via `ResourceInterpreter` status aggregation.
> - Mitigation: Rollback follows the inverse ordering: "Target-stopped-before-Source-resume". The controller must observe the target as failed and inactive before removing the source suspend override.
> In environments with `maxGroups: 1`, the controller must manually expand the `ResourceBinding` to allow coexistence during the handoff. A Migration Finalizer is added to prevent the Scheduler from reverting the expansion during the transition window.
>
> ```yaml
> # Internal ResourceBinding Patch
> spec:
>   clusters:
>     - name: spaas-kaas-tt-dev02 # Source
>       replicas: 1
>     - name: spaas-kaas-pw-dev02 # Target
> ```
The ResourceBinding Manipulation section mentions that the controller manually expands the ResourceBinding and adds a "Migration Finalizer". However, the concrete finalizer name and its cleanup logic are not defined. What is the exact finalizer name and how does it interact with the Scheduler's scheduling decisions?
Suggested change:

> In environments with `maxGroups: 1`, the controller must manually expand the `ResourceBinding` to allow coexistence during the handoff. To make this behavior explicit and deterministic, the controller adds the finalizer `rollout.karmada.io/migration-protect` to the `ResourceBinding` before patching `spec.clusters`.
>
> While `rollout.karmada.io/migration-protect` is present, the rollout controller treats the manually expanded binding as authoritative for the duration of the migration and MUST ignore Scheduler decisions that would collapse the binding back to the normal single-cluster placement implied by `maxGroups: 1`. In other words, the Scheduler may continue to compute its preferred placement, but the controller MUST NOT apply a scheduling update that removes either the source or target cluster until the migration has reached a terminal state.
>
> Cleanup logic:
>
> - Success path: remove `rollout.karmada.io/migration-protect` only after the source workload is confirmed suspended or terminated, the target workload is confirmed healthy/running, and the binding has been reconciled back to the desired steady-state single-cluster placement.
> - Rollback path: if target startup fails, restore the original single-cluster placement on the source cluster first, then remove `rollout.karmada.io/migration-protect`.
> - While the finalizer is present, any Scheduler recommendation that conflicts with the temporary dual-cluster expansion is deferred rather than applied.
>
> ```yaml
> # Internal ResourceBinding Patch
> metadata:
>   finalizers:
>     - rollout.karmada.io/migration-protect
> spec:
>   clusters:
>     - name: spaas-kaas-tt-dev02 # Source
>       replicas: 1
>     - name: spaas-kaas-pw-dev02 # Target
> ```
> ### Transition State Machine
>
> | Phase | Action | Visibility |
> | :--- | :--- | :--- |
> | **1. Trigger** | User taints a cluster or adds a migration annotation. | User-Visible |
> | **2. Discovery** | Controller identifies active cluster via `ResourceBinding` status. | Transparent |
> | **3. Expansion** | Controller patches `ResourceBinding` (replicas: 2) and adds a Finalizer. | Transparent |
> | **4. Hold** | Controller applies `ClusterOverridePolicy` to Target (state: `suspended`). | Transparent |
> | **5. Capture** | Controller patches Source to `suspended`, triggers Savepoint. | Transparent |
> | **6. Handoff** | Controller injects Savepoint URL into Target Override and flips to `running`. | Transparent |
> | **7. Cleanup** | Controller removes Source from `ResourceBinding` and deletes Overrides. | Transparent |
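The seven phases in the table could be encoded as an explicit state machine in the controller. A minimal Go sketch, with a terminal `Failed` state added for the abort path discussed later in review (all identifiers are illustrative, not from the proposal):

```go
package main

import "fmt"

// Phase enumerates the migration states from the table above, plus a
// terminal Failed state for the abort/rollback path.
type Phase int

const (
	Trigger Phase = iota
	Discovery
	Expansion
	Hold
	Capture
	Handoff
	Cleanup
	Failed
)

// next defines the forward transitions; every non-terminal phase may also
// transition to Failed, which triggers the rollback sequence.
var next = map[Phase]Phase{
	Trigger:   Discovery,
	Discovery: Expansion,
	Expansion: Hold,
	Hold:      Capture,
	Capture:   Handoff,
	Handoff:   Cleanup,
}

// Advance returns the next phase when the current step succeeded (ok),
// or Failed otherwise. Cleanup and Failed are terminal.
func Advance(p Phase, ok bool) Phase {
	if !ok {
		return Failed
	}
	if n, exists := next[p]; exists {
		return n
	}
	return p
}

func main() {
	p := Trigger
	for p != Cleanup {
		p = Advance(p, true)
	}
	fmt.Println(p == Cleanup) // true
}
```

Making the `Failed` transitions explicit forces every phase to define its own rollback behavior, which directly addresses the missing-abort-phase question raised in review.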
In the Transition State Machine table, Phase 2 "Discovery" identifies the active cluster via "ResourceBinding status," but Phase 3 "Expansion" modifies the ResourceBinding to include both source and target clusters (replicas: 2). How does the controller distinguish between the current/source cluster (running the job) and the target cluster (where to migrate) during the Discovery phase if both are listed in ResourceBinding.spec.clusters?
> # Federated Stateful Rollout: Coordinated Blue-Green Migration for Flink
>
> ## Summary
>
> The **Federated Stateful Rollout** feature introduces a proactive orchestration mechanism for stateful workloads across multiple clusters. While the existing `StatefulFailover` handles unplanned outages (reactive), this feature manages planned operations such as regional rebalancing, cluster maintenance, and safe image upgrades. By coordinating a "Suspend-Capture-Resume" lifecycle, it ensures **Zero Data Loss** and eliminates **Reprocessing Lag** by utilizing synchronous Savepoints.
There's an apparent contradiction regarding operator scope. The PR description claims the controller is "operator-agnostic" and "leverages the ResourceInterpreter framework to 'read' savepoint metadata from Flink status fields without hardcoding Flink logic." However, the entire proposal document (title, examples, user stories) is exclusively focused on Flink/FlinkDeployment. If the feature is meant to be operator-agnostic, the proposal should include generic abstractions and examples for other streaming frameworks (e.g., Spark).
> - Mitigation: Strict "Suspend-before-Resume" sequence confirmed via `ResourceInterpreter` status aggregation.
The "Risks and Mitigations" section identifies a critical risk: "Split-Brain - Multiple clusters writing to the same sink." However, the proposed mitigation ("Strict 'Suspend-before-Resume' sequence confirmed via ResourceInterpreter status aggregation") lacks concrete detail. How exactly does the ResourceInterpreter verify that the source is fully suspended before the target resumes? What happens if there's a race condition or network delay during the transition?
Suggested change:

> - Mitigation: Strict "Suspend-before-Resume" sequence confirmed via `ResourceInterpreter` status aggregation. The handoff is fail-closed:
>   1. The controller first applies the source override (`/spec/job/state: suspended`) and records the source object's `metadata.generation`.
>   2. `ResourceInterpreter` reports the source as safe-to-handoff only when the source-cluster status has reconciled that exact intent, i.e. `status.observedGeneration >= metadata.generation` and the workload-specific status indicates the job is fully suspended/quiesced (for Flink, no running job remains for that deployment and no further writes are expected from the source).
>   3. Only after that fresh source acknowledgement is aggregated does the controller remove the suspend on the target cluster.
>   4. If the aggregated status is missing, stale, contradictory, or does not arrive before timeout, the controller keeps the target suspended and retries or aborts the migration; it must never resume the target on assumption alone.
> - Mitigation (race conditions / network delay): status is evaluated against generation, not wall-clock ordering. A delayed "suspended" report for an older generation is ignored, and loss of connectivity to the source cluster is treated as "not yet safe to resume target". This guarantees that delayed reconciliation can only prolong the handoff, never produce dual writers.
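The generation-gated check in step 2 can be expressed as a small predicate. A sketch under the assumption that the aggregated status exposes an observed generation, a suspended flag, and a freshness signal (all field and type names are illustrative):

```go
package main

import "fmt"

// SourceStatus is an illustrative aggregation of the source cluster's
// workload status as collected via the ResourceInterpreter.
type SourceStatus struct {
	ObservedGeneration int64 // generation the member cluster has reconciled
	Suspended          bool  // workload-specific "fully quiesced" signal
	Fresh              bool  // false if the status report is stale or missing
}

// SafeToHandoff returns true only when the source cluster has acknowledged
// the exact suspend intent (by generation) with a fresh status report.
// Missing, stale, or older-generation reports keep the target suspended.
func SafeToHandoff(intentGeneration int64, s *SourceStatus) bool {
	if s == nil || !s.Fresh {
		return false // lost connectivity => not safe to resume target
	}
	if s.ObservedGeneration < intentGeneration {
		return false // delayed report for an older generation is ignored
	}
	return s.Suspended
}

func main() {
	// A delayed "suspended" report for generation 4 must not unblock a
	// handoff whose suspend intent was recorded at generation 5.
	stale := &SourceStatus{ObservedGeneration: 4, Suspended: true, Fresh: true}
	fresh := &SourceStatus{ObservedGeneration: 5, Suspended: true, Fresh: true}
	fmt.Println(SafeToHandoff(5, stale), SafeToHandoff(5, fresh)) // false true
}
```

Because every failure mode collapses to `false`, the predicate is fail-closed by construction: the only way to resume the target is a fresh, generation-matching, suspended acknowledgement.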
> - Manual Scripting: Rejected as error-prone and unsafe for Exactly-Once requirements.
> - New Federated CRD: Rejected to avoid API sprawl. Using standard `FlinkDeployment` + Karmada Overrides is more sustainable.
The proposal document is missing a "Test Plan" section, which is typically included in KEP proposals to outline the testing strategy. While it's noted as "Not required until targeted at a release," including at least a basic test plan would strengthen the proposal and clarify how the feature will be validated.
Suggested addition:

> ## Test Plan
>
> - Unit Tests:
>   - Validate the controller state machine for Suspend-before-Resume sequencing.
>   - Verify `ResourceBinding` expansion and finalizer handling during migration.
>   - Verify rollback behavior when target cluster initialization or restore fails.
> - Integration Tests:
>   - Simulate a planned migration between source and target clusters using a Savepoint and verify successful resume in the target cluster.
>   - Verify that only one active workload writes to the sink at any time to prevent split-brain.
>   - Validate cleanup of temporary `ResourceBinding` changes after migration completion or abort.
> - Manual / End-to-End Validation:
>   - Perform a cluster maintenance scenario and confirm the workload is resumed in the target cluster with expected state continuity.
>   - Perform an image upgrade scenario and confirm automatic rollback when the target deployment fails to become healthy.
>   - Confirm status conditions and events clearly reflect migration progress, success, and failure states.
**Codecov Report** ✅ All modified and coverable lines are covered by tests.

```
@@ Coverage Diff @@
##           master    #7367      +/-   ##
==========================================
+ Coverage   42.04%   42.07%   +0.02%
==========================================
  Files         874      874
  Lines       53544    53545       +1
==========================================
+ Hits        22515    22529      +14
+ Misses      29341    29320      -21
- Partials     1688     1696       +8
```
Commits:
- "…ration)" — Signed-off-by: liwang0513 <liwang0513@gmail.com> (force-pushed 1b4f89e → b71da42)
- Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>, Signed-off-by: liwang0513 <liwang0513@gmail.com> (force-pushed 73a2a05 → 8c7c45f)
Keywords which can automatically close issues and at(@) or hashtag(#) mentions are not allowed in commit messages. The list of commits with invalid commit messages:
What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR introduces the `StatefulMigrationController` within the `karmada-controller-manager` to enable coordinated, stateful migrations for stateful applications (Apache Flink as an example).

Why we need it:
- Planned migrations currently require manual, error-prone coordination of `ResourceBindings` and `Overrides`.
- The source cluster is kept as a hot standby until the target job is `RUNNING`, providing a safety net for critical stateful jobs.

Which issue(s) this PR fixes:

Fixes #7291

Special notes for your reviewer:
- Operator-agnostic core: the controller leverages the `ResourceInterpreter` framework to "read" savepoint metadata from Flink status fields without hardcoding Flink logic into the core.
- To satisfy `maxGroups: 1` constraints during migration, the controller temporarily expands the `ResourceBinding.spec.clusters` list. A Migration Finalizer is implemented to prevent the Scheduler from reverting this expansion until the handoff is complete.
- A `ClusterOverridePolicy` keeps the target instance `suspended` until the captured Savepoint URL is injected.

Does this PR introduce a user-facing change?:

Introduced the `Federated Stateful Rollout` feature (protected by the `StatefulMigration` feature gate). This allows users to perform coordinated, zero-backlog migrations of Flink workloads across clusters by using the `karmada.io/migration-target` annotation or cluster taints.
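For illustration, triggering a migration via the annotation named in the release note might look like the following; the workload name and the annotation value format (a bare cluster name) are assumptions for this sketch:

```yaml
# Hypothetical trigger: annotating the federated workload with the desired
# target cluster. The annotation key comes from the release note; the value
# format and resource names are assumed for illustration.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: my-flink-job
  annotations:
    karmada.io/migration-target: target-cluster
spec:
  job:
    state: running
    upgradeMode: savepoint
```

Alternatively, per the same release note, tainting the source cluster would trigger the controller without touching the workload object itself.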