| 1 | +--- |
| 2 | +title: "KEP: Federated Stateful Rollout (Coordinated Blue-Green Migration)" |
| 3 | +authors: |
| 4 | + - "@liwang0513" |
| 5 | +reviewers: |
| 6 | + - "@RainbowMango" |
| 7 | + - "@XiShanYongYe-Chang" |
| 8 | + - "@zhzhuang-zju" |
| 9 | +approvers: |
| 10 | + - "@RainbowMango" |
| 11 | +creation-date: 2026-04-04 |
| 12 | +--- |
| 13 | + |
| 14 | +# Federated Stateful Rollout: Coordinated Blue-Green Migration for Flink |
| 15 | + |
| 16 | +## Summary |
| 17 | +The **Federated Stateful Rollout** feature introduces a proactive orchestration mechanism for stateful workloads across multiple clusters. The existing `StatefulFailover` reacts to unplanned outages; this feature instead manages planned operations such as regional rebalancing, cluster maintenance, and safe image upgrades. By coordinating a "Suspend-Capture-Resume" lifecycle built on synchronous Savepoints, it ensures **Zero Data Loss** and eliminates **Reprocessing Lag**. |
| 18 | + |
| 19 | +## Motivation |
| 20 | +Standard multi-cluster failover in Karmada currently faces three technical gaps for streaming applications: |
| 21 | +* **Reprocessing Lag:** Recovery from stale periodic checkpoints forces jobs to "catch up" on data backlogs, causing downstream latency. |
| 22 | +* **Topology Friction:** Image or DAG updates often break compatibility with old checkpoints; coordinated Savepoints are required for safe upgrades. |
| 23 | +* **Safety Gap:** There is no "Validation Gate." In standard failover, the source instance is often deleted before the target is confirmed healthy. |
| 24 | + |
| 25 | +### Goals |
| 26 | +* **Coordinated Handoff:** Ensure an atomic "baton pass" of state between clusters. |
| 27 | +* **Zero Reprocessing:** Use synchronous Savepoints to start the target exactly where the source stopped. |
| 28 | +* **Validation Gate:** Keep the source cluster as a "hot standby" until the target is `RUNNING`. |
| 29 | +* **Transparent Orchestration:** Automate the manipulation of `ResourceBindings` and `Overrides` within ClusterSets. |
| 30 | + |
| 31 | +### Non-Goals |
| 32 | +* Replacing reactive `StatefulFailover`. |
| 33 | +* Managing underlying storage (S3/GCS) bucket permissions. |
| 34 | + |
| 35 | +## Proposal: WorkloadTransitionController |
| 36 | +We propose a new controller in `karmada-controller-manager` that orchestrates the migration state machine. |
| 37 | + |
| 38 | +### Transition State Machine |
| 39 | +| Phase | Action | Visibility | |
| 40 | +| :--- | :--- | :--- | |
| 41 | +| **1. Trigger** | User taints a cluster or adds a migration annotation. | User-Visible | |
| 42 | +| **2. Discovery** | Controller identifies active cluster via `ResourceBinding` status. | Transparent | |
| 43 | +| **3. Expansion** | Controller patches `ResourceBinding` (replicas: 2) and adds a Finalizer. | Transparent | |
| 44 | +| **4. Hold** | Controller applies `ClusterOverridePolicy` to Target (state: `suspended`). | Transparent | |
| 45 | +| **5. Capture** | Controller patches Source to `suspended`, triggers Savepoint. | Transparent | |
| 46 | +| **6. Handoff** | Controller injects Savepoint URL into Target Override and flips to `running`. | Transparent | |
| 47 | +| **7. Cleanup** | Controller removes Source from `ResourceBinding` and deletes Overrides. | Transparent | |
| 48 | + |
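|  | +The Capture phase (step 5) could be expressed as an override on the Source cluster. This is a minimal sketch, not a definitive implementation: the policy name `flink-migration-capture-tt` is hypothetical, and it assumes the Flink Kubernetes Operator stops the job with a savepoint when a deployment whose `upgradeMode` is `savepoint` is suspended. The `replace` operator also assumes both fields are already set in the resource template. |
|  | + |
|  | +```yaml |
|  | +apiVersion: policy.karmada.io/v1alpha1 |
|  | +kind: ClusterOverridePolicy |
|  | +metadata: |
|  | +  name: flink-migration-capture-tt  # hypothetical name |
|  | +spec: |
|  | +  resourceSelectors: |
|  | +    - apiVersion: flink.apache.org/v1beta1 |
|  | +      kind: FlinkDeployment |
|  | +      name: hbase-demo |
|  | +      namespace: s-spaasapi |
|  | +  targetCluster: |
|  | +    clusterNames: ["spaas-kaas-tt-dev02"]  # Source cluster |
|  | +  overriders: |
|  | +    plaintext: |
|  | +      # Assumption: suspending with upgradeMode "savepoint" triggers a savepoint. |
|  | +      - path: "/spec/job/upgradeMode" |
|  | +        operator: replace |
|  | +        value: "savepoint" |
|  | +      - path: "/spec/job/state" |
|  | +        operator: replace |
|  | +        value: "suspended" |
|  | +``` |
|  | + |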
| 49 | +## Design Details |
| 50 | + |
| 51 | +### The "Hold" Pattern via ClusterOverridePolicy |
| 52 | +To ensure the target cluster does not start the job prematurely, the controller uses a `ClusterOverridePolicy` to "hold" the deployment in a suspended state while the `ResourceBinding` is expanded. |
| 53 | + |
| 54 | +**Example Hold Override:** |
| 55 | +```yaml |
| 56 | +apiVersion: policy.karmada.io/v1alpha1 |
| 57 | +kind: ClusterOverridePolicy |
| 58 | +metadata: |
| 59 | +  name: flink-migration-hold-pw |
| 60 | +spec: |
| 61 | +  resourceSelectors: |
| 62 | +    - apiVersion: flink.apache.org/v1beta1 |
| 63 | +      kind: FlinkDeployment |
| 64 | +      name: hbase-demo |
| 65 | +      namespace: s-spaasapi |
| 66 | +  targetCluster: |
| 67 | +    clusterNames: ["spaas-kaas-pw-dev02"] |
| 68 | +  overriders: |
| 69 | +    plaintext: |
| 70 | +      - path: "/spec/job/state" |
| 71 | +        operator: replace |
| 72 | +        value: "suspended" |
| 73 | +``` |
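|  | + |
|  | +For the Handoff phase, the same policy might later be rewritten to release the hold. The following is a sketch under stated assumptions rather than the controller's actual output: the savepoint location is a placeholder, and it assumes the operator honors `spec.job.initialSavepointPath` when the job first starts on the Target. |
|  | + |
|  | +```yaml |
|  | +apiVersion: policy.karmada.io/v1alpha1 |
|  | +kind: ClusterOverridePolicy |
|  | +metadata: |
|  | +  name: flink-migration-hold-pw |
|  | +spec: |
|  | +  resourceSelectors: |
|  | +    - apiVersion: flink.apache.org/v1beta1 |
|  | +      kind: FlinkDeployment |
|  | +      name: hbase-demo |
|  | +      namespace: s-spaasapi |
|  | +  targetCluster: |
|  | +    clusterNames: ["spaas-kaas-pw-dev02"] |
|  | +  overriders: |
|  | +    plaintext: |
|  | +      # Placeholder: the real value would be the path reported for the captured Savepoint. |
|  | +      - path: "/spec/job/initialSavepointPath" |
|  | +        operator: add |
|  | +        value: "s3://<bucket>/<savepoint-path>" |
|  | +      - path: "/spec/job/state" |
|  | +        operator: replace |
|  | +        value: "running" |
|  | +``` |
|  | + |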
| 74 | +### ResourceBinding Manipulation |
| 75 | +In environments where the spread constraint sets `maxGroups: 1`, the controller must manually expand the `ResourceBinding` so that the Source and Target clusters can coexist during the handoff. A migration finalizer is added to prevent the scheduler from reverting the expansion during the transition window. |
| 76 | + |
| 77 | +```yaml |
| 78 | +# Internal ResourceBinding Patch |
| 79 | +spec: |
| 80 | +  clusters: |
| 81 | +    - name: spaas-kaas-tt-dev02  # Source |
| 82 | +      replicas: 1 |
| 83 | +    - name: spaas-kaas-pw-dev02  # Target |
| 84 | +      replicas: 1 |
| 85 | +  replicas: 2 |
| 86 | +``` |
| 87 | + |
| 88 | +## User Stories |
| 89 | +### Story 1: Planned Cluster Maintenance (0 RPO) |
| 90 | +An SRE taints cluster `tt` for a Kubernetes upgrade. The controller detects the intent, captures a synchronous Savepoint in `tt`, and hands it to cluster `pw`. The job resumes in `pw` with zero backlog, maintaining real-time processing. |
| 91 | + |
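|  | +As an illustration of the trigger in this story, the taint could appear on the Karmada `Cluster` object roughly as follows. This is only a fragment and a sketch: it assumes `tt` refers to `spaas-kaas-tt-dev02`, the taint key is hypothetical, and this proposal does not specify which key or effect the controller would watch for. |
|  | + |
|  | +```yaml |
|  | +# Fragment of the Cluster object; other required spec fields are omitted. |
|  | +apiVersion: cluster.karmada.io/v1alpha1 |
|  | +kind: Cluster |
|  | +metadata: |
|  | +  name: spaas-kaas-tt-dev02 |
|  | +spec: |
|  | +  taints: |
|  | +    - key: example.io/planned-maintenance  # hypothetical key |
|  | +      effect: NoSchedule                   # assumed effect |
|  | +``` |
|  | + |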
| 92 | +### Story 2: Atomic Image Upgrade |
| 93 | +A developer updates the `FlinkDeployment` image. The controller orchestrates a Blue-Green move. If the new image fails to initialize in the target cluster, the controller aborts and resumes the original job in the source cluster, providing an automated safety net. |
| 94 | + |
| 95 | +## Risks and Mitigations |
| 96 | +- **Risk: Split-Brain.** Multiple clusters may write to the same sink at the same time. |
| 97 | + |
| 98 | +  - **Mitigation:** Enforce a strict "Suspend-before-Resume" sequence, with the Source's suspension confirmed via `ResourceInterpreter` status aggregation before the Target is resumed. |
| 99 | + |
| 100 | +## Alternatives Considered |
| 101 | +- **Manual Scripting:** Rejected as error-prone and unsafe for Exactly-Once requirements. |
| 102 | + |
| 103 | +- **New Federated CRD:** Rejected to avoid API sprawl. Using the standard `FlinkDeployment` together with Karmada override policies is more sustainable. |
| 104 | + |