scheduler: requeue unschedulable bindings on cluster status changes #7369
Tej-Katika wants to merge 1 commit into karmada-io:master
Conversation
Summary of Changes: Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses a critical issue where the scheduler's unschedulable bindings are not requeued when a cluster's status changes, leaving them to wait for a periodic timer; it updates the cluster event handlers so that new clusters and relevant status changes trigger an immediate requeue of affected bindings.
Force-pushed: e9e4414 → a2fea2f
Code Review
This pull request updates the scheduler's event handler to trigger cluster reconciliation when a cluster is added or when specific status fields, such as ResourceSummary, Conditions, and APIEnablements, are modified. It also includes comprehensive updates to the unit tests to verify these new reconciliation triggers. A review comment suggests that triggering a full reconciliation on every ResourceSummary update could lead to performance issues in large-scale environments, recommending that these updates be throttled or made more selective to reduce overhead.
case !equality.Semantic.DeepEqual(oldCluster.Status.ResourceSummary, newCluster.Status.ResourceSummary) ||
	!equality.Semantic.DeepEqual(oldCluster.Status.Conditions, newCluster.Status.Conditions) ||
	!equality.Semantic.DeepEqual(oldCluster.Status.APIEnablements, newCluster.Status.APIEnablements):
	s.clusterReconcileWorker.Add(newCluster)
Triggering a full reconciliation of all bindings on every ResourceSummary update might lead to performance overhead in large-scale environments. In Karmada, ResourceSummary (specifically the Allocated field) can change frequently as pods are scheduled or removed in member clusters. Each such update now triggers enqueueAffectedBindings, which performs a full list and scan of all ResourceBinding and ClusterResourceBinding objects.
Consider if it's possible to:
- Throttle these updates in the clusterReconcileWorker.
- Be more selective: for example, only trigger a requeue when resources are freed (i.e., Allocated decreases or Allocatable increases), as those are the cases where an unschedulable binding is most likely to now fit (see the sketch below).
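As an illustration of the second bullet, here is a minimal sketch of such a check. It assumes ResourceSummary's Allocated/Allocatable fields are corev1.ResourceList maps, as in the Karmada Cluster API; the helper name resourcesFreed and its placement in the scheduler package are hypothetical and not part of this PR:

package scheduler

import (
	clusterv1alpha1 "github.com/karmada-io/karmada/pkg/apis/cluster/v1alpha1"
	"k8s.io/apimachinery/pkg/api/equality"
)

// resourcesFreed is a hypothetical helper (not part of this PR) that reports whether
// the new summary shows released capacity: some Allocated quantity decreased or some
// Allocatable quantity increased.
func resourcesFreed(oldSummary, newSummary *clusterv1alpha1.ResourceSummary) bool {
	if oldSummary == nil || newSummary == nil {
		// Fall back to "changed at all" when either side is missing.
		return !equality.Semantic.DeepEqual(oldSummary, newSummary)
	}
	for name, newQty := range newSummary.Allocated {
		if oldQty, ok := oldSummary.Allocated[name]; ok && newQty.Cmp(oldQty) < 0 {
			return true // workloads were removed in the member cluster
		}
	}
	for name, newQty := range newSummary.Allocatable {
		if oldQty, ok := oldSummary.Allocatable[name]; ok && newQty.Cmp(oldQty) > 0 {
			return true // the member cluster gained capacity
		}
	}
	return false
}

With a check like this, the ResourceSummary case could skip s.clusterReconcileWorker.Add(newCluster) when nothing was freed, at the cost of an extra per-update comparison.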
Codecov Report ❌
Additional details and impacted files:
@@ Coverage Diff @@
## master #7369 +/- ##
==========================================
- Coverage 42.15% 41.94% -0.22%
==========================================
Files 875 879 +4
Lines 53618 54339 +721
==========================================
+ Hits 22602 22790 +188
- Misses 29315 29828 +513
- Partials 1701 1721 +20
Force-pushed: a2fea2f → d25796a
Pull request overview
This PR implements event-driven requeue for unschedulable bindings when cluster status changes occur, complementing PR #7340 which fixed error type propagation. The changes enable the scheduler to immediately reprocess bindings that are waiting due to cluster resource unavailability or missing API enablements, instead of waiting up to 5 minutes for a timer-based flush.
Changes:
- Modify addCluster to always queue cluster reconciliation, triggering immediate requeue of affected bindings when a new cluster joins
- Add a new status-change case to updateCluster to detect ResourceSummary, Conditions, or APIEnablements changes and trigger cluster reconciliation
- Expand test coverage to validate both estimator and reconcile worker behavior in cluster event handlers
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| pkg/scheduler/event_handler.go | Adds cluster reconciliation to addCluster and status change detection to updateCluster switch statement |
| pkg/scheduler/event_handler_test.go | Refactors and expands tests for cluster event handlers to cover new status change handling and worker interactions |
/assign
seanlaii left a comment:
Hi @Tej-Katika, thanks for the PR!
/assign
seanlaii left a comment:
One thing I'm slightly concerned about is the CPU cost when ResourceSummary changes frequently — since it updates every 10s by default and enqueueAffectedBindings does a full scan of all bindings, we could end up with a lot of unnecessary work: listing all bindings, doing affinity matching, and requeueing bindings that will just be popped and short-circuited in doScheduleBinding without any actual scheduling needed.
What do you think about adding a guard so we only trigger reconciliation when there are bindings that could actually benefit from it? Something like:
case !equality.Semantic.DeepEqual(oldCluster.Status.Conditions, newCluster.Status.Conditions) ||
	!equality.Semantic.DeepEqual(oldCluster.Status.APIEnablements, newCluster.Status.APIEnablements):
	s.clusterReconcileWorker.Add(newCluster)
case !equality.Semantic.DeepEqual(oldCluster.Status.ResourceSummary, newCluster.Status.ResourceSummary):
	if features.FeatureGate.Enabled(features.PriorityBasedScheduling) && s.priorityQueue.HasUnschedulableBindings() {
		s.clusterReconcileWorker.Add(newCluster)
	}
This way Conditions/APIEnablements changes (low frequency, semantically significant) always trigger reconciliation, while ResourceSummary changes (high frequency, often just noise) only trigger it when there are unschedulable bindings that might actually benefit from rescheduling.
Also, a few edge cases around switch-case ordering might be worth adding to guard against future regressions; these would catch accidental reordering of the switch cases in the future.
for _, tt := range tests {
	t.Run(tt.name, func(t *testing.T) {
		mockWorker := &mockAsyncWorker{}
		estimatorWorker := &mockAsyncWorker{}
Could you elaborate on the reason for validating the estimator here?
It seems that this is not related to this PR, but it is a good enhancement. Maybe we can separate this into a new PR. WDYT?
The original TestAddCluster already validated the estimator worker using a single mockWorker. When this PR introduced clusterReconcileWorker.Add(cluster) in addCluster, I needed a separate mock for each worker to independently assert their call counts — otherwise the counts would be conflated across both workers. Keeping the estimator assertions in the same test ensures the refactoring didn't accidentally break the existing estimator path while adding the reconcile path. It tests the full behavior of addCluster in one place rather than leaving the estimator path uncovered.
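Purely to illustrate the conflation point, here is a self-contained toy (this is not the repository's mockAsyncWorker; the type and counter names below are simplified stand-ins):

package main

import "fmt"

// countingWorker is a simplified stand-in for an async-worker mock: it only counts Add calls.
type countingWorker struct{ addCount int }

func (w *countingWorker) Add(_ interface{}) { w.addCount++ }

func main() {
	// One mock per collaborator lets each path's call count be asserted on its own.
	estimatorWorker := &countingWorker{}
	reconcileWorker := &countingWorker{}

	// Pretend addCluster enqueued once to each worker.
	estimatorWorker.Add("member1")
	reconcileWorker.Add("member1")

	// With a single shared mock, both calls would land on one counter (2),
	// hiding which code path actually fired.
	fmt.Println(estimatorWorker.addCount, reconcileWorker.addCount) // prints: 1 1
}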
Yeah, that's a fair point. The worker split was structurally necessary to test the new reconcile call independently, but the new estimator else-branch assertions (checking addCount == 0 when estimator is disabled) are purely defensive additions that weren't in the original test. I can strip those back out to keep the diff tightly scoped to this PR. I'll revert the estimator side to match the original assertion style, and only keep what's needed to validate the new clusterReconcileWorker.Add behavior.
cc @RainbowMango @zhzhuang-zju to take a look as well. Thanks!
ResourceSummary is updated by the cluster-status-controller on every sync cycle (~10s by default), so unconditionally running enqueueAffectedBindings on each update would be expensive in large-scale clusters with many bindings.

// on prioritySchedulingQueue:
case !equality.Semantic.DeepEqual(oldCluster.Status.ResourceSummary, newCluster.Status.ResourceSummary):
Adding
Makes sense to me.
[APPROVALNOTIFIER] This PR is NOT APPROVED. It needs approval from an approver in each of the affected files; approvers can indicate their approval by writing /approve in a comment.
Force-pushed: 9a1096f → 8c15034
cc @seanlaii @RainbowMango @zhzhuang-zju, please take a look. Thanks!
if s.enableSchedulerEstimator {
	s.schedulerEstimatorWorker.Add(cluster.Name)
}
s.clusterReconcileWorker.Add(cluster)
addCluster is invoked for existing Cluster objects during informer initial cache population, not only for real new-cluster joins. This unconditional enqueue means a scheduler restart queues one cluster reconciliation per existing cluster; after cache sync each item scans all ResourceBindings/ClusterResourceBindings and may requeue matching bindings even though no cluster changed. In large installations this can turn startup into O(clusters * bindings) reconcile work. Consider gating this like the ResourceSummary path, for example only enqueueing when s.priorityQueue != nil && s.priorityQueue.HasUnschedulableBindings(), or otherwise distinguishing real post-start joins from initial informer replay.
Addressed in 843cd37b — gated on s.priorityQueue != nil && s.priorityQueue.HasUnschedulableBindings() and dropped the unconditional reconcile-worker enqueue in favor of MoveAllToActive().
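Roughly, the gated addCluster path described there might look like the sketch below; the surrounding function body is abridged and the identifiers follow the diff and comments in this thread, so treat it as an approximation rather than the exact patch:

// inside addCluster, after the existing estimator enqueue
if s.enableSchedulerEstimator {
	s.schedulerEstimatorWorker.Add(cluster.Name)
}
// Flush unschedulable bindings only when something is actually waiting, so the
// informer's initial replay on a scheduler restart does not fan out into extra work.
if s.priorityQueue != nil && s.priorityQueue.HasUnschedulableBindings() {
	s.priorityQueue.MoveAllToActive()
}

(A later exchange in this thread drops the HasUnschedulableBindings() guard, since MoveAllToActive() is a no-op when the unschedulable map is empty.)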
// Len returns the length of activeQ.
Len() int

// HasUnschedulableBindings reports whether the unschedulableBindings sub-queue is non-empty.
Emm, to be precise, it is not a queue.
	s.clusterReconcileWorker.Add(newCluster)
case !equality.Semantic.DeepEqual(oldCluster.Status.ResourceSummary, newCluster.Status.ResourceSummary):
	if s.priorityQueue != nil && s.priorityQueue.HasUnschedulableBindings() {
		s.clusterReconcileWorker.Add(newCluster)
Can we focus on unschedulableBindings instead of a full scan of all bindings?
Both the Conditions/APIEnablements case and the ResourceSummary case now call MoveAllToActive() directly, so we no longer fan out into a full binding scan via clusterReconcileWorker.Add → enqueueAffectedBindings.
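Concretely, the two status-change cases end up shaped roughly as below; this sketch uses the simplified call-site form agreed later in the thread, and the rest of updateCluster is omitted:

case !equality.Semantic.DeepEqual(oldCluster.Status.Conditions, newCluster.Status.Conditions) ||
	!equality.Semantic.DeepEqual(oldCluster.Status.APIEnablements, newCluster.Status.APIEnablements):
	// Low-frequency, semantically significant changes: flush unschedulable bindings to activeQ.
	if s.priorityQueue != nil {
		s.priorityQueue.MoveAllToActive()
	}
case !equality.Semantic.DeepEqual(oldCluster.Status.ResourceSummary, newCluster.Status.ResourceSummary):
	// High-frequency resource updates: the same direct flush, with no full binding scan
	// via clusterReconcileWorker.Add → enqueueAffectedBindings.
	if s.priorityQueue != nil {
		s.priorityQueue.MoveAllToActive()
	}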
Force-pushed: 465e9fb → 843cd37
@Tej-Katika: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.
Len() int

// HasUnschedulableBindings reports whether the unschedulableBindings map is non-empty.
HasUnschedulableBindings() bool
MoveAllToActive can be called safely, so do we still need this?
Agreed. Since MoveAllToActive() is already safe to call when the unschedulable map is empty, the HasUnschedulableBindings() guard is redundant. I'll drop the interface method and its implementation, and simplify the three call sites in event_handler.go to just if s.priorityQueue != nil { s.priorityQueue.MoveAllToActive() }.
Dropped HasUnschedulableBindings() and simplified the three call sites to if s.priorityQueue != nil { s.priorityQueue.MoveAllToActive() }. Updated the docstring on MoveAllToActive to note it's safe to call when the unschedulable map is empty.
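For reference, the interface method with the updated docstring could read roughly like this (wording approximate):

// MoveAllToActive moves all bindings currently held in unschedulableBindings back to activeQ.
// It is safe to call when the unschedulable map is empty (it simply does nothing), so callers
// only need a nil check on the queue itself.
MoveAllToActive()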
When a cluster's status changes (conditions, API enablements, or resource summary), previously unschedulable bindings may now be schedulable. This commit adds event handlers so such changes flush unschedulable bindings directly to activeQ via MoveAllToActive(), avoiding a full scan of all ResourceBindings.
- addCluster: call MoveAllToActive() so bindings stuck waiting for a new cluster to join are retried immediately
- updateCluster: replace clusterReconcileWorker.Add with MoveAllToActive() for status-change cases; the direct flush is cheaper than enqueueing a full binding scan via reconcileWorker
- MoveAllToActive() is safe to call when unschedulableBindings is empty, so no separate guard is needed

Signed-off-by: Tejashwar Reddy Katika <tejashwar1029@gmail.com>
Force-pushed: 843cd37 → 03485e0
What this PR does / why we need it
When PriorityBasedScheduling is enabled, bindings that fail with UnschedulableError are placed in unschedulableBindings and only flushed back to activeQ by a 5-minute timer. This happens because updateCluster only reacts to ClusterSpec changes — ClusterStatus fields never increment Generation, so they were silently dropped.

Two concrete problems this fixes:
1. When a cluster becomes Ready, affected bindings wait up to 5 minutes before retry instead of ~10 seconds.
2. Bindings failing with "0/N clusters: API resource not found" are never retried at all (the workaround today is to manually patch spec.rescheduleTriggeredAt).

Changes
- updateCluster: adds a new case to the existing switch that calls clusterReconcileWorker.Add(newCluster) when ResourceSummary, Conditions, or APIEnablements change. This reuses the existing reconcileCluster → enqueueAffectedBindings → priorityQueue.Push → moveToActiveQ path, which already handles moving bindings out of unschedulableBindings.
- addCluster: adds clusterReconcileWorker.Add(cluster) so bindings stuck waiting for a new cluster are retried immediately on join rather than after 5 minutes.

No changes to the SchedulingQueue interface are needed.

Which issue(s) this PR fixes
Part of #7344
Prerequisites
#7340 (already merged) — fixes %w wrapping so errors.As(err, &unschedulableErr) correctly identifies the error, ensuring bindings land in unschedulableBindings rather than backoffQ.

Special notes for your reviewer
The new status-change case is intentionally placed after the generation/labels cases in the switch. If spec and status change simultaneously, the generation case fires and the existing path already requeues affected bindings — no double-trigger. The ordering is sketched below.
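A sketch of that ordering follows. The pre-existing generation/labels conditions and their bodies are assumptions abridged from the description above; only the relative order of the cases matters here:

switch {
case oldCluster.Generation != newCluster.Generation:
	// Existing spec-change handling: this path already requeues affected bindings.
	s.clusterReconcileWorker.Add(newCluster)
case !equality.Semantic.DeepEqual(oldCluster.Labels, newCluster.Labels):
	// Existing label-change handling, unchanged by this PR.
	s.clusterReconcileWorker.Add(newCluster)
case !equality.Semantic.DeepEqual(oldCluster.Status.ResourceSummary, newCluster.Status.ResourceSummary) ||
	!equality.Semantic.DeepEqual(oldCluster.Status.Conditions, newCluster.Status.Conditions) ||
	!equality.Semantic.DeepEqual(oldCluster.Status.APIEnablements, newCluster.Status.APIEnablements):
	// New status-change case: reached only when no case above matched, so a simultaneous
	// spec+status update fires the generation case once rather than triggering twice.
	s.clusterReconcileWorker.Add(newCluster)
}

(In the later revisions discussed above, the body of the new case becomes a direct s.priorityQueue.MoveAllToActive() flush, but the ordering argument stays the same.)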
Does this PR introduce a user-facing change?