
scheduler: requeue unschedulable bindings on cluster status changes#7369

Open
Tej-Katika wants to merge 1 commit into karmada-io:master from Tej-Katika:fix/scheduler-requeue-on-cluster-status-change

Conversation

@Tej-Katika
Contributor

What this PR does / why we need it

When PriorityBasedScheduling is enabled, bindings that fail with an UnschedulableError are placed in unschedulableBindings and only flushed back to activeQ by a 5-minute timer. This happens because updateCluster reacts only to ClusterSpec changes: ClusterStatus fields never increment Generation, so status-only updates are silently dropped.

Two concrete problems this fixes:

  1. ResourceSummary / Conditions — after a cluster frees capacity or transitions to Ready, affected bindings wait up to 5 minutes before retry instead of ~10 seconds.
  2. APIEnablements — after a CRD is installed on a member cluster, bindings stuck with "0/N clusters: API resource not found" are never retried at all (workaround today is to manually patch spec.rescheduleTriggeredAt).

Changes

  • updateCluster: adds a new case to the existing switch that calls clusterReconcileWorker.Add(newCluster) when ResourceSummary, Conditions, or APIEnablements change. This reuses the existing reconcileCluster → enqueueAffectedBindings → priorityQueue.Push → moveToActiveQ path, which already handles moving bindings out of unschedulableBindings.
  • addCluster: adds clusterReconcileWorker.Add(cluster) so bindings stuck waiting for a new cluster are retried immediately on join rather than after 5 minutes.

No changes to the SchedulingQueue interface are needed.

Which issue(s) this PR fixes

Part of #7344

Prerequisites

#7340 (already merged) — fixes %w wrapping so errors.As(err, &unschedulableErr) correctly identifies the error, ensuring bindings land in unschedulableBindings rather than backoffQ.

Special notes for your reviewer

The new status-change case is intentionally placed after the generation/labels cases in the switch. If spec and status change simultaneously, the generation case fires and the existing path already requeues affected bindings — no double-trigger.

Does this PR introduce a user-facing change?

NONE

@karmada-bot karmada-bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. labels Apr 8, 2026
@karmada-bot karmada-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 8, 2026
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a critical issue where the scheduler's unschedulableBindings queue was not promptly reacting to changes in cluster status, leading to significant delays in retrying bindings. By introducing logic to trigger cluster reconciliation upon relevant status updates and new cluster additions, the system ensures that bindings are requeued and rescheduled much more efficiently, improving overall scheduling responsiveness and reducing bottlenecks.

Highlights

  • Improved Binding Requeue on Cluster Status Changes: Bindings that become unschedulable due to cluster status changes (e.g., ResourceSummary, Conditions, APIEnablements) will now be requeued immediately instead of waiting for a 5-minute timer, significantly reducing retry delays.
  • Immediate Requeue for Newly Added Clusters: Bindings waiting for a new cluster to join will now be retried immediately upon the cluster's addition, rather than experiencing a delay.

@Tej-Katika Tej-Katika force-pushed the fix/scheduler-requeue-on-cluster-status-change branch from e9e4414 to a2fea2f Compare April 8, 2026 00:22
@karmada-bot karmada-bot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Apr 8, 2026

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the scheduler's event handler to trigger cluster reconciliation when a cluster is added or when specific status fields, such as ResourceSummary, Conditions, and APIEnablements, are modified. It also includes comprehensive updates to the unit tests to verify these new reconciliation triggers. A review comment suggests that triggering a full reconciliation on every ResourceSummary update could lead to performance issues in large-scale environments, recommending that these updates be throttled or made more selective to reduce overhead.

Comment thread pkg/scheduler/event_handler.go Outdated
Comment on lines +310 to +313
case !equality.Semantic.DeepEqual(oldCluster.Status.ResourceSummary, newCluster.Status.ResourceSummary) ||
    !equality.Semantic.DeepEqual(oldCluster.Status.Conditions, newCluster.Status.Conditions) ||
    !equality.Semantic.DeepEqual(oldCluster.Status.APIEnablements, newCluster.Status.APIEnablements):
    s.clusterReconcileWorker.Add(newCluster)

medium

Triggering a full reconciliation of all bindings on every ResourceSummary update might lead to performance overhead in large-scale environments. In Karmada, ResourceSummary (specifically the Allocated field) can change frequently as pods are scheduled or removed in member clusters. Each such update now triggers enqueueAffectedBindings, which performs a full list and scan of all ResourceBinding and ClusterResourceBinding objects.

Consider if it's possible to:

  1. Throttle these updates in the clusterReconcileWorker.
  2. Be more selective: For example, only trigger a requeue when resources are freed (i.e., Allocated decreases or Allocatable increases), as those are the cases where an unschedulable binding is most likely to now fit.

@codecov-commenter

codecov-commenter commented Apr 8, 2026


Codecov Report

❌ Patch coverage is 61.53846% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.94%. Comparing base (773dcf0) to head (03485e0).
⚠️ Report is 34 commits behind head on master.

Files with missing lines | Patch % | Lines
pkg/scheduler/internal/queue/scheduling_queue.go | 0.00% | 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7369      +/-   ##
==========================================
- Coverage   42.15%   41.94%   -0.22%     
==========================================
  Files         875      879       +4     
  Lines       53618    54339     +721     
==========================================
+ Hits        22602    22790     +188     
- Misses      29315    29828     +513     
- Partials     1701     1721      +20     
Flag | Coverage Δ
unittests | 41.94% <61.53%> (-0.22%) ⬇️


@Tej-Katika Tej-Katika force-pushed the fix/scheduler-requeue-on-cluster-status-change branch from a2fea2f to d25796a Compare April 8, 2026 01:24
@Tej-Katika Tej-Katika marked this pull request as ready for review April 8, 2026 02:26
Copilot AI review requested due to automatic review settings April 8, 2026 02:26
@karmada-bot karmada-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 8, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR implements event-driven requeue for unschedulable bindings when cluster status changes occur, complementing PR #7340 which fixed error type propagation. The changes enable the scheduler to immediately reprocess bindings that are waiting due to cluster resource unavailability or missing API enablements, instead of waiting up to 5 minutes for a timer-based flush.

Changes:

  • Modify addCluster to always queue cluster reconciliation, triggering immediate requeue of affected bindings when a new cluster joins
  • Add a new status-change case to updateCluster to detect ResourceSummary, Conditions, or APIEnablements changes and trigger cluster reconciliation
  • Expand test coverage to validate both estimator and reconcile worker behavior in cluster event handlers

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
pkg/scheduler/event_handler.go Adds cluster reconciliation to addCluster and status change detection to updateCluster switch statement
pkg/scheduler/event_handler_test.go Refactors and expands tests for cluster event handlers to cover new status change handling and worker interactions


@whitewindmills
Member

/assign

Contributor

@seanlaii seanlaii left a comment


Hi @Tej-Katika , thanks for the PR!
/assign

Contributor

@seanlaii seanlaii left a comment


One thing I'm slightly concerned about is the CPU cost when ResourceSummary changes frequently — since it updates every 10s by default and enqueueAffectedBindings does a full scan of all bindings, we could end up with a lot of unnecessary work: listing all bindings, doing affinity matching, and requeueing bindings that will just be popped and short-circuited in doScheduleBinding without any actual scheduling needed.

What do you think about adding a guard so we only trigger reconciliation when there are bindings that could actually benefit from it? Something like:

case !equality.Semantic.DeepEqual(oldCluster.Status.Conditions, newCluster.Status.Conditions) ||
    !equality.Semantic.DeepEqual(oldCluster.Status.APIEnablements, newCluster.Status.APIEnablements):
    s.clusterReconcileWorker.Add(newCluster)
case !equality.Semantic.DeepEqual(oldCluster.Status.ResourceSummary, newCluster.Status.ResourceSummary):
    if features.FeatureGate.Enabled(features.PriorityBasedScheduling) && s.priorityQueue.HasUnschedulableBindings() {
        s.clusterReconcileWorker.Add(newCluster)
    }

This way Conditions/APIEnablements changes (low frequency, semantically significant) always trigger reconciliation, while ResourceSummary changes (high frequency, often just noise) only trigger it when there are unschedulable bindings that might actually benefit from rescheduling.

@seanlaii
Contributor

Also, a few edge cases around the switch-case ordering might be worth adding as tests to guard against future regressions. For example:

  1. Generation + status change simultaneously — generation case should take precedence (expect 2 adds, not 3)
  2. DeletionTimestamp + status change simultaneously — deletion case should take precedence (expect 1 add)
  3. Identical non-nil status — DeepEqual returns true, should not trigger reconcile (expect 0 adds)

These would catch accidental reordering of the switch cases in the future.

for _, tt := range tests {
    t.Run(tt.name, func(t *testing.T) {
        mockWorker := &mockAsyncWorker{}
        estimatorWorker := &mockAsyncWorker{}
Contributor


Could you elaborate the reason of validating estimator here?

Contributor


It seems that this is not related to this PR, but it is a good enhancement. Maybe we can separate this to a new PR. WDYT?

Contributor Author


The original TestAddCluster already validated the estimator worker using a single mockWorker. When this PR introduced clusterReconcileWorker.Add(cluster) in addCluster, I needed a separate mock for each worker to independently assert their call counts — otherwise the counts would be conflated across both workers. Keeping the estimator assertions in the same test ensures the refactoring didn't accidentally break the existing estimator path while adding the reconcile path. It tests the full behavior of addCluster in one place rather than leaving the estimator path uncovered.

Contributor Author


Yeah, that's a fair point. The worker split was structurally necessary to test the new reconcile call independently, but the new estimator else-branch assertions (checking addCount == 0 when estimator is disabled) are purely defensive additions that weren't in the original test. I can strip those back out to keep the diff tightly scoped to this PR. I'll revert the estimator side to match the original assertion style, and only keep what's needed to validate the new clusterReconcileWorker.Add behavior.

@seanlaii
Contributor

cc @RainbowMango @zhzhuang-zju to take a look as well. Thanks!

@Tej-Katika
Contributor Author

ResourceSummary is updated by the cluster-status-controller on every sync cycle (~10s by default), so unconditionally running enqueueAffectedBindings on each update would be expensive in large-scale clusters with many bindings.
The proposed split makes sense. One thing to flag before implementing it: HasUnschedulableBindings() doesn't currently exist on the SchedulingQueue interface or prioritySchedulingQueue. We'd need to add it — something like:
// in SchedulingQueue interface:
HasUnschedulableBindings() bool

// on prioritySchedulingQueue:
func (bq *prioritySchedulingQueue) HasUnschedulableBindings() bool {
    bq.lock.RLock()
    defer bq.lock.RUnlock()
    return bq.unschedulableBindings.Len() > 0
}
Alternatively, since s.priorityQueue is already nil when PriorityBasedScheduling is disabled (it's only initialized inside the feature gate check in New()), the guard could be written as:

case !equality.Semantic.DeepEqual(oldCluster.Status.ResourceSummary, newCluster.Status.ResourceSummary):
    if s.priorityQueue != nil && s.priorityQueue.HasUnschedulableBindings() {
        s.clusterReconcileWorker.Add(newCluster)
    }
This is equivalent to the feature gate check but avoids importing the features package in event_handler.go. Does adding HasUnschedulableBindings() to the SchedulingQueue interface sound reasonable, or would you prefer a different approach?
I'll update the PR with the split cases + the new method once you confirm the direction.

@seanlaii
Contributor

seanlaii commented Apr 18, 2026

ResourceSummary is updated by the cluster-status-controller on every sync cycle (~10s by default), so unconditionally running enqueueAffectedBindings on each update would be expensive in large-scale clusters with many bindings. The proposed split makes sense. One thing to flag before implementing it: HasUnschedulableBindings() doesn't currently exist on the SchedulingQueue interface or prioritySchedulingQueue. We'd need to add it — something like: // in SchedulingQueue interface: HasUnschedulableBindings() bool

Adding HasUnschedulableBindings() sounds good to me.

if s.priorityQueue != nil

Make sense to me.

@karmada-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from seanlaii. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Tej-Katika Tej-Katika force-pushed the fix/scheduler-requeue-on-cluster-status-change branch 2 times, most recently from 9a1096f to 8c15034 Compare April 18, 2026 20:04
@Tej-Katika
Contributor Author

cc @seanlaii @RainbowMango @zhzhuang-zju

Please take a look. Thanks!

Comment thread pkg/scheduler/event_handler.go Outdated
if s.enableSchedulerEstimator {
    s.schedulerEstimatorWorker.Add(cluster.Name)
}
s.clusterReconcileWorker.Add(cluster)
Member


addCluster is invoked for existing Cluster objects during informer initial cache population, not only for real new-cluster joins. This unconditional enqueue means a scheduler restart queues one cluster reconciliation per existing cluster; after cache sync each item scans all ResourceBindings/ClusterResourceBindings and may requeue matching bindings even though no cluster changed. In large installations this can turn startup into O(clusters * bindings) reconcile work. Consider gating this like the ResourceSummary path, for example only enqueueing when s.priorityQueue != nil && s.priorityQueue.HasUnschedulableBindings(), or otherwise distinguishing real post-start joins from initial informer replay.

Contributor Author


Addressed in 843cd37b — gated on s.priorityQueue != nil && s.priorityQueue.HasUnschedulableBindings() and dropped the unconditional reconcile-worker enqueue in favor of MoveAllToActive().

// Len returns the length of activeQ.
Len() int

// HasUnschedulableBindings reports whether the unschedulableBindings sub-queue is non-empty.
Member


Emm, to be precise, it is not a queue.

Comment thread pkg/scheduler/event_handler.go Outdated
    s.clusterReconcileWorker.Add(newCluster)
case !equality.Semantic.DeepEqual(oldCluster.Status.ResourceSummary, newCluster.Status.ResourceSummary):
    if s.priorityQueue != nil && s.priorityQueue.HasUnschedulableBindings() {
        s.clusterReconcileWorker.Add(newCluster)
Member


can we focus on unschedulableBindings instead of a full scan of all bindings?

Contributor Author


Both the Conditions/APIEnablements case and the ResourceSummary case now call MoveAllToActive() directly, so we no longer fan out into a full binding scan via clusterReconcileWorker.Add → enqueueAffectedBindings.

@Tej-Katika Tej-Katika force-pushed the fix/scheduler-requeue-on-cluster-status-change branch 2 times, most recently from 465e9fb to 843cd37 Compare April 29, 2026 04:45
@karmada-bot
Contributor

@Tej-Katika: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

Details

In response to this:

/retest


Len() int

// HasUnschedulableBindings reports whether the unschedulableBindings map is non-empty.
HasUnschedulableBindings() bool
Member


MoveAllToActive can be called safely, so do we still need this?

Contributor Author


Agreed. Since MoveAllToActive() is already safe to call when the unschedulable map is empty, the HasUnschedulableBindings() guard is redundant. I'll drop the interface method and its implementation, and simplify the three call sites in event_handler.go to just if s.priorityQueue != nil { s.priorityQueue.MoveAllToActive() }.

Contributor Author


Dropped HasUnschedulableBindings() and simplified the three call sites to if s.priorityQueue != nil { s.priorityQueue.MoveAllToActive() }. Updated the docstring on MoveAllToActive to note it's safe to call when the unschedulable map is empty.

When a cluster's status changes (conditions, API enablements, or
resource summary), previously unschedulable bindings may now be
schedulable. This commit adds event handlers so such changes flush
unschedulable bindings directly to activeQ via MoveAllToActive(),
avoiding a full scan of all ResourceBindings.

- addCluster: call MoveAllToActive() so bindings stuck waiting for
  a new cluster to join are retried immediately
- updateCluster: replace clusterReconcileWorker.Add with
  MoveAllToActive() for status-change cases; the direct flush is
  cheaper than enqueueing a full binding scan via reconcileWorker
- MoveAllToActive() is safe to call when unschedulableBindings is
  empty, so no separate guard is needed

Signed-off-by: Tejashwar Reddy Katika <tejashwar1029@gmail.com>
@Tej-Katika Tej-Katika force-pushed the fix/scheduler-requeue-on-cluster-status-change branch from 843cd37 to 03485e0 Compare April 30, 2026 13:10
