feat(estimator): add NodeAutoscalerEstimator plugin #7376

Open
bluayer wants to merge 2 commits into karmada-io:master from bluayer:feature/node-autoscaler-estimator-plugin

Conversation


@bluayer bluayer commented Apr 9, 2026

What type of PR is this?

/kind feature

What this PR does / why we need it:

Adds a new opt-in estimator plugin, NodeAutoscalerEstimator, that includes node autoscaler (Karpenter) potential capacity in the MaxAvailableReplicas calculation.

The existing NodeResourceEstimator returns 0 when no nodes exist, which prevents dynamicWeight: AvailableReplicas from placing pods in clusters with Karpenter. Since no pods are created, Karpenter never provisions nodes.

NodeAutoscalerEstimator solves this by querying Karpenter NodePool spec.limits and NodeClaim status.allocatable to calculate remaining capacity. It also detects provisioning failures (e.g. ICE) and stops reporting capacity for failed NodePools until a recovery interval elapses.
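The capacity arithmetic described above can be sketched as follows. This is an illustrative sketch, not the PR's actual code: quantities are modeled as int64 milli-units (e.g. millicores) rather than `resource.Quantity`, and the function name is hypothetical.

```go
package main

import "fmt"

// remainingCapacity illustrates the idea: a NodePool's spec.limits caps
// total capacity, and the sum of its NodeClaims' status.allocatable is
// what has already been provisioned; the difference is what the
// autoscaler could still add.
func remainingCapacity(limits, provisioned map[string]int64) map[string]int64 {
	remaining := map[string]int64{}
	for name, limit := range limits {
		rem := limit - provisioned[name]
		if rem < 0 {
			rem = 0 // pool already at or over its limit; report nothing
		}
		remaining[name] = rem
	}
	return remaining
}

func main() {
	limits := map[string]int64{"cpu": 64000, "nvidia.com/gpu": 8000}
	provisioned := map[string]int64{"cpu": 16000, "nvidia.com/gpu": 2000}
	fmt.Println(remainingCapacity(limits, provisioned)) // map[cpu:48000 nvidia.com/gpu:6000]
}
```

Dynamic resource names (here `nvidia.com/gpu`) fall out of the map-based matching, which is why the plugin works with custom accelerators as well as CPU and memory.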

Key design decisions:

  • Opt-in only: not in the default registry. Activated via --plugins=NodeAutoscalerEstimator,ResourceQuotaEstimator
  • No Karpenter dependency: uses dynamic client to query NodePool/NodeClaim CRDs
  • Dynamic resource matching: works with GPU, CPU, memory, and custom accelerators
  • Backward compatible: without --plugins, estimator behavior is unchanged
  • Scheduler change: calAvailableReplicas() skips general-estimator when scheduler-estimator reports higher availability. This only triggers when NodeAutoscalerEstimator is active — with the default NodeResourceEstimator, scheduler-estimator always returns ≤ general-estimator, so the skip condition is never true.
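The skip condition in the last bullet can be sketched as follows. The names here are illustrative, not the PR's actual code; the real change lives in calAvailableReplicas() in pkg/scheduler/core/util.go.

```go
package main

import "fmt"

// pickAvailableReplicas sketches the precedence rule: when the
// scheduler-estimator (which may include autoscaler potential capacity)
// reports more replicas than the general-estimator, trust it and skip the
// general-estimator result. With the default NodeResourceEstimator the
// scheduler-estimator never exceeds the general-estimator, so behavior
// is unchanged.
func pickAvailableReplicas(schedulerEst, generalEst int32) int32 {
	if schedulerEst > generalEst {
		return schedulerEst // autoscaler potential capacity wins
	}
	return generalEst
}

func main() {
	fmt.Println(pickAvailableReplicas(10, 0)) // scale-from-zero: 10
	fmt.Println(pickAvailableReplicas(3, 5))  // default behavior: 5
}
```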

Which issue(s) this PR fixes:

Fixes #7375

Special notes for your reviewer:

E2E tested on EKS with Karmada v1.17.1:

  • Scale-from-zero with GPU (g6e.xlarge, g5.xlarge)
  • Cross-region replica distribution with dynamicWeight
  • ICE detection + descheduler convergence (no infinite loop)
  • Scale-out 1→4, scale-down + consolidation + scale-up
  • Test with vLLM inference workload (Qwen2.5-0.5B)
  • NodePool limit changes
  • CPU workload with multiple NodePools

29 unit tests covering plugin logic, failure detection, recovery, and resource matching.

Does this PR introduce a user-facing change?:

`karmada-scheduler-estimator`: Added `NodeAutoscalerEstimator` plugin and `--plugins` flag. When enabled, the estimator accounts for Karpenter NodePool capacity, allowing `dynamicWeight: AvailableReplicas` to work in scale-from-zero environments.

Copilot AI review requested due to automatic review settings April 9, 2026 04:48
@karmada-bot karmada-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 9, 2026
@karmada-bot
Contributor

Welcome @bluayer! It looks like this is your first PR to karmada-io/karmada 🎉

@karmada-bot karmada-bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 9, 2026
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the karmada-scheduler-estimator by introducing a new NodeAutoscalerEstimator plugin. This plugin addresses a critical limitation in environments utilizing node autoscalers like Karpenter, particularly in scale-from-zero scenarios where the existing estimator might report zero capacity. By intelligently assessing the potential capacity offered by Karpenter NodePools and incorporating failure detection, it ensures that workload scheduling decisions accurately reflect the cluster's true scaling capabilities, thereby improving resource utilization and enabling seamless scaling for dynamic workloads.

Highlights

  • New Estimator Plugin: Introduced NodeAutoscalerEstimator to account for Karpenter NodePool capacity in MaxAvailableReplicas calculations, addressing limitations in scale-from-zero environments.
  • Karpenter Integration: The plugin queries Karpenter NodePool spec.limits and NodeClaim status.allocatable to calculate remaining capacity, enabling dynamicWeight: AvailableReplicas to function correctly with Karpenter.
  • Failure Detection & Recovery: Implemented logic to detect Karpenter provisioning failures (e.g., ICE) and temporarily stop reporting capacity for failed NodePools until a recovery interval elapses.
  • Opt-in Configuration: The NodeAutoscalerEstimator is an opt-in plugin, activated via a new --plugins flag, ensuring backward compatibility and allowing users to explicitly enable it.
  • Dynamic Resource Matching: The estimator supports dynamic resource matching for various types, including GPU, CPU, memory, and custom accelerators, utilizing a dynamic client without a direct Karpenter dependency.
  • Scheduler Logic Enhancement: Modified the calAvailableReplicas() function in the scheduler to prioritize results from the scheduler-estimator (when active) over the general-estimator if it reports higher availability, leveraging the new autoscaler capacity insights.

Contributor

Copilot AI left a comment


Pull request overview

This pull request introduces a new NodeAutoscalerEstimator plugin for Karmada that calculates pod capacity by accounting for Karpenter node autoscaler potential. This enables dynamicWeight: AvailableReplicas scheduling to work in scale-from-zero environments where nodes are provisioned on demand.

Changes:

  • New NodeAutoscalerEstimator plugin that combines existing node resources with Karpenter NodePool capacity
  • Karpenter provider implementation with failure detection and recovery mechanisms
  • Plugin registry updates to support optional extended plugins
  • Scheduler-side logic change to prefer scheduler-estimator results when they're higher than general-estimator
  • CLI flag --plugins to selectively enable estimator plugins

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

File | Description
pkg/scheduler/core/util.go | Scheduler logic to prefer scheduler-estimator when available
pkg/estimator/server/server.go | Plugin loading and dynamic client initialization
pkg/estimator/server/framework/plugins/registry.go | Registry functions for in-tree and extended plugins
pkg/estimator/server/framework/plugins/nodeautoscaler/* | New NodeAutoscalerEstimator plugin implementation and Karpenter provider
cmd/scheduler-estimator/app/options/options.go | CLI flag for plugin selection


Comment thread pkg/scheduler/core/util.go
Comment thread pkg/scheduler/core/util.go
Comment thread pkg/scheduler/core/util.go Outdated

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces the NodeAutoscalerEstimator plugin, allowing Karmada to estimate potential capacity from autoscalers like Karpenter. It adds the KarpenterProvider, a --plugins flag for the estimator server, and logic in the scheduler to prefer these results. Reviewers identified critical precision issues in resource calculations, advising the use of MilliValue() instead of Value() for CPU resources. Additionally, feedback suggests replacing recover() with standard error handling and refactoring dependency injection to avoid package-level global variables.

Comment thread pkg/estimator/server/framework/plugins/nodeautoscaler/karpenter.go Outdated
Comment thread pkg/estimator/server/framework/plugins/nodeautoscaler/karpenter.go Outdated
Comment thread pkg/estimator/server/framework/plugins/nodeautoscaler/karpenter.go Outdated
Comment thread pkg/estimator/server/framework/plugins/nodeautoscaler/karpenter.go Outdated
Comment thread pkg/estimator/server/framework/plugins/nodeautoscaler/nodeautoscaler.go Outdated
@bluayer bluayer force-pushed the feature/node-autoscaler-estimator-plugin branch 3 times, most recently from c751507 to f40c68d Compare April 9, 2026 05:13

codecov-commenter commented Apr 9, 2026

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 64.86486% with 78 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.25%. Comparing base (12d6f1d) to head (3c13b7a).
⚠️ Report is 60 commits behind head on master.

Files with missing lines Patch % Lines
...framework/plugins/nodeautoscaler/nodeautoscaler.go 54.71% 21 Missing and 3 partials ⚠️
...rver/framework/plugins/nodeautoscaler/karpenter.go 85.71% 9 Missing and 9 partials ⚠️
pkg/scheduler/core/util.go 21.73% 18 Missing ⚠️
pkg/estimator/server/server.go 11.11% 7 Missing and 1 partial ⚠️
...kg/estimator/server/framework/runtime/framework.go 16.66% 5 Missing ⚠️
pkg/estimator/server/framework/plugins/registry.go 0.00% 4 Missing ⚠️
cmd/scheduler-estimator/app/options/options.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7376      +/-   ##
==========================================
+ Coverage   42.17%   42.25%   +0.08%     
==========================================
  Files         875      877       +2     
  Lines       53601    53835     +234     
==========================================
+ Hits        22604    22750     +146     
- Misses      29301    29374      +73     
- Partials     1696     1711      +15     
Flag Coverage Δ
unittests 42.25% <64.86%> (+0.08%) ⬆️

@bluayer bluayer force-pushed the feature/node-autoscaler-estimator-plugin branch 2 times, most recently from 9cd9d72 to 5251976 Compare April 9, 2026 06:47
@zhzhuang-zju
Contributor

Hi @bluayer, thank you for your feedback! This scenario seems quite reasonable.
Once your PR is ready, feel free to cc me for a review.

/assign

The existing NodeResourceEstimator returns available=0 when no nodes
exist, which prevents dynamicWeight scheduling from placing pods in
clusters with node autoscalers like Karpenter. Pods are never created,
so the autoscaler never provisions nodes.

NodeAutoscalerEstimator solves this by including potential capacity
from Karpenter NodePool limits in the MaxAvailableReplicas calculation.

Changes:
- New plugin: pkg/estimator/server/framework/plugins/nodeautoscaler/
  - CapacityProvider interface for pluggable autoscaler backends
  - KarpenterProvider: queries NodePool limits and NodeClaim usage
    via dynamic client (no Karpenter module dependency)
  - Dynamic resource matching (GPU, CPU, memory, custom accelerators)
  - Failure detection: marks NodePools as failed when no NodeClaims
    are created for failureThreshold (3min) despite Pending pods,
    auto-recovers after recoveryInterval (10min)
- Add --plugins flag to scheduler-estimator for opt-in activation
- Register in NewExtendedRegistry (default registry unchanged)
- Scheduler util.go: prefer scheduler-estimator over general-estimator
  when scheduler-estimator reports higher availability

Signed-off-by: Jungwoo Song <bluayer@gmail.com>
@bluayer bluayer force-pushed the feature/node-autoscaler-estimator-plugin branch from 5251976 to 752abf6 Compare April 9, 2026 06:58
@karmada-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from zhzhuang-zju. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Author

bluayer commented Apr 9, 2026

Hi @zhzhuang-zju, I've addressed all the feedback from the AI reviewer and CI is passing cleanly.
Would appreciate your review when you get a chance. Thanks!

Author

bluayer commented Apr 15, 2026

Hi @zhzhuang-zju, just a gentle ping. Thanks!

@zhzhuang-zju
Contributor

Thanks @bluayer, the review is in progress. I will submit my review comments as soon as possible.

BTW, would you be interested in joining the Karmada community meeting to talk about this feature?

@bluayer
Author

bluayer commented Apr 15, 2026

Hi @zhzhuang-zju, thanks for the update!

I'd love to join the community meeting. Could you share more details about the format and what you'd expect me to cover?

@zhzhuang-zju
Contributor

I'd love to join the community meeting. Could you share more details about the format and what you'd expect me to cover?

That would be great! First, due to geographical and language considerations, the Karmada community meetings are held in two separate sessions, every two weeks each. You can choose either one to attend based on your schedule.

There is no specific format required. You only need to add your agenda under the corresponding meeting section in the document at https://docs.google.com/document/d/1y6YLVC-v7cmVAdbjedoyR5WL0-q45DBRXTvz5_I7bkA/edit?tab=t.0#heading=h.g61sgp7w0d0c. You can share this feature with the Karmada community and other users, such as its use cases, how to use it, and any other requirements or expectations you may have for the community, and so on.

Comment thread pkg/scheduler/core/util_test.go Outdated
Comment thread pkg/estimator/server/framework/plugins/nodeautoscaler/nodeautoscaler.go Outdated
Replace package-level variable dynamicClientForProvider with proper
dependency injection through framework.Handle interface, following
the same pattern used in PR karmada-io#6877 for Parallelism().

Changes:
- Add DynamicClient() to framework.Handle interface
- Add WithDynamicClient option to framework runtime
- Pass dynamic client via WithDynamicClient in server.go
- Remove SetDynamicClient and package-level variable from nodeautoscaler

Signed-off-by: Jungwoo Song <bluayer@gmail.com>
@bluayer
Author

bluayer commented Apr 20, 2026

Just a few updates (cc @zhzhuang-zju):

  1. I've added this PR as an agenda item of Apr 28 in the Google Docs
  2. I've addressed the comments left on the PR.

Please let me know if you have any additional comments.

@zhzhuang-zju
Contributor

I've added this PR as an agenda item of Apr 28 in the Google Docs
I've addressed the comments left on the PR.

@bluayer Thanks for the quick response, and I'm looking forward to the community meeting on Apr 28. Sorry for the delayed follow-up on my side — I've been a bit busy recently, but I'll prioritize this PR and continue pushing it forward.

@bluayer
Author

bluayer commented Apr 21, 2026

Hey @zhzhuang-zju, no worries at all. I completely understand your situation and recognize that this is an XXL-size PR.

It was just a reminder. Don't worry!

@zhzhuang-zju
Contributor

Hi @bluayer, I noticed that you will be joining the community meeting tonight, which is great. Let me briefly note the main question I would like to discuss.

My main concern is the overlap between the proposed NodeAutoscaler plugin and the existing NodeResource plugin. NodeAutoscaler is introduced to account for potential capacity from node autoscalers, but part of that estimation appears to overlap with NodeResource.

I am not sure using --plugins alone is enough to avoid duplicate calculation. For example, with --plugins=NodeAutoscalerEstimator,NodeResourceEstimator,ResourceQuotaEstimator, the capacity of existing nodes could still be counted twice.

So should node autoscaler support be integrated into NodeResource plugin behind a flag, or, if kept as a separate plugin, should it be mutually exclusive with NodeResource?

@bluayer
Author

bluayer commented Apr 28, 2026

@zhzhuang-zju, This is something I've been thinking about too, and I'm glad you brought it up. Also I would love to discuss this further.

When all three plugins are enabled together, NodeAutoscaler does calculate current + potential, but the framework's min() aggregation ensures NodeResource's value takes precedence. This is intentional because I wanted to preserve the existing behavior by default. But I agree with you, it can be misleading and wastes computation by calculating the same thing twice.
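The min() aggregation mentioned above can be illustrated as follows. This is a sketch of the behavior being described, not the framework's actual implementation.

```go
package main

import "fmt"

// aggregate mimics the estimator framework taking the minimum across
// plugin results: if NodeResource reports only current capacity while
// NodeAutoscaler reports current + potential, the smaller (NodeResource)
// value wins and the autoscaler's potential capacity is discarded.
func aggregate(results []int32) int32 {
	min := results[0]
	for _, r := range results[1:] {
		if r < min {
			min = r
		}
	}
	return min
}

func main() {
	nodeResource := int32(4)    // current nodes only
	nodeAutoscaler := int32(12) // current + autoscaler potential
	fmt.Println(aggregate([]int32{nodeResource, nodeAutoscaler})) // 4
}
```

This is why enabling both plugins wastes computation: NodeAutoscaler's larger estimate never survives the aggregation, yet both plugins walk the same node data.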

From the start, I designed NodeAutoscaler after NodeResource and treated them as separate, mutually exclusive plugins for different use cases. NodeResource is for static workloads where cluster capacity is fixed. NodeAutoscaler is for dynamic workloads where an autoscaler (e.g., Karpenter) can provision additional nodes. Users would choose one depending on their environment.

If we wanted to support combining multiple plugins additively, we'd need to design and change the framework's aggregation logic. That's a significantly larger change than adding a plugin, and I'm concerned it would introduce confusion for existing Karmada users who rely on the current behavior.

So my proposal is to keep the current structure and add a startup validation that rejects enabling both the NodeAutoscaler and NodeResource plugins at the same time, with a clear error message. What do you think?
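A startup validation along those lines could look like the sketch below. The function name is hypothetical; only the plugin names come from this PR.

```go
package main

import "fmt"

// validatePlugins rejects enabling NodeAutoscalerEstimator and
// NodeResourceEstimator together, since both estimate node capacity and
// min() aggregation would make one of them redundant. Illustrative
// sketch of the proposed check, not code from this PR.
func validatePlugins(enabled []string) error {
	has := map[string]bool{}
	for _, p := range enabled {
		has[p] = true
	}
	if has["NodeAutoscalerEstimator"] && has["NodeResourceEstimator"] {
		return fmt.Errorf("plugins NodeAutoscalerEstimator and NodeResourceEstimator are mutually exclusive; enable only one")
	}
	return nil
}

func main() {
	fmt.Println(validatePlugins([]string{"NodeAutoscalerEstimator", "ResourceQuotaEstimator"}))        // <nil>
	fmt.Println(validatePlugins([]string{"NodeAutoscalerEstimator", "NodeResourceEstimator"}) != nil) // true
}
```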

@RainbowMango
Member

From the start, I designed NodeAutoscaler after NodeResource and treated them as separate, mutually exclusive plugins for different use cases. NodeResource is for static workloads where cluster capacity is fixed. NodeAutoscaler is for dynamic workloads where an autoscaler (e.g., Karpenter) can provision additional nodes. Users would choose one depending on their environment.

That (behavior) is what I inferred from the code. For the use case where an autoscaler (e.g., Karpenter) is enabled, the flag --plugins=NodeAutoscalerEstimator,ResourceQuotaEstimator is needed, which skips the existing NodeResource plugin.

In fact, the NodeAutoscalerEstimator is more like an enhanced plugin of the existing NodeResourceEstimator; they share some common code. That means we have to maintain both of them simultaneously.

If we wanted to support combining multiple plugins additively, we'd need to design and change the framework's aggregation logic. That's a significantly larger change than adding a plugin, and I'm concerned it would introduce confusion for existing Karmada users who rely on the current behavior.

I'm thinking of this (aggregation logic) as well; this is probably one of the alternatives. PS: the size of the change is not a concern to me.



Development

Successfully merging this pull request may close these issues.

Add estimator plugin to support node autoscaler capacity (Karpenter/CAS)

6 participants