feat(estimator): add NodeAutoscalerEstimator plugin#7376
feat(estimator): add NodeAutoscalerEstimator plugin#7376bluayer wants to merge 2 commits intokarmada-io:masterfrom
Conversation
|
Welcome @bluayer! It looks like this is your first PR to karmada-io/karmada 🎉 |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Pull request overview
This pull request introduces a new NodeAutoscalerEstimator plugin for Karmada that calculates pod capacity by accounting for Karpenter node autoscaler potential. This enables dynamicWeight: AvailableReplicas scheduling to work in scale-from-zero environments where nodes are provisioned on demand.
Changes:
- New NodeAutoscalerEstimator plugin that combines existing node resources with Karpenter NodePool capacity
- Karpenter provider implementation with failure detection and recovery mechanisms
- Plugin registry updates to support optional extended plugins
- Scheduler-side logic change to prefer scheduler-estimator results when they're higher than general-estimator
- CLI flag
--pluginsto selectively enable estimator plugins
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/scheduler/core/util.go | Scheduler logic to prefer scheduler-estimator when available |
| pkg/estimator/server/server.go | Plugin loading and dynamic client initialization |
| pkg/estimator/server/framework/plugins/registry.go | Registry functions for in-tree and extended plugins |
| pkg/estimator/server/framework/plugins/nodeautoscaler/* | New NodeAutoscalerEstimator plugin implementation and Karpenter provider |
| cmd/scheduler-estimator/app/options/options.go | CLI flag for plugin selection |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Code Review
This pull request introduces the NodeAutoscalerEstimator plugin, allowing Karmada to estimate potential capacity from autoscalers like Karpenter. It adds the KarpenterProvider, a --plugins flag for the estimator server, and logic in the scheduler to prefer these results. Reviewers identified critical precision issues in resource calculations, advising the use of MilliValue() instead of Value() for CPU resources. Additionally, feedback suggests replacing recover() with standard error handling and refactoring dependency injection to avoid package-level global variables.
c751507 to
f40c68d
Compare
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #7376 +/- ##
==========================================
+ Coverage 42.17% 42.25% +0.08%
==========================================
Files 875 877 +2
Lines 53601 53835 +234
==========================================
+ Hits 22604 22750 +146
- Misses 29301 29374 +73
- Partials 1696 1711 +15
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
9cd9d72 to
5251976
Compare
|
Hi @bluayer, thank you for your feedback! This scenario seems quite reasonable. /assign |
The existing NodeResourceEstimator returns available=0 when no nodes
exist, which prevents dynamicWeight scheduling from placing pods in
clusters with node autoscalers like Karpenter. Pods are never created,
so the autoscaler never provisions nodes.
NodeAutoscalerEstimator solves this by including potential capacity
from Karpenter NodePool limits in the MaxAvailableReplicas calculation.
Changes:
- New plugin: pkg/estimator/server/framework/plugins/nodeautoscaler/
- CapacityProvider interface for pluggable autoscaler backends
- KarpenterProvider: queries NodePool limits and NodeClaim usage
via dynamic client (no Karpenter module dependency)
- Dynamic resource matching (GPU, CPU, memory, custom accelerators)
- Failure detection: marks NodePools as failed when no NodeClaims
are created for failureThreshold (3min) despite Pending pods,
auto-recovers after recoveryInterval (10min)
- Add --plugins flag to scheduler-estimator for opt-in activation
- Register in NewExtendedRegistry (default registry unchanged)
- Scheduler util.go: prefer scheduler-estimator over general-estimator
when scheduler-estimator reports higher availability
Signed-off-by: Jungwoo Song <bluayer@gmail.com>
5251976 to
752abf6
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @zhzhuang-zju, I've addressed all the feedback from the AI reviewer and CI is passing cleanly. |
|
Hi @zhzhuang-zju, just a gentle ping. Thanks! |
|
Thanks @bluayer, the review is in progress. I will submit my review comments as soon as possible. BTW, would you be interested in joining the Karmada community meeting to talk about this feature? |
|
Hi @zhzhuang-zju, thanks for the update! I'd love to join the community meeting. Could you share more details about the format and what you'd expect me to cover? |
That would be great! First, due to geographical and language considerations, the Karmada community meetings are held in two separate sessions, every two weeks each. You can choose either one to attend based on your schedule. There is no specific format required. You only need to add your agenda under the corresponding meeting section in the document at https://docs.google.com/document/d/1y6YLVC-v7cmVAdbjedoyR5WL0-q45DBRXTvz5_I7bkA/edit?tab=t.0#heading=h.g61sgp7w0d0c. You can share this feature with the Karmada community and other users, such as its use cases, how to use it, and any other requirements or expectations you may have for the community, and so on. |
Replace package-level variable dynamicClientForProvider with proper dependency injection through framework.Handle interface, following the same pattern used in PR karmada-io#6877 for Parallelism(). Changes: - Add DynamicClient() to framework.Handle interface - Add WithDynamicClient option to framework runtime - Pass dynamic client via WithDynamicClient in server.go - Remove SetDynamicClient and package-level variable from nodeautoscaler Signed-off-by: Jungwoo Song <bluayer@gmail.com>
|
Just a few updates ( cc @zhzhuang-zju ) :
Please let me know if you have any additional comments. |
@bluayer Thanks for the quick response, and I'm looking forward to the community meeting on Apr 28. Sorry for the delayed follow-up on my side — I've been a bit busy recently, but I'll prioritize this PR and continue pushing it forward. |
|
Hey @zhzhuang-zju, No worries at all. I completely understand your situation and recognize that this is a XXL size PR. It was just a reminder. Don't worry! |
|
Hi @bluayer, I noticed that you will be joining the community meeting tonight, which is great. Let me briefly note the main question I would like to discuss. My main concern is the overlap between the proposed I am not sure using So should node autoscaler support be integrated into |
|
@zhzhuang-zju, This is something I've been thinking about too, and I'm glad you brought it up. Also I would love to discuss this further. When all three plugins are enabled together, At the first time, I designed If we wanted to support combining multiple plugins additively, we'd need to design and change the framework's aggregation logic. That's a significantly larger change than adding a plugin, and I'm concerned it would introduce confusion for existing Karmada users who rely on the current behavior. So my proposal is to keep the current structure and add a startup validation that rejects enabling both |
That(behavior) is what I inferred from the code. For the use case where an autoscaler(e.g., Karpenter) is enabled, the flag In fact, the
I'm thinking of this(aggregation logic) as well; this is probably one of the alternatives. PS: Changing size is not a concern to me. |
What type of PR is this?
/kind feature
What this PR does / why we need it:
Adds a new opt-in estimator plugin
NodeAutoscalerEstimatorthat includes node autoscaler (Karpenter) potential capacity inMaxAvailableReplicascalculation.The existing
NodeResourceEstimatorreturns 0 when no nodes exist, which preventsdynamicWeight: AvailableReplicasfrom placing pods in clusters with Karpenter. Since no pods are created, Karpenter never provisions nodes.NodeAutoscalerEstimatorsolves this by querying Karpenter NodePoolspec.limitsand NodeClaimstatus.allocatableto calculate remaining capacity. It also detects provisioning failures (e.g. ICE) and stops reporting capacity for failed NodePools until a recovery interval elapses.Key design decisions:
--plugins=NodeAutoscalerEstimator,ResourceQuotaEstimator--plugins, estimator behavior is unchangedcalAvailableReplicas()skipsgeneral-estimatorwhenscheduler-estimatorreports higher availability. This only triggers whenNodeAutoscalerEstimatoris active — with the defaultNodeResourceEstimator,scheduler-estimatoralways returns ≤general-estimator, so the skip condition is never true.Which issue(s) this PR fixes:
Fixes #7375
Special notes for your reviewer:
E2E tested on EKS with Karmada v1.17.1:
29 unit tests covering plugin logic, failure detection, recovery, and resource matching.
Does this PR introduce a user-facing change?: