[Proposal] Scheduler Estimator Support for Dynamic Resource Allocation (DRA) #7389
seanlaii wants to merge 1 commit into karmada-io:master
[APPROVALNOTIFIER] This PR is NOT APPROVED. It still needs approval from an approver for each of the changed files.
Summary of Changes (Gemini Code Assist)
This pull request proposes a design to enhance the karmada-scheduler-estimator with support for Kubernetes Dynamic Resource Allocation (DRA). By enabling the estimator to account for non-traditional compute resources such as GPUs and FPGAs, the scheduler can make more accurate placement decisions in multi-cluster environments, preventing workloads from being assigned to clusters that cannot fulfill their specific hardware requirements.
Code Review
This pull request proposes adding Dynamic Resource Allocation (DRA) support to the Karmada scheduler estimator by introducing a new plugin and extending the ReplicaRequirements API. The review feedback identifies several necessary updates to align the proposal with the Kubernetes v1 DRA API, including flattening the DeviceRequest structure and correcting the AllNodes field type. Other recommendations include adding Kubebuilder validation markers for API safety, increasing the visibility of template resolution logs, and documenting internal functions.
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage Diff

| Metric | master | #7389 | +/- |
|---|---|---|---|
| Coverage | 42.04% | 42.16% | +0.12% |
| Files | 874 | 876 | +2 |
| Lines | 53544 | 64968 | +11424 |
| Hits | 22515 | 27397 | +4882 |
| Misses | 29341 | 35872 | +6531 |
| Partials | 1688 | 1699 | +11 |
ea81eee to 654bd7b
Hi @RainbowMango @GitHubxsy, the design document is ready for review. Please take a look when you have a chance. Thank you for your time!
Pull request overview
Adds a design proposal describing how to extend karmada-scheduler-estimator to account for Kubernetes Dynamic Resource Allocation (DRA) when estimating max available replicas, to avoid scheduling workloads onto clusters that can’t satisfy device claims.
Changes:
- Introduces a new proposal document outlining a `DynamicResourceEstimator` plugin for DRA-aware capacity estimation.
- Specifies proposed API/type extensions (e.g., `ReplicaRequirements.DeviceClaim`) and gRPC/protobuf propagation for device requests and CEL selectors.
- Describes estimation algorithm details, feature gating (`DRAEstimator`), and a unit/e2e test plan.
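
To make the idea of DRA-aware capacity estimation concrete, here is a minimal sketch of one way a per-cluster replica bound could be derived from device availability. It is not the algorithm from the proposal: the `deviceRequest` type, the `maxReplicasForDevices` function, the `unlimited` sentinel, and the example device class names are all hypothetical, and the sketch ignores per-node placement, CEL selectors, and existing claims.

```go
package main

import "fmt"

// deviceRequest is a hypothetical, simplified stand-in for one device request
// of a replica: a device class name and the number of devices required.
type deviceRequest struct {
	DeviceClassName string
	Count           int64
}

// maxReplicasForDevices returns how many replicas fit given the number of
// free devices per device class. The bound is the minimum over all requests
// of floor(free / count). Requests with a non-positive count are skipped.
// With no effective requests, the hypothetical "unlimited" sentinel is returned.
func maxReplicasForDevices(free map[string]int64, requests []deviceRequest) int64 {
	const unlimited = int64(1<<31 - 1)
	result := unlimited
	for _, r := range requests {
		if r.Count <= 0 {
			continue
		}
		// A class missing from the map has zero free devices.
		n := free[r.DeviceClassName] / r.Count
		if n < result {
			result = n
		}
	}
	return result
}

func main() {
	free := map[string]int64{"gpu.example.com": 8, "nic.example.com": 3}
	reqs := []deviceRequest{
		{DeviceClassName: "gpu.example.com", Count: 2},
		{DeviceClassName: "nic.example.com", Count: 1},
	}
	fmt.Println(maxReplicasForDevices(free, reqs)) // prints 3
}
```

In this example the GPU request allows four replicas and the NIC request allows three, so the scarcer device class determines the estimate.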
| `DRANodeAllocatableResources` | `Device.NodeAllocatableResourceMappings` |
| `DRAListTypeAttributes` | `DeviceAttribute.{IntValues,BoolValues,StringValues,VersionValues}`, CEL `includes()` helper |

When any of these gates become GA in a future Kubernetes release, support can be added as additive optional fields; see [Future Extensibility](#6-future-extensibility).
The “Feature-gated fields explicitly excluded” table lists DRANodeAllocatableResources / Device.NodeAllocatableResourceMappings and DRAListTypeAttributes / DeviceAttribute.{IntValues,...}. Those feature gates/fields don’t exist in the vendored k8s.io/api/resource/v1 types (v0.35.3). Please update this list to match the actual +featureGate annotations in the current API (or clearly mark items that are speculative/future).
Suggested change:
Note: earlier drafts mentioned `DRANodeAllocatableResources` / `Device.NodeAllocatableResourceMappings` and `DRAListTypeAttributes` / `DeviceAttribute.{IntValues,BoolValues,StringValues,VersionValues}` (plus the CEL `includes()` helper) as possible future DRA extensions. These are not part of the current vendored `k8s.io/api/resource/v1` API and are therefore not listed as explicit exclusions here.
When any of the current gates above become GA in a future Kubernetes release, support can be added as additive optional fields; see [Future Extensibility](#6-future-extensibility).
…n (DRA) Signed-off-by: seanlaii <qazwsx0939059006@gmail.com>
What type of PR is this?
/kind feature
/kind documentation
What this PR does / why we need it:
This PR adds a design proposal for enhancing the karmada-scheduler-estimator to support Kubernetes Dynamic Resource Allocation (DRA). Currently, the estimator only evaluates traditional resources (CPU, memory) and has no awareness of DRA-managed devices such as GPUs, FPGAs, and smart NICs. As a result, the scheduler may assign workloads to clusters that cannot fulfill their DRA requirements, leaving pods pending indefinitely.
The proposal introduces a new DynamicResourceEstimator plugin that queries ResourceSlice, ResourceClaim, and DeviceClass objects on member clusters to accurately estimate device availability. It extends ReplicaRequirements with a two-level DeviceClaim structure designed for future extensibility (constraints, device taints, etc.) without breaking changes. The feature is gated behind a DRAEstimator feature flag (alpha, default off).
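
As an illustration only, the two-level shape and the feature gate could look roughly like the sketch below. Only the names `ReplicaRequirements`, `DeviceClaim`, and `DRAEstimator` come from this PR; the field layout, the `DeviceRequest` type, and the `needsDRAEstimation` helper are assumptions made for the example, and the actual proposal may differ (the review feedback, for instance, recommends changes to the `DeviceRequest` structure).

```go
package main

import "fmt"

// DeviceRequest is a hypothetical inner level: one request for devices of a
// given class. Selectors holds CEL expressions as opaque strings here.
type DeviceRequest struct {
	Name            string
	DeviceClassName string
	Selectors       []string
	Count           int64
}

// DeviceClaim is a hypothetical outer level. Keeping requests in a wrapper
// struct leaves room to add sibling fields later (for example constraints)
// without changing the type of the field on ReplicaRequirements.
type DeviceClaim struct {
	Requests []DeviceRequest
}

// ReplicaRequirements is reduced here to the one field relevant to DRA.
type ReplicaRequirements struct {
	DeviceClaim *DeviceClaim
}

// needsDRAEstimation is a hypothetical helper: DRA estimation applies only
// when the DRAEstimator gate is enabled and the replica requests devices.
func needsDRAEstimation(gateEnabled bool, req ReplicaRequirements) bool {
	if !gateEnabled || req.DeviceClaim == nil {
		return false
	}
	return len(req.DeviceClaim.Requests) > 0
}

func main() {
	req := ReplicaRequirements{DeviceClaim: &DeviceClaim{
		Requests: []DeviceRequest{{Name: "gpu", DeviceClassName: "gpu.example.com", Count: 1}},
	}}
	fmt.Println(needsDRAEstimation(false, req)) // gate off (the alpha default): false
	fmt.Println(needsDRAEstimation(true, req))  // gate on: true
}
```

With the gate off, a workload that carries device requests is estimated exactly as before, which matches the alpha, default-off behavior described above.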
Which issue(s) this PR fixes:
Fixes #7095
Special notes for your reviewer:
Does this PR introduce a user-facing change?: