[Proposal] Scheduler Estimator Support for Dynamic Resource Allocation (DRA) #7389

Open
seanlaii wants to merge 1 commit into karmada-io:master from seanlaii:dra-estimator

@seanlaii
Contributor

What type of PR is this?
/kind feature
/kind documentation

What this PR does / why we need it:
This PR adds a design proposal for enhancing the karmada-scheduler-estimator to support Kubernetes Dynamic Resource Allocation (DRA). Currently, the estimator only evaluates traditional resources (CPU, memory) and has no awareness of DRA-managed devices such as GPUs, FPGAs, and smart NICs. As a result, the scheduler may assign workloads to clusters that cannot satisfy their DRA requirements, leaving pods pending indefinitely.

The proposal introduces a new DynamicResourceEstimator plugin that queries ResourceSlice, ResourceClaim, and DeviceClass objects on member clusters to accurately estimate device availability. It extends ReplicaRequirements with a two-level DeviceClaim structure designed for future extensibility (constraints, device taints, etc.) without breaking changes. The feature is gated behind a DRAEstimator feature flag (alpha, default off).
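
To make the estimation idea concrete, here is a minimal sketch of how free devices could be turned into a replica count. It is not the proposal's implementation: the `DeviceClaim` and `DeviceRequest` shapes, the per-class device counts, and the helpers `freeDevices` and `maxReplicas` are all hypothetical simplifications. It assumes the devices published in ResourceSlice objects and those already allocated through ResourceClaim objects have been pre-aggregated into per-DeviceClass counts, which skips selectors, constraints, and node topology.

```go
package main

import "fmt"

// DeviceRequest is a hypothetical, simplified request: Count devices of one DeviceClass.
type DeviceRequest struct {
	DeviceClassName string
	Count           int64
}

// DeviceClaim is a hypothetical two-level shape: a named claim holding requests.
type DeviceClaim struct {
	Name     string
	Requests []DeviceRequest
}

// freeDevices subtracts allocated devices from the total per class.
// Assumption: totals come from ResourceSlices, allocations from ResourceClaims.
func freeDevices(total, allocated map[string]int64) map[string]int64 {
	free := make(map[string]int64, len(total))
	for class, n := range total {
		f := n - allocated[class]
		if f < 0 {
			f = 0 // never report negative capacity
		}
		free[class] = f
	}
	return free
}

// maxReplicas returns how many replicas fit, or -1 when the replica
// requests no devices and DRA therefore imposes no limit.
func maxReplicas(claims []DeviceClaim, free map[string]int64) int64 {
	// Sum the per-replica demand for each class across all claims.
	need := map[string]int64{}
	for _, c := range claims {
		for _, r := range c.Requests {
			need[r.DeviceClassName] += r.Count
		}
	}
	result := int64(-1)
	for class, n := range need {
		if n <= 0 {
			continue
		}
		fit := free[class] / n // a missing class counts as zero free devices
		if result < 0 || fit < result {
			result = fit // the scarcest class bounds the replica count
		}
	}
	return result
}

func main() {
	claims := []DeviceClaim{{
		Name: "accelerators",
		Requests: []DeviceRequest{
			{DeviceClassName: "gpu.example.com", Count: 2},
			{DeviceClassName: "nic.example.com", Count: 1},
		},
	}}
	total := map[string]int64{"gpu.example.com": 8, "nic.example.com": 3}
	allocated := map[string]int64{"gpu.example.com": 2}
	fmt.Println(maxReplicas(claims, freeDevices(total, allocated)))
}
```

In this sketch a cluster that publishes no devices for a requested class yields zero replicas, which is the behavior that would keep such a cluster from being selected.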

Which issue(s) this PR fixes:

Fixes #7095

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@karmada-bot added labels on Apr 14, 2026: do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) and kind/feature (categorizes issue or PR as related to a new feature)
@karmada-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign chaunceyjiang for approval. For more information see the Code Review Process.

@karmada-bot requested a review from Tingtal April 14, 2026 02:15
@karmada-bot added the kind/documentation label (categorizes issue or PR as related to documentation) Apr 14, 2026
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request proposes a design to enhance the karmada-scheduler-estimator with support for Kubernetes Dynamic Resource Allocation (DRA). By enabling the estimator to account for non-traditional compute resources such as GPUs and FPGAs, the scheduler can make more accurate placement decisions in multi-cluster environments, preventing workloads from being assigned to clusters that cannot fulfill their specific hardware requirements.

Highlights

  • DRA Support Proposal: Introduces a design proposal for integrating Kubernetes Dynamic Resource Allocation (DRA) into the karmada-scheduler-estimator to enable awareness of specialized hardware like GPUs.
  • API and Plugin Enhancements: Proposes extending the ReplicaRequirements API with a DeviceClaim structure and implementing a new DynamicResourceEstimator plugin to calculate cluster-level device availability.
  • Feature Gating: Ensures the new DRA estimation capabilities are gated behind a DRAEstimator feature flag, defaulting to off, to maintain stability.
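
As an illustration of the gating behavior described in the last highlight, the following sketch keeps a gate off unless it is explicitly enabled. Only the gate name `DRAEstimator` and its default-off state come from the PR; the `Name=bool` flag format and the helpers `defaultGates` and `parseGates` are assumptions made for this example, and a real component would likely rely on its existing feature-gate machinery instead.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// DRAEstimator is the gate name from the proposal (alpha, default off).
const DRAEstimator = "DRAEstimator"

// defaultGates returns the known gates with their default values.
func defaultGates() map[string]bool {
	return map[string]bool{DRAEstimator: false}
}

// parseGates applies a hypothetical "Name=bool,Name=bool" spec on top of the
// defaults. Unknown gate names and malformed entries are rejected.
func parseGates(spec string) (map[string]bool, error) {
	gates := defaultGates()
	if strings.TrimSpace(spec) == "" {
		return gates, nil
	}
	for _, entry := range strings.Split(spec, ",") {
		name, value, ok := strings.Cut(strings.TrimSpace(entry), "=")
		if !ok {
			return nil, fmt.Errorf("malformed gate entry %q", entry)
		}
		if _, known := gates[name]; !known {
			return nil, fmt.Errorf("unknown feature gate %q", name)
		}
		enabled, err := strconv.ParseBool(value)
		if err != nil {
			return nil, fmt.Errorf("invalid value for gate %q: %w", name, err)
		}
		gates[name] = enabled
	}
	return gates, nil
}

func main() {
	gates, err := parseGates("DRAEstimator=true")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if gates[DRAEstimator] {
		fmt.Println("DRA-aware estimation enabled")
	} else {
		fmt.Println("DRA-aware estimation disabled; only traditional resources are evaluated")
	}
}
```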

@karmada-bot added the size/XL label (denotes a PR that changes 500-999 lines, ignoring generated files) Apr 14, 2026

@gemini-code-assist (bot) left a comment

Code Review

This pull request proposes adding Dynamic Resource Allocation (DRA) support to the Karmada scheduler estimator by introducing a new plugin and extending the ReplicaRequirements API. The review feedback identifies several necessary updates to align the proposal with the Kubernetes v1 DRA API, including flattening the DeviceRequest structure and correcting the AllNodes field type. Other recommendations include adding Kubebuilder validation markers for API safety, increasing the visibility of template resolution logs, and documenting internal functions.
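
For readers unfamiliar with the Kubebuilder validation markers recommended here, the sketch below shows how such markers sit on API types and what they would enforce. The struct shapes, field names, and the bounds are hypothetical and are not taken from the proposal. The markers themselves (`MinLength`, `Minimum`, `MinItems`, `MaxItems`) are standard Kubebuilder markers that are compiled into the CRD schema, and the `validate` function only mirrors them in plain Go so the effect can be seen without an API server.

```go
package main

import (
	"errors"
	"fmt"
)

// DeviceRequest is a hypothetical request type used only for illustration.
type DeviceRequest struct {
	// DeviceClassName names the DeviceClass to request from.
	// +kubebuilder:validation:MinLength=1
	// +required
	DeviceClassName string `json:"deviceClassName"`

	// Count is the number of devices needed per replica.
	// +kubebuilder:validation:Minimum=1
	// +required
	Count int64 `json:"count"`
}

// DeviceClaim is a hypothetical claim type grouping several requests.
type DeviceClaim struct {
	// Requests lists the devices one replica needs. The upper bound of 32
	// is an illustrative choice, not a value stated in the proposal.
	// +kubebuilder:validation:MinItems=1
	// +kubebuilder:validation:MaxItems=32
	// +required
	Requests []DeviceRequest `json:"requests"`
}

// validate mirrors the markers above for demonstration purposes.
func validate(c DeviceClaim) error {
	if len(c.Requests) < 1 || len(c.Requests) > 32 {
		return errors.New("requests: must contain between 1 and 32 items")
	}
	for i, r := range c.Requests {
		if r.DeviceClassName == "" {
			return fmt.Errorf("requests[%d].deviceClassName: must not be empty", i)
		}
		if r.Count < 1 {
			return fmt.Errorf("requests[%d].count: must be at least 1", i)
		}
	}
	return nil
}

func main() {
	bad := DeviceClaim{Requests: []DeviceRequest{{DeviceClassName: "gpu.example.com", Count: 0}}}
	fmt.Println(validate(bad))
}
```

Declaring the limits as markers moves the check into the generated schema, so malformed objects are rejected at admission time rather than inside the estimator.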

6 comment threads on docs/proposals/scheduling/dra-estimator-support/README.md (all outdated)
@codecov-commenter

codecov-commenter commented Apr 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 42.16%. Comparing base (6012227) to head (18cf1d8).
⚠️ Report is 58 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7389      +/-   ##
==========================================
+ Coverage   42.04%   42.16%   +0.12%     
==========================================
  Files         874      876       +2     
  Lines       53544    64968   +11424     
==========================================
+ Hits        22515    27397    +4882     
- Misses      29341    35872    +6531     
- Partials     1688     1699      +11     
Flag Coverage Δ
unittests 42.16% <ø> (+0.12%) ⬆️

@seanlaii force-pushed the dra-estimator branch 2 times, most recently from ea81eee to 654bd7b, April 20, 2026 17:04
@seanlaii marked this pull request as ready for review April 20, 2026 17:06
Copilot AI review requested due to automatic review settings April 20, 2026 17:06
@karmada-bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Apr 20, 2026
@karmada-bot requested a review from mszacillo April 20, 2026 17:06
@seanlaii
Contributor Author

Hi @RainbowMango @GitHubxsy, the design document is ready for review. Please take a look when you have a chance. Thank you for your time!

Contributor

Copilot AI left a comment

Pull request overview

Adds a design proposal describing how to extend karmada-scheduler-estimator to account for Kubernetes Dynamic Resource Allocation (DRA) when estimating max available replicas, to avoid scheduling workloads onto clusters that can’t satisfy device claims.

Changes:

  • Introduces a new proposal document outlining a DynamicResourceEstimator plugin for DRA-aware capacity estimation.
  • Specifies proposed API/type extensions (e.g., ReplicaRequirements.DeviceClaim) and gRPC/protobuf propagation for device requests and CEL selectors.
  • Describes estimation algorithm details, feature gating (DRAEstimator), and a unit/e2e test plan.

Comment thread docs/proposals/scheduling/dra-estimator-support/README.md Outdated
Comment thread docs/proposals/scheduling/dra-estimator-support/README.md Outdated
Comment on lines +127 to +131
| `DRANodeAllocatableResources` | `Device.NodeAllocatableResourceMappings` |
| `DRAListTypeAttributes` | `DeviceAttribute.{IntValues,BoolValues,StringValues,VersionValues}`, CEL `includes()` helper |

When any of these gates become GA in a future Kubernetes release, support can be added as additive optional fields — see [Future Extensibility](#6-future-extensibility).

Copilot AI Apr 20, 2026

The “Feature-gated fields explicitly excluded” table lists DRANodeAllocatableResources / Device.NodeAllocatableResourceMappings and DRAListTypeAttributes / DeviceAttribute.{IntValues,...}. Those feature gates/fields don’t exist in the vendored k8s.io/api/resource/v1 types (v0.35.3). Please update this list to match the actual +featureGate annotations in the current API (or clearly mark items that are speculative/future).

Suggested change
| `DRANodeAllocatableResources` | `Device.NodeAllocatableResourceMappings` |
| `DRAListTypeAttributes` | `DeviceAttribute.{IntValues,BoolValues,StringValues,VersionValues}`, CEL `includes()` helper |
When any of these gates become GA in a future Kubernetes release, support can be added as additive optional fields — see [Future Extensibility](#6-future-extensibility).
Note: earlier drafts mentioned `DRANodeAllocatableResources` / `Device.NodeAllocatableResourceMappings` and `DRAListTypeAttributes` / `DeviceAttribute.{IntValues,BoolValues,StringValues,VersionValues}` (plus the CEL `includes()` helper) as possible future DRA extensions. These are not part of the current vendored `k8s.io/api/resource/v1` API and are therefore not listed as explicit exclusions here.
When any of the current gates above become GA in a future Kubernetes release, support can be added as additive optional fields — see [Future Extensibility](#6-future-extensibility).

Comment thread docs/proposals/scheduling/dra-estimator-support/README.md Outdated
…n (DRA)

Signed-off-by: seanlaii <qazwsx0939059006@gmail.com>

Labels

kind/documentation Categorizes issue or PR as related to documentation. kind/feature Categorizes issue or PR as related to a new feature. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

Feat: Support Kubernetes DRA in karmada-scheduler-estimator

4 participants