
docs: add zonal shift RFC#9010

Open
DerekFrank wants to merge 1 commit into aws:main from DerekFrank:zonal-shift-rfc

Conversation

@DerekFrank
Contributor

Fixes: #7271

Description

How was this change tested?

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@DerekFrank DerekFrank requested a review from a team as a code owner March 10, 2026 00:10
@github-actions
Contributor

github-actions bot commented Mar 10, 2026

Preview deployment ready!

Preview URL: https://pr-9010.d18coufmbnnaag.amplifyapp.com

Built from commit da8ba0c7f0eb601d6bc8406ebb79598139347a08


1. Stop provisioning capacity in the **impaired** AZ
2. Stop performing voluntary disruption in the **impaired** AZ.
3. Stop performing voluntary disruption in the **unimpaired** AZs if the disruption relies on scheduling pods to the **impaired** AZ.
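Concretely, step 1 amounts to dropping the impaired AZ from the set of zones provisioning may target. A minimal sketch, assuming an illustrative `filterImpairedZones` helper (the names and zone values here are hypothetical, not from the Karpenter codebase):

```go
package main

import "fmt"

// filterImpairedZones drops any zone under an active zonal shift from the
// set of zones provisioning may target. The function and variable names
// are illustrative, not from the Karpenter codebase.
func filterImpairedZones(candidates []string, impaired map[string]bool) []string {
	allowed := []string{}
	for _, z := range candidates {
		if !impaired[z] {
			allowed = append(allowed, z)
		}
	}
	return allowed
}

func main() {
	zones := []string{"us-east-1a", "us-east-1b", "us-east-1c"}
	impaired := map[string]bool{"us-east-1b": true}
	fmt.Println(filterImpairedZones(zones, impaired)) // [us-east-1a us-east-1c]
}
```

Steps 2 and 3 would apply the same zone check before any voluntary disruption decision, rather than before provisioning.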

Why would you discontinue disrupting instances in unimpaired AZs, e.g. underutilized or empty ones? If an application relies on infrastructure in the impaired AZ, it won't get scheduled unless your scheduling requirements are flexible. Are you worried about losing capacity in the unimpaired AZs during an outage?

Contributor Author

It would be preferable to stop all disruption, but that is not a hard requirement. Making that change requires integration with upstream, which can come later as a supplement. I think there is an upstream issue for stopping disruption: kubernetes-sigs/karpenter#2497

This might make a natural addition to that

Contributor

IMO, if an AZ is truly down, we rarely want to reduce our capacity in any other AZ, and we attempt to lock the world. We want to minimize churn during these situations as we attempt to migrate services out of that AZ, and if there is extra capacity, we are likely going to fill it very quickly.

Allowing Karpenter to disrupt would just make migration slower, since capacity would be spun down and then respun back up.

2. Stop performing voluntary disruption in the **impaired** AZ.
3. Stop performing voluntary disruption in the **unimpaired** AZs if the disruption relies on scheduling pods to the **impaired** AZ.
4. Pods with strict scheduling requirements that require capacity in the impaired AZ, such as volume requirements or node affinities, **should not** result in launch attempts.
5. If an option is set, pods with TSCs that require capacity in the impaired AZ should instead have capacity launched into unimpaired AZs, while still maintaining skew between the remaining unimpaired AZs.
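For point 5, maintaining skew among the remaining AZs could look like round-robin distribution across only the unimpaired zones. A toy sketch of that behavior, not Karpenter's actual scheduling logic:

```go
package main

import "fmt"

// spreadAcrossUnimpaired distributes n pending pods round-robin across the
// unimpaired zones, so the count difference between any two remaining zones
// never exceeds one (the skew a maxSkew=1 TSC would demand). This is a toy
// sketch, not Karpenter's scheduler.
func spreadAcrossUnimpaired(n int, zones []string) map[string]int {
	counts := make(map[string]int, len(zones))
	for i := 0; i < n; i++ {
		counts[zones[i%len(zones)]]++
	}
	return counts
}

func main() {
	// With us-east-1b impaired, 5 pods spread across the two remaining zones.
	fmt.Println(spreadAcrossUnimpaired(5, []string{"us-east-1a", "us-east-1c"}))
}
```

With 5 pods and two remaining zones this yields a 3/2 split, keeping the skew between unimpaired zones at one.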

If the cluster topology consists of 3 zones and 1 is impaired, how will pods get scheduled in the unimpaired zones (without changing `whenUnsatisfiable` to `ScheduleAnyway`)?

### Events

Karpenter could emit events against NodePools that allow instances in the impaired AZ, to indicate that new nodes cannot be provisioned in that AZ. This is not required for an initial release, but could be a nice follow-up.
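As an illustration of what such an event might carry, here is a minimal stand-in. In Karpenter this would go through its events recorder against the NodePool object; the `ZonalShiftActive` reason string and helper name are purely hypothetical:

```go
package main

import "fmt"

// Event is a minimal stand-in for a Kubernetes Event; the "ZonalShiftActive"
// reason string is hypothetical, not an existing Karpenter event reason.
type Event struct {
	InvolvedObject string
	Reason         string
	Message        string
}

func impairedZoneEvent(nodePool, zone string) Event {
	return Event{
		InvolvedObject: nodePool,
		Reason:         "ZonalShiftActive",
		Message:        fmt.Sprintf("new nodes cannot be provisioned in %s while a zonal shift is active", zone),
	}
}

func main() {
	e := impairedZoneEvent("default", "us-east-1b")
	fmt.Printf("%s: %s: %s\n", e.InvolvedObject, e.Reason, e.Message)
}
```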
Contributor

I'd say that, at the very least, the NodePool/NodeClass status should be updated with an "impaired AZ" condition.
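For illustration, such a status condition could take the usual `metav1.Condition` shape; the `ZoneImpaired` type and `ZonalShiftActive` reason below are hypothetical names, not existing Karpenter conditions:

```go
package main

import "fmt"

// Condition mirrors the usual metav1.Condition shape; the "ZoneImpaired"
// type and "ZonalShiftActive" reason are hypothetical names for what a
// NodePool status condition could report.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

func zoneImpairedCondition(zone string) Condition {
	return Condition{
		Type:    "ZoneImpaired",
		Status:  "True",
		Reason:  "ZonalShiftActive",
		Message: fmt.Sprintf("zonal shift active for %s; provisioning there is paused", zone),
	}
}

func main() {
	c := zoneImpairedCondition("us-east-1b")
	fmt.Println(c.Type, c.Status, c.Reason)
}
```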

@GnatorX

GnatorX commented Mar 17, 2026

Sorry, didn't realize which user I was on GitHub. Overall looks good to me. Minor comments on understanding how certain pieces work. Thanks for the write-up! Looking forward to this.



Development

Successfully merging this pull request may close these issues.

Add support for Amazon Application Recovery Controller (ARC) Zonal Shift feature
