feat: Integrate AWS Application Recovery Controller Zonal Shift with Karpenter#9042
feat: Integrate AWS Application Recovery Controller Zonal Shift with Karpenter#9042
Conversation
|
/assign |
| "github.com/awslabs/operatorpkg/option" | ||
| "sigs.k8s.io/controller-runtime/pkg/manager" | ||
|
|
||
| arczonalshiftProvider "github.com/aws/karpenter-provider-aws/pkg/providers/arczonalshift" |
There was a problem hiding this comment.
| arczonalshiftProvider "github.com/aws/karpenter-provider-aws/pkg/providers/arczonalshift" | |
| ZonalShiftProvider arconalshiftprovider.Provider |
We don't use camel case in imports
|
|
||
| type DefaultProvider struct { | ||
| sync.RWMutex | ||
| zonalShiftStatuses map[string]shiftStatus |
There was a problem hiding this comment.
nit: when you have a [string] map, its nice to add a comment that tells people what it maps from->to
| zonalShiftStatuses map[string]shiftStatus | |
| zonalShiftStatuses map[string]shiftStatus // map zoneid (string) -> shiftStatus |
| sync.RWMutex | ||
| zonalShiftStatuses map[string]shiftStatus | ||
|
|
||
| client sdk.ARCZonalShiftAPI |
There was a problem hiding this comment.
nit: follow the usage of ec2API in other providers
| client sdk.ARCZonalShiftAPI | |
| arcZonalShiftAPI sdk.ARCZonalShiftAPI |
| return nil | ||
| } | ||
|
|
||
| func (p *DefaultProvider) IsZonalShifted(ctx context.Context, zone string) bool { |
There was a problem hiding this comment.
nit: we should be specific if its zoneId or zoneName and only accept one unless we need to accept multiple
| }), "failed to setup node instanceID indexer") | ||
| } | ||
|
|
||
| func GetAvailablityZoneMapping(ctx context.Context, ec2Api sdk.EC2API) map[string]string { |
There was a problem hiding this comment.
This doesn't need to exist, we have the mapping from subnets already, more below
| func (p *DefaultProvider) IsZonalShifted(ctx context.Context, zone string) bool { | ||
| p.RLock() | ||
| defer p.RUnlock() | ||
| //if shift, ok := p.zonalShiftStatuses[zone]; ok { |
There was a problem hiding this comment.
I'm assuming we'll rip these out :D
| } | ||
| } | ||
|
|
||
| func (p *DefaultProvider) FetchZonalShifts(ctx context.Context) (map[string]shiftStatus, error) { |
| if getManagedResourceErr != nil { | ||
| // Resource is not found/registered in Zonal Shift. Log a message and use the NoopProvider so we don't block starting up. | ||
| log.FromContext(ctx).WithValues("Cluster", clusterArn).V(1).Info("Cluster not found in Zonal Shift") | ||
| zsProvider = arczonalshiftProvider.NewNoopProvider() |
There was a problem hiding this comment.
This should probably panic, I don't think we should fail with just a log
| _, getManagedResourceErr := arczonalshiftAPI.GetManagedResource(ctx, &inputGMR) | ||
| if getManagedResourceErr != nil { | ||
| // Resource is not found/registered in Zonal Shift. Log a message and use the NoopProvider so we don't block starting up. | ||
| log.FromContext(ctx).WithValues("Cluster", clusterArn).V(1).Info("Cluster not found in Zonal Shift") |
| } | ||
|
|
||
| func (c *Controller) Register(_ context.Context, m manager.Manager) error { | ||
| return controllerruntime.NewControllerManagedBy(m).Named("zonalshift").WatchesRawSource(singleton.Source()).Complete(singleton.AsReconciler(c)) |
There was a problem hiding this comment.
nit: you can put these . calls on new lines
Fixes #N/A
Description
This change integrates AWS's Application Recovery Controller Zonal Shift with Karpenter. Occasionally, zones in cloud providers can experience temporary outages. During these events, Karpenter's actions do not improve its cluster's availability posture and can sometimes exacerbate the scenario. By integrating Zonal Shift, Karpenter can be made aware of a customer's intention to shift traffic and scaling away from a zone.
Zonal Shift Documentation: https://docs.aws.amazon.com/r53recovery/latest/dg/arc-zonal-shift.html
Using Zonal Shift with EKS: https://docs.aws.amazon.com/r53recovery/latest/dg/arc-zonal-shift.resource-types.eks.html
How was this change tested?
Local testing by deploying to an AWS EKS cluster and validating that the integration works as expected.
Does this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.