
Fix error setting app password in hack create due to eventual consistency in ms graph api #4736

Open
mrWinston wants to merge 2 commits into master from eventual-consistency-issue-in-hack-cluster-create


@mrWinston
Collaborator

What this PR does / why we need it:

  • Creating a cluster with the hack script sometimes fails with this error:
INFO[2026-04-01T10:55:55+02:00]pkg/util/cluster/cluster.go:326 cluster.(*Cluster).createApp() Creating AAD application                     
(*odataerrors.MainError)(0xc0006128a0)({
 backingStore: (*store.InMemoryBackingStore)(0xc0007a42c0)({
  returnOnlyChangedValues: (bool) false,
  initializationCompleted: (bool) false,
  store: (map[string]interface {}) (len=3) {
   (string) (len=4) "code": (*string)(0xc0006125f0)((len=24) "Request_ResourceNotFound"),
   (string) (len=7) "message": (*string)(0xc000612670)((len=128) "Resource 'redacted' does not exist or one of its queried reference-property objects are not present."),
   (string) (len=10) "innerError": (*odataerrors.InnerError)(0xc000612930)({
....<bunch of other unrelated lines>....
})
FATA[2026-04-01T10:55:58+02:00]hack/cluster/cluster.go:57 main.main() error status code received from the API
  • This happens because of the eventual consistency of the Microsoft Graph API.
    • After we create an Application, there can be a short window during which looking up the new application by its ID fails with a 404.
  • I've added a polling loop that retries the password creation if it fails with this 404 error.

Test plan for issue:

  • Tested by trying to create a cluster.

How do you know this will function as expected in production?

  • Only relevant for local dev environment

Contributor

Copilot AI left a comment


Pull request overview

Adds resilience to local cluster creation by retrying Microsoft Graph AddPassword calls to handle eventual consistency after creating an AAD Application.

Changes:

  • Introduces retry constants for Graph API calls.
  • Wraps AddPassword in a polling/retry loop that retries on 404 “not found” errors.


@mrWinston mrWinston force-pushed the eventual-consistency-issue-in-hack-cluster-create branch from 6b4ca24 to c3b45a6 Compare April 1, 2026 14:52
Copilot AI review requested due to automatic review settings April 1, 2026 15:06
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.



// check if returned error is 404 not found, if so, retry the operation
numRetries := 0
var pwResult msgraph_models.PasswordCredentialable
err = wait.PollUntilWithContext(ctx, GraphApiRetryInterval, func(ctx context.Context) (done bool, err error) {
Collaborator


Suggestion: use wait.PollImmediateUntilWithContext (or do one direct call before polling) so retries still happen, but success path stays fast.

// check if returned error is 404 not found, if so, retry the operation
numRetries := 0
var pwResult msgraph_models.PasswordCredentialable
err = wait.PollUntilWithContext(ctx, GraphApiRetryInterval, func(ctx context.Context) (done bool, err error) {
Collaborator


Could we add a focused unit test around this retry loop?
Useful cases: retries on 404, no retry on non-404, and stops after max retries.
That would make this eventual-consistency handling safer to refactor later.


// retry loop, due to eventual consistency, the application we just created might not be found when queried immediately, only after a retry
// check if returned error is 404 not found, if so, retry the operation
numRetries := 0
Collaborator




4 participants