Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates the retry logic to treat HTTP 404 responses as transient (to mitigate a timing issue where metadata requests can arrive before the NodeService CCF cache file is written) and extends the retry test suite to cover the new behavior.
Changes:
- Add
http.StatusNotFound(404) to the set of retryable HTTP status codes. - Refactor the retry tests’ “healing server” into a configurable handler and add a test that validates retry-on-404 behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| internal/requesthelper/retry.go | Expands transient HTTP status classification to include 404 so WithRetries will retry NotFound responses. |
| internal/requesthelper/retry_test.go | Updates healing server test scaffolding and adds coverage ensuring 404s are retried until success. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| func TestWithRetries_retry404(t *testing.T) { | ||
| h := &healingServer{ | ||
| HealAfter: 4, | ||
| FailCode: 404, | ||
| SuccessCode: 200, | ||
| } | ||
| srv := httptest.NewServer(h) |
There was a problem hiding this comment.
In this new test, the response body is closed via defer resp.Body.Close() before asserting err == nil / resp != nil (same pattern as the healing-server test). If the retry logic changes and the request doesn’t succeed, the test will panic with a nil-pointer dereference and mask the real failure. Prefer deferring the close only after confirming resp/resp.Body are non-nil (or guard the defer).
| h := &healingServer{ | ||
| HealAfter: 4, | ||
| FailCode: 500, | ||
| SuccessCode: 200, | ||
| } |
There was a problem hiding this comment.
Use http.Status... constants instead of raw status-code integers (e.g., http.StatusInternalServerError, http.StatusOK) to avoid magic numbers and make intent clearer.
|
General comment: In addition to there being 404s due to the NS cache not holding the latest CCF file path (metadata regarding newly added VM Apps not present), there is also the scenario where a VM App is already present on the VM and is being upgraded. In this case, if the latest CCF is not known by HGAP, it will provide metadata for the older version to the extension. This will never surface as a 404 and the new version would never be installed. This PR would address it - #58 - but I didn't merge it earlier because of the potential it has to surface larger numbers of failures, especially with RCVv2. I think this is a step in the right direction, we can start retrying on 404s for scenarios where the older goal state did not contain VM Apps and the extension encounters a 404. Eventually, we will want to come up with a plan for flighting the change for the extension to include the app version in its request to HGAP. |
Due to a NodeService CCF caching issue, the ccf file may not be written at the time the metadata request arrives. To get around this, simply retry on 404 responses from the host.