fix(subnetctl): propagate fatal HTTP errors instead of waiting on timeout #1035
Open
unameisfine wants to merge 1 commit into gonka-ai:main
…eout

Before this change, sendAndProcess swallowed any SendOnly error whose response was nil, returning (false, 0, nil) regardless of the underlying cause. A 403 "sender not in group" from the host auth layer (a misconfiguration that cannot succeed on retry) was therefore indistinguishable from a brief network blip, and runInference waited through RefusalTimeout and ExecutionTimeout before eventually returning a misleading "insufficient timeout votes" error.

Classify HTTP transport errors by status code and fail fast on fatal 4xx responses (400/401/403/404/422/...), while keeping 5xx, 429 and network-level failures on the existing deadline-based retry path so transient issues still recover.

Changes:
- transport: add HTTPStatusError typed error with StatusCode/Path/Body and IsFatal() / IsFatalHTTPError() helpers (4xx except 429 = fatal).
- transport: doPostRaw returns *HTTPStatusError on non-200 so callers can classify via errors.As.
- subnetctl/proxy: sendAndProcess propagates fatal HTTP errors immediately with "host rejected inference: ..." wrapping, preserving the status code in the error chain for future caller mapping.
- tests: red-green regression test that fails without the fix (waits through RefusalTimeout and returns a timeout error) and passes with the fix (returns in <500ms with the 403 cause visible, inference stays Pending instead of being marked TimedOut). Unit tests cover the status classification table.

Closes gonka-ai#1019
Doog-bot534 approved these changes on Apr 15, 2026
Review: fix(subnetctl): propagate fatal HTTP errors instead of waiting on timeout
Strong Approve ✅
The cleanest PR of the batch. Well-scoped, well-tested, and the abstraction (`HTTPStatusError` + `IsFatal()`) is reusable and properly layered.
Strengths
- Proper structured error with `errors.As` support. The `IsFatal()` classification (4xx fatal except 429) matches HTTP semantics.
- Backward-compatible change — existing `err.Error()` callers see the same string format.
- Excellent test: asserts error message content, error chain unwrapping, timing (<500ms vs 2s+), AND state consistency (inference stays Pending).
Minor suggestions
- Response body truncation: `doPostRaw` reads the entire response body into the error via `io.ReadAll`. A malicious host could return a multi-MB error body. Consider truncating `Body` to ~4KB in `HTTPStatusError`.
- HTTP 408 Request Timeout: Currently classified as fatal (4xx). It's technically retryable. This is an edge case (hosts probably don't return 408), but worth considering for completeness.
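Both suggestions can be sketched in a few lines. This is only an illustration, not code from the PR: `maxErrorBody`, `readErrorBody`, `isFatalStatus` and `cappedLen` are hypothetical names, and the 4 KB cap is the rough figure suggested above.

```go
package main

import (
	"io"
	"net/http"
	"strings"
)

// maxErrorBody is a hypothetical cap on how much of an error body is kept.
const maxErrorBody = 4 << 10 // 4 KiB

// readErrorBody reads at most maxErrorBody bytes from r, so a hostile host
// cannot make the client buffer an arbitrarily large error response.
func readErrorBody(r io.Reader) (string, error) {
	b, err := io.ReadAll(io.LimitReader(r, maxErrorBody))
	return string(b), err
}

// isFatalStatus is a variant of the "4xx except 429" rule that also treats
// 408 Request Timeout as retryable.
func isFatalStatus(code int) bool {
	if code == http.StatusRequestTimeout || code == http.StatusTooManyRequests {
		return false
	}
	return code >= 400 && code < 500
}

// cappedLen feeds a 1 MiB body through readErrorBody and returns how many
// bytes were kept.
func cappedLen() int {
	body, _ := readErrorBody(strings.NewReader(strings.Repeat("x", 1<<20)))
	return len(body)
}
```

With the cap in place the rest of the body is left unread, so the caller should still close the response body as it does today.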
Verdict: Ship it. Clean design, strong tests, well-scoped change.
Payout address: gonka10zaal553duxp05nvfpqtsqrm2g0j6j34r8nan7
Closes #1019 (minimal fix — §1 of the issue).
Problem
`sendAndProcess` in `subnet/cmd/subnetctl/proxy.go` swallowed any `SendOnly` error whose response was nil, returning `(false, 0, nil)` regardless of the underlying cause. A 403 "sender not in group" (the example from the issue: wrong creator key vs escrow ACL) was therefore indistinguishable from a transient network failure.

`runInference` waited through `RefusalTimeout`, then `ExecutionTimeout`, and eventually returned a misleading `insufficient timeout votes` / `inference timed out: TIMEOUT_REASON_REFUSED` error.

Operators saw minute-long waits and confusing timeout messages instead of an immediate 4xx with the real cause.
Fix
Classify HTTP transport errors by status code and fail fast on fatal 4xx responses. Retryable errors (5xx, 429, network) keep the existing deadline-based retry path so a transient blip still recovers.
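A minimal sketch of that classification, for readers who want the shape without opening the diff. The type and helper names follow this PR, but the method bodies and the `Error()` format are assumptions, and `classifySendError` is a hypothetical stand-in for the nil-response branch of `sendAndProcess`.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// HTTPStatusError mirrors the typed error described in this PR. The exact
// Error() format below is an assumption, not the merged code.
type HTTPStatusError struct {
	StatusCode int
	Path       string
	Body       string
}

func (e *HTTPStatusError) Error() string {
	return fmt.Sprintf("POST %s: status %d: %s", e.Path, e.StatusCode, e.Body)
}

// IsFatal reports whether a retry cannot help: any 4xx except 429.
func (e *HTTPStatusError) IsFatal() bool {
	return e.StatusCode >= 400 && e.StatusCode < 500 &&
		e.StatusCode != http.StatusTooManyRequests
}

// IsFatalHTTPError walks the error chain looking for a fatal HTTPStatusError.
func IsFatalHTTPError(err error) bool {
	var httpErr *HTTPStatusError
	return errors.As(err, &httpErr) && httpErr.IsFatal()
}

// classifySendError (hypothetical) wraps and returns fatal errors, and
// swallows everything else so the deadline-based retry path keeps running.
func classifySendError(err error) error {
	if IsFatalHTTPError(err) {
		return fmt.Errorf("host rejected inference: %w", err)
	}
	return nil
}
```

Wrapping with `%w` keeps the `*HTTPStatusError` reachable through `errors.As`, so a caller further up can still map the status code.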
Scope
This PR implements §1 of the issue (non-retryable fatal errors). The broader phase-aware retry policy (§2 post-start retries, §3 protocol continuation) is intentionally left for a follow-up — it is a larger design change that benefits from being reviewed on its own.
Changes
- `subnet/transport/errors.go` (new): `HTTPStatusError` typed error with `StatusCode`/`Path`/`Body` and `IsFatal()`/`IsFatalHTTPError()` helpers. Fatal = 4xx except 429 (Too Many Requests is explicitly retryable).
- `subnet/transport/client.go`: `doPostRaw` returns `*HTTPStatusError` on non-200 instead of a plain `fmt.Errorf`, so callers can classify via `errors.As`.
- `subnet/cmd/subnetctl/proxy.go`: `sendAndProcess` now propagates fatal HTTP errors immediately with `host rejected inference: ...` wrapping. Retryable errors still return `(false, 0, nil)` to preserve the existing behavior.
- `transport/errors_test.go`: unit tests for the fatal/retryable classification table and `errors.As` unwrapping.
- `cmd/subnetctl/proxy_test.go`: `killableClient` gains a `KillFatal()` method that makes `Send` return `*transport.HTTPStatusError{StatusCode: 403}`. `TestRunInference_FatalHTTPErrorReturnsImmediately` is the red-green regression test for #1019 (subnetctl: inference error handling).
Red-green proof
Without the fix (stashed `proxy.go` change, ran the new test), the test fails after 1.00s. The 1.00s is the `RefusalTimeout` the old code burned through before returning the misleading timeout error.

With the fix, the new test passes.
The test asserts:
- The error message contains `"host rejected inference"` and `"403"`, and not `"timed out"` / `"insufficient votes"`.
- `transport.IsFatalHTTPError(err)` is true (error chain preserves the typed cause).
- `elapsed < 500ms` (would be 2s+ on the old path).
- The inference stays `StatusPending`: the timeout machinery is correctly bypassed.
Test plan
- `go test ./cmd/subnetctl/ ./transport/` (all pass locally)
- `TestRunInference_RefusalTimeout` unchanged: retryable errors still fall through to the timeout path.
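The two paths these tests exercise can be illustrated with a toy retry loop. This is a deliberately simplified sketch, not the `runInference` timeout machinery: `sendWithDeadline`, the sentinel errors and the demo timings are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinels, used only in this sketch.
var (
	errFatal     = errors.New("403 sender not in group")
	errTransient = errors.New("connection reset")
	errTimedOut  = errors.New("inference timed out")
)

// Demo timings, far shorter than any real RefusalTimeout.
const (
	demoTimeout  = 50 * time.Millisecond
	demoInterval = 10 * time.Millisecond
)

// sendWithDeadline retries send until it succeeds or timeout elapses, but
// returns at once when the error is fatal. It reports how many attempts ran.
func sendWithDeadline(timeout, interval time.Duration, send func() error) (int, error) {
	deadline := time.Now().Add(timeout)
	for attempts := 1; ; attempts++ {
		err := send()
		switch {
		case err == nil:
			return attempts, nil
		case errors.Is(err, errFatal):
			// Fail fast: retrying a rejected request cannot succeed.
			return attempts, fmt.Errorf("host rejected inference: %w", err)
		case time.Now().After(deadline):
			return attempts, errTimedOut
		}
		time.Sleep(interval)
	}
}
```

A sender that always returns `errFatal` comes back after one attempt with the 403 cause in the message, while one that always returns `errTransient` keeps retrying until the deadline and ends with `errTimedOut`.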