Skip to content

fix(subnetctl): propagate fatal HTTP errors instead of waiting on timeout#1035

Open
unameisfine wants to merge 1 commit intogonka-ai:mainfrom
unameisfine:bugfix/subnetctl-propagate-fatal-http-errors
Open

fix(subnetctl): propagate fatal HTTP errors instead of waiting on timeout#1035
unameisfine wants to merge 1 commit intogonka-ai:mainfrom
unameisfine:bugfix/subnetctl-propagate-fatal-http-errors

Conversation

@unameisfine
Copy link
Copy Markdown

Closes #1019 (minimal fix — §1 of the issue).

Problem

sendAndProcess in subnet/cmd/subnetctl/proxy.go swallowed any SendOnly error whose response was nil:

resp, sendErr := p.session.SendOnly(ctx, prepared)
if sendErr != nil && resp == nil {
    return false, 0, nil   // fatal 403 is indistinguishable from a network blip
}

A 403 "sender not in group" (the example from the issue — wrong creator key vs escrow ACL) was therefore indistinguishable from a transient network failure. runInference waited through RefusalTimeout, then ExecutionTimeout, and eventually returned a misleading insufficient timeout votes / inference timed out: TIMEOUT_REASON_REFUSED error.

Operators saw minute-long waits and confusing timeout messages instead of an immediate 4xx with the real cause.

Fix

Classify HTTP transport errors by status code and fail fast on fatal 4xx responses. Retryable errors (5xx, 429, network) keep the existing deadline-based retry path so a transient blip still recovers.

Scope

This PR implements §1 of the issue (non-retryable fatal errors). The broader phase-aware retry policy (§2 post-start retries, §3 protocol continuation) is intentionally left for a follow-up — it is a larger design change that benefits from being reviewed on its own.

Changes

  • subnet/transport/errors.go (new): HTTPStatusError typed error with StatusCode / Path / Body and IsFatal() / IsFatalHTTPError() helpers. Fatal = 4xx except 429 (Too Many Requests is explicitly retryable).
  • subnet/transport/client.go: doPostRaw returns *HTTPStatusError on non-200 instead of a plain fmt.Errorf, so callers can classify via errors.As.
  • subnet/cmd/subnetctl/proxy.go: sendAndProcess now propagates fatal HTTP errors immediately with host rejected inference: ... wrapping. Retryable errors still return (false, 0, nil) to preserve the existing behavior.
  • Tests:
    • transport/errors_test.go: unit tests for the fatal/retryable classification table and errors.As unwrapping.
    • cmd/subnetctl/proxy_test.go:
      • killableClient gains a KillFatal() method that makes Send return *transport.HTTPStatusError{StatusCode: 403}.
      • TestRunInference_FatalHTTPErrorReturnsImmediately is the red-green regression test for subnetctl: inference error handling #1019.

Red-green proof

Without the fix (stashed proxy.go change, ran the new test):

Error: "inference 1 timed out: TIMEOUT_REASON_REFUSED" does not contain "host rejected inference"
--- FAIL: TestRunInference_FatalHTTPErrorReturnsImmediately (1.00s)

The 1.00s is the RefusalTimeout the old code burned through before returning the misleading timeout error.

With the fix:

--- PASS: TestRunInference_FatalHTTPErrorReturnsImmediately (0.00s)

The test asserts:

  • Error contains "host rejected inference" and "403", and not "timed out" / "insufficient votes".
  • transport.IsFatalHTTPError(err) is true (error chain preserves the typed cause).
  • elapsed < 500ms (would be 2s+ on the old path).
  • Inference stays StatusPending — the timeout machinery is correctly bypassed.

Test plan

  • go test ./cmd/subnetctl/ ./transport/ (all pass locally)
  • Existing TestRunInference_RefusalTimeout unchanged — retryable errors still fall through to the timeout path.

…eout

Before this change, sendAndProcess swallowed any SendOnly error whose
response was nil, returning (false, 0, nil) regardless of the underlying
cause. A 403 "sender not in group" from the host auth layer — a
misconfiguration that cannot succeed on retry — was therefore indistin-
guishable from a brief network blip, and runInference waited through
RefusalTimeout and ExecutionTimeout before eventually returning a
misleading "insufficient timeout votes" error.

Classify HTTP transport errors by status code and fail fast on fatal
4xx responses (400/401/403/404/422/...), while keeping 5xx, 429 and
network-level failures on the existing deadline-based retry path so
transient issues still recover.

Changes:

- transport: add HTTPStatusError typed error with StatusCode/Path/Body
  and IsFatal() / IsFatalHTTPError() helpers (4xx except 429 = fatal).
- transport: doPostRaw returns *HTTPStatusError on non-200 so callers
  can classify via errors.As.
- subnetctl/proxy: sendAndProcess propagates fatal HTTP errors
  immediately with "host rejected inference: ..." wrapping, preserving
  the status code in the error chain for future caller mapping.
- tests: red-green regression test that fails without the fix (waits
  through RefusalTimeout and returns a timeout error) and passes with
  the fix (returns in <500ms with the 403 cause visible, inference
  stays Pending instead of being marked TimedOut). Unit tests cover
  the status classification table.

Closes gonka-ai#1019
Copy link
Copy Markdown

@Doog-bot534 Doog-bot534 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: fix(subnetctl): propagate fatal HTTP errors instead of waiting on timeout

Strong Approve

The cleanest PR of the batch. Well-scoped, well-tested, and the abstraction (`HTTPStatusError` + `IsFatal()`) is reusable and properly layered.

Strengths

  • Proper structured error with `errors.As` support. The `IsFatal()` classification (4xx fatal except 429) matches HTTP semantics.
  • Backward-compatible change — existing `err.Error()` callers see the same string format.
  • Excellent test: asserts error message content, error chain unwrapping, timing (<500ms vs 2s+), AND state consistency (inference stays Pending).

Minor suggestions

  1. Response body truncation: `doPostRaw` reads the entire response body into the error via `io.ReadAll`. A malicious host could return a multi-MB error body. Consider truncating `Body` to ~4KB in `HTTPStatusError`.

  2. HTTP 408 Request Timeout: Currently classified as fatal (4xx). It's technically retryable. Edge case — hosts probably don't return 408 — but worth considering for completeness.

Verdict: Ship it. Clean design, strong tests, well-scoped change.

Payout address: gonka10zaal553duxp05nvfpqtsqrm2g0j6j34r8nan7

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

subnetctl: inference error handling

2 participants