# Adaptive Output Token Escalation Design

> Reduces GPU slot over-reservation by ~4x through a "low default + escalate on truncation" strategy for output tokens.

## Problem

Every API request reserves a GPU slot whose size is proportional to `max_tokens`. With the previous default of 32K tokens, each request reserved a 32K output slot even though 99% of responses are under 5K tokens. This over-reserved GPU capacity by 4-6x, limiting server concurrency and increasing cost.

## Solution

Use a capped default of **8K** output tokens. When a response is truncated (the model hits `max_tokens`), automatically retry once with an escalated limit of **64K**. Since fewer than 1% of requests are actually truncated, the expected slot reservation stays close to 8K per request (roughly 0.99 × 8K + 0.01 × 64K ≈ 8.6K, versus a flat 32K before), while long responses are still generated in full via the escalated retry.

## Architecture

```
        ┌─────────────────────────┐
        │    Request starts       │
        │    max_tokens = 8K      │
        └───────────┬─────────────┘
                    │
                    ▼
        ┌─────────────────────────┐
        │   Stream response       │
        └───────────┬─────────────┘
                    │
          ┌─────────┴─────────┐
          │                   │
    finish_reason       finish_reason
    != MAX_TOKENS       == MAX_TOKENS
          │                   │
          ▼                   ▼
    ┌───────────┐  ┌─────────────────────┐
    │   Done    │  │ Check conditions:   │
    └───────────┘  │ - No user override? │
                   │ - No env override?  │
                   │ - Not already       │
                   │   escalated?        │
                   └─────────┬───────────┘
                       YES   │   NO
                   ┌─────────┴────┐
                   │              │
                   ▼              ▼
            ┌─────────────┐  ┌──────────┐
            │ Pop partial │  │   Done   │
            │ model resp  │  │ (truncd) │
            │ from history│  └──────────┘
            │             │
            │ Yield RETRY │
            │ event       │
            │             │
            │ Re-send     │
            │ max_tokens  │
            │ = 64K       │
            └─────────────┘
```

## Token limit determination

The effective `max_tokens` is resolved in the following priority order:

| Priority    | Source                                                | Value (known model)          | Value (unknown model) | Escalation behavior            |
| ----------- | ----------------------------------------------------- | ---------------------------- | --------------------- | ------------------------------ |
| 1 (highest) | User config (`samplingParams.max_tokens`)             | `min(userValue, modelLimit)` | `userValue`           | No escalation                  |
| 2           | Environment variable (`QWEN_CODE_MAX_OUTPUT_TOKENS`)  | `min(envValue, modelLimit)`  | `envValue`            | No escalation                  |
| 3 (lowest)  | Capped default                                        | `min(modelLimit, 8K)`        | `min(32K, 8K)` = 8K   | Escalates to 64K on truncation |

A "known model" is one that has an explicit entry in `OUTPUT_PATTERNS` (checked via `hasExplicitOutputLimit()`). For known models, the effective value is always capped at the model's declared output limit to avoid API errors. Unknown models (custom deployments, self-hosted endpoints) pass the user's value through directly, since the backend may support larger limits.

This logic is implemented in three content generators:

- `DefaultOpenAICompatibleProvider.applyOutputTokenLimit()` — OpenAI-compatible providers
- `DashScopeProvider` — inherits `applyOutputTokenLimit()` from the default provider
- `AnthropicContentGenerator.buildSamplingParameters()` — Anthropic provider
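
As a rough sketch of this priority order (illustrative only, not the actual `applyOutputTokenLimit()` code): the function name `resolveMaxOutputTokens`, the `modelOutputLimit` parameter, and the import path are assumptions; `hasExplicitOutputLimit()`, `CAPPED_DEFAULT_MAX_TOKENS`, and the `QWEN_CODE_MAX_OUTPUT_TOKENS` variable come from this document.

```ts
import { CAPPED_DEFAULT_MAX_TOKENS, hasExplicitOutputLimit } from './tokenLimits.js';

interface ResolvedLimit {
  maxTokens: number;
  canEscalate: boolean; // only the capped default is eligible for escalation to 64K
}

// Hypothetical helper mirroring the priority table above.
function resolveMaxOutputTokens(
  model: string,
  modelOutputLimit: number, // declared limit for known models; fallback (e.g. 32K) for unknown ones
  userMaxTokens?: number,   // samplingParams.max_tokens, if the user set it
): ResolvedLimit {
  const known = hasExplicitOutputLimit(model);
  // Known models are always capped at their declared limit; unknown models pass values through.
  const cap = (value: number) => (known ? Math.min(value, modelOutputLimit) : value);

  // Priority 1: user config, never escalated.
  if (userMaxTokens !== undefined) {
    return { maxTokens: cap(userMaxTokens), canEscalate: false };
  }

  // Priority 2: environment variable, never escalated.
  const envValue = Number(process.env['QWEN_CODE_MAX_OUTPUT_TOKENS']);
  if (Number.isFinite(envValue) && envValue > 0) {
    return { maxTokens: cap(envValue), canEscalate: false };
  }

  // Priority 3: capped default, min(modelLimit, 8K); escalates to 64K on truncation.
  return { maxTokens: Math.min(modelOutputLimit, CAPPED_DEFAULT_MAX_TOKENS), canEscalate: true };
}
```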

## Escalation mechanism

The escalation logic lives in `geminiChat.ts`, placed **outside** the main retry loop. This is intentional:

1. The retry loop handles transient errors (rate limits, invalid streams, content validation)
2. Truncation is not an error — it's a successful response that was cut short
3. Errors from the escalated stream should propagate directly to the caller, not be caught by retry logic

### Escalation steps (geminiChat.ts)

```
1. Stream completes successfully (lastError === null)
2. Last chunk has finishReason === MAX_TOKENS
3. Guard checks pass:
   - maxTokensEscalated === false (prevent infinite escalation)
   - hasUserMaxTokensOverride === false (respect user intent)
4. Pop the partial model response from chat history
5. Yield RETRY event → UI discards partial output
6. Re-send the same request with maxOutputTokens: 64K
```
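
A minimal sketch of how these steps fit together (not the actual `geminiChat.ts` code; the helper names `streamOnce`, `popPartialModelResponse`, and `emitRetryEvent` and the simplified types are stand-ins, while the guards and the 64K constant mirror the steps above):

```ts
import { ESCALATED_MAX_TOKENS } from './tokenLimits.js';

interface StreamChunk {
  finishReason?: string; // e.g. 'MAX_TOKENS', 'STOP'
}

interface StreamResult {
  lastChunk?: StreamChunk;
  lastError: Error | null;
}

// Hypothetical wrapper around a single streaming attempt (which internally owns
// the normal retry loop for transient errors).
async function sendWithEscalation(
  initialMaxTokens: number,
  hasUserMaxTokensOverride: boolean,
  streamOnce: (maxOutputTokens: number) => Promise<StreamResult>,
  popPartialModelResponse: () => void, // removes the partial turn from chat history
  emitRetryEvent: () => void,          // tells the UI to discard the partial output
): Promise<StreamResult> {
  let maxTokensEscalated = false;
  let maxOutputTokens = initialMaxTokens;

  while (true) {
    const result = await streamOnce(maxOutputTokens);

    const truncated =
      result.lastError === null && result.lastChunk?.finishReason === 'MAX_TOKENS';

    // Guards: escalate at most once, and never when the user (or env) set the limit.
    if (!truncated || maxTokensEscalated || hasUserMaxTokensOverride) {
      return result;
    }

    maxTokensEscalated = true;              // prevent infinite escalation
    popPartialModelResponse();              // step 4
    emitRetryEvent();                       // step 5
    maxOutputTokens = ESCALATED_MAX_TOKENS; // step 6: re-send with 64K
  }
}
```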

### State cleanup on RETRY (turn.ts)

When the `Turn` class receives a RETRY event, it clears accumulated state to prevent inconsistencies:

- `pendingToolCalls` — cleared to avoid duplicate tool calls if the first truncated response contained completed tool calls that are repeated in the escalated response
- `pendingCitations` — cleared to avoid duplicate citations
- `debugResponses` — cleared to avoid stale debug data
- `finishReason` — reset to `undefined` so the new response's finish reason is used
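
A minimal sketch of that cleanup (the method name `handleRetryEvent` and the field types are assumptions; the cleared fields are the ones listed above):

```ts
// Hypothetical fragment of the Turn class; field names match the list above.
class Turn {
  private pendingToolCalls: unknown[] = [];
  private pendingCitations: unknown[] = [];
  private debugResponses: unknown[] = [];
  private finishReason: string | undefined;

  /** Called when a RETRY event arrives: discard state from the truncated stream. */
  handleRetryEvent(): void {
    this.pendingToolCalls = [];    // avoid duplicating tool calls repeated by the escalated response
    this.pendingCitations = [];    // avoid duplicate citations
    this.debugResponses = [];      // drop stale debug data
    this.finishReason = undefined; // let the new response supply the finish reason
  }
}
```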

## Constants

Defined in `tokenLimits.ts`:

| Constant                    | Value  | Purpose                                                  |
| --------------------------- | ------ | -------------------------------------------------------- |
| `CAPPED_DEFAULT_MAX_TOKENS` | 8,000  | Default output token limit when no user override is set  |
| `ESCALATED_MAX_TOKENS`      | 64,000 | Output token limit used on truncation retry              |
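
As a sketch, the declarations would look roughly like this (values from the table above; the exact declaration style in `tokenLimits.ts` is an assumption):

```ts
// tokenLimits.ts (sketch)
export const CAPPED_DEFAULT_MAX_TOKENS = 8_000;  // default output limit when no override is set
export const ESCALATED_MAX_TOKENS = 64_000;      // limit used for the single truncation retry
```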

## Design decisions

### Why 8K default?

- 99% of responses are under 5K tokens
- 8K provides reasonable headroom for slightly longer responses without triggering unnecessary retries
- Reduces average slot reservation from 32K to 8K (4x improvement)

### Why 64K escalated limit?

- Covers the vast majority of long outputs that were truncated at 8K
- Matches the output limit of many modern models (Claude Sonnet, Gemini 3.x, Qwen3.x)
- Higher values (e.g., 128K) would negate slot optimization benefits for the <1% of requests that escalate

### Why not progressive escalation (8K → 16K → 32K → 64K)?

- Each retry adds latency (the full response must be regenerated)
- A single retry is the simplest approach that captures almost all cases
- The <1% truncation rate at 8K means almost no requests need escalation; those that do are likely to need significantly more than 16K

### Why is escalation outside the retry loop?

- Truncation is a success case, not an error
- Errors from the escalated stream (rate limits, network failures) should propagate directly rather than being silently retried with incorrect parameters
- Keeps the retry loop focused on its original purpose (transient error recovery)