Skip to content

fix: harden JS HTTP downloader resume, retry and stall handling#2399

Open
Moyasee wants to merge 7 commits into
mainfrom
fix/js-downloader
Open

fix: harden JS HTTP downloader resume, retry and stall handling#2399
Moyasee wants to merge 7 commits into
mainfrom
fix/js-downloader

Conversation

@Moyasee

@Moyasee Moyasee commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

When submitting this pull request, I confirm the following (please check the boxes):

  • I have read the Hydra documentation.
  • I have checked that there are no duplicate pull requests related to this request.
  • I have considered, and confirm that this submission is valuable to others.
  • I accept that this submission may not be used and the pull request may be closed at the discretion of the maintainers.

Fill in the PR content:

Reworks the native HTTP downloader so direct downloads stop stalling, freezing and growing past the real file size on some hosters. Closes the two reported issues and a handful of related edge cases found along the way.

Fixes #2388 (downloads adding gigs) and LBX-723 (DDL hosters freezing).

Size doubling (#2388)

The downloader resumed by stat-ing the partial and sending Range, but never checked the server honored it. When a host ignores Range and replies 200 with the full body, the old code appended the whole file on top of the partial (140GB became ~300GB) and reported startByte + contentLength as the total.

  • On a 200 resume, stream-skip the bytes already on disk and append only the remainder, keeping the partial intact instead of duplicating it.
  • Report contentLength (not startByte + contentLength) as the total for a 200.
  • Align to the server's Content-Range start on a 206 so an early/overlapping partial cannot duplicate bytes, and restart cleanly if it would leave a gap.
  • Refuse to append a body shorter than the existing partial.

Freezing / stalling (LBX-723)

undici surfaces a dropped socket as TypeError: terminated with the real code on err.cause. The old retry check only looked at err.message/err.code, so a recoverable disconnect was treated as fatal and the download stopped (pause + resume was the only workaround).

  • Treat terminated, err.cause.code and the UND_ERR_* family as retryable.
  • Reset the retry budget once a meaningful amount of new data has flowed in an attempt, with an absolute cap so a genuinely dead link still gives up.
  • Detect stalls by how long a single read has been blocked (not time since the last chunk) and raise the timeout, so a slow-but-alive transfer is not aborted. A download that truly cannot progress now raises an error and lets the queue advance instead of sitting "active" at 0 B/s.

Transient HTTP errors

408/429/500/502/503/504 were fatal, so a rate-limited link (common on Datanodes) failed to even start. They now back off and retry in both the preflight check and the downloader; permanent 4xx (403/404/410) still fail fast.

Other hardening

  • Send Accept-Encoding: identity and bail out of size-based resume under transfer compression.
  • Cancel the response reader on teardown to avoid leaking sockets / file descriptors.
  • Remember the Content-Disposition filename across retries so a resume targets the right partial.
  • Clamp reported progress, fix AllDebrid batch accounting, route batch failures through the error handler, and add a generation token so a cancel during preparation cannot spawn a second writer on the same file.

Testing

  • Pure decision logic extracted into a helpers module with unit tests (yarn test, 41 cases).
  • Verified end-to-end against a local server reproducing every observed behavior (200-on-resume, mid-stream terminated, 206 overlap, short body, 429, pause/resume, throttle, chunked) with byte-for-byte (SHA-256) checks.
  • Verified against live pixeldrain and Datanodes downloads: partial download, pause, and 206 resume are byte-identical to the server with no corruption at the boundary.

Fixes the direct downloader stalling, freezing and ballooning past the
real file size on some hosters (GH-2388, LBX-723).

Resume / size doubling (GH-2388):
- When a host ignores the Range header and answers 200 with the full
  body, stream-skip the bytes we already have and append only the
  remainder instead of writing the whole file on top of the partial.
- Stop reporting startByte + contentLength as the total on a 200 resume.
- Align to the server's Content-Range start on 206 so an early/overlapping
  partial does not duplicate bytes; restart cleanly if it would gap.
- Refuse to append a body that is shorter than the existing partial.

Freezing / stalling (LBX-723):
- Treat undici "terminated" and the real code on err.cause (UND_ERR_*,
  ECONNRESET, etc.) as retryable so a dropped socket resumes instead of
  failing the whole download.
- Reset the retry budget once a meaningful amount of new data has flowed
  in an attempt, capped so a dead link still gives up.
- Base stall detection on time spent waiting inside a single read and
  raise the timeout, so a slow-but-alive transfer is not aborted; a
  truly stuck download now surfaces an error and lets the queue advance.

Transient HTTP errors:
- Retry 408/429/500/502/503/504 with backoff in both the downloader and
  the preflight check so a rate-limited link no longer fails to start;
  permanent 4xx still fail fast.

Other:
- Force Accept-Encoding: identity and bail out of size-based resume under
  transfer compression.
- Cancel the response reader on teardown to avoid leaking sockets.
- Remember the Content-Disposition filename across retries.
- Clamp progress, guard AllDebrid batch accounting, route batch failures
  through the error handler, and add a generation token so a cancel during
  preparation cannot spawn a second writer on the same file.
- Extract the pure decision helpers and cover them with unit tests.
@greptile-apps

greptile-apps Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR hardens the JS HTTP downloader by fixing file-size doubling on servers that ignore Range headers (200-on-resume now stream-skips already-downloaded bytes instead of appending a full duplicate body), and by making disconnects that surface as TypeError: terminated retryable instead of fatal. It also introduces a retry budget that resets when meaningful data flows, stall detection based on how long a single read() has been blocked, and a generation token to prevent a cancelled preparation from spawning a second downloader.

  • Resume correctness: resolveResumeAction and applySkip replace the old naive append-on-206 logic; computeFileSize reports the server's total rather than startByte + contentLength; body shorter than the partial is rejected.
  • Retry hardening: isRetryableDownloadError covers undici terminated/UND_ERR_*, honours an explicit retryable: false flag to prevent permanent failures from looping, and isRetryableHttpStatus makes 429/500/502/503/504 retry in both preflight and the downloader itself.
  • Leak prevention: the ReadableStream reader is now cancelled via a destroy hook, writeStream.destroy() replaces close(), and isDownloading is reset in an outer finally that survives unexpected throws.

Confidence Score: 5/5

The change is safe to merge: it fixes two documented data-corruption and stall bugs without regressing any existing paths, and the pure-function helpers are fully unit-tested.

All critical paths (resume, retry, stall, cancel-during-prepare) are covered by either unit tests or explicit integration scenarios. The generation-token race fix is logically sound in a single-threaded JS runtime. The helper extraction makes all decision logic independently verifiable, and the 41-case test suite directly exercises each branch. No defects were found in the changed code.

No files require special attention. js-http-downloader.ts is the most complex file but its key decision logic has been moved to well-tested helpers, and the new stream lifecycle follows correct Node.js stream contracts.

Important Files Changed

Filename Overview
src/main/services/download/js-http-downloader-helpers.ts New helpers module: pure functions for retry classification, resume action resolution, skip logic, stall detection, and progress clamping — all well-tested and correctly implemented
src/main/services/download/js-http-downloader-helpers.test.ts 41 unit test cases covering all helper functions, including the new retryable:false early-return and edge cases for skip, stall detection, and progress clamping
src/main/services/download/js-http-downloader.ts Major rework: proper 200-on-resume skip logic, reader.cancel() on destroy, resolvedFilename pinned across retries, retry budget resets, stall detection based on pending-read timestamp, and outer finally ensuring isDownloading is always cleared
src/main/services/download/download-manager.ts Adds generation token for cancel-during-preparation race, preflight retry loop for transient HTTP statuses, clampProgress on progress updates, and routes AllDebrid batch failures through the error handler

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[startDownload] --> B[startDownloadWithRetry loop]
    B --> C[prepareDownloadPath]
    C --> D[buildRequestHeaders]
    D --> E[executeDownload]
    E --> F{HTTP status?}
    F -->|416 + local complete| G[status=complete]
    F -->|400+ retryable 429/500/502/503/504| H[HttpDownloadStatusError retryable=true]
    F -->|400+ permanent 403/404/410| I[fatal error]
    F -->|200 or 206| J[resolveResumeAction]
    J -->|200 + startByte>0| K[flags=a skipBytes=startByte]
    J -->|206 overlap| L[flags=a skipBytes=overlap]
    J -->|206 gap| M[flags=w restart]
    J -->|206 aligned| N[flags=a skipBytes=0]
    K & L & M & N --> O[createReadableStream]
    O --> P{done?}
    P -->|remainingToSkip>0| Q[destroy: body too short]
    P -->|shouldWrite=false| O
    P -->|shouldWrite=true| R[push chunk]
    R --> O
    P -->|done + skip=0| S[push null complete]
    H --> T[handleDownloadErrorWithRetry]
    T -->|canRetry| U[backoff retry]
    T -->|stall exhausted| V[fatal stall error]
    T -->|permanent| W[fatal error]
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
flowchart TD
    A[startDownload] --> B[startDownloadWithRetry loop]
    B --> C[prepareDownloadPath]
    C --> D[buildRequestHeaders]
    D --> E[executeDownload]
    E --> F{HTTP status?}
    F -->|416 + local complete| G[status=complete]
    F -->|400+ retryable 429/500/502/503/504| H[HttpDownloadStatusError retryable=true]
    F -->|400+ permanent 403/404/410| I[fatal error]
    F -->|200 or 206| J[resolveResumeAction]
    J -->|200 + startByte>0| K[flags=a skipBytes=startByte]
    J -->|206 overlap| L[flags=a skipBytes=overlap]
    J -->|206 gap| M[flags=w restart]
    J -->|206 aligned| N[flags=a skipBytes=0]
    K & L & M & N --> O[createReadableStream]
    O --> P{done?}
    P -->|remainingToSkip>0| Q[destroy: body too short]
    P -->|shouldWrite=false| O
    P -->|shouldWrite=true| R[push chunk]
    R --> O
    P -->|done + skip=0| S[push null complete]
    H --> T[handleDownloadErrorWithRetry]
    T -->|canRetry| U[backoff retry]
    T -->|stall exhausted| V[fatal stall error]
    T -->|permanent| W[fatal error]
Loading

Reviews (2): Last reviewed commit: "fix: make retryable flag authoritative a..." | Re-trigger Greptile

Comment thread src/main/services/download/js-http-downloader-helpers.ts
Comment thread src/main/services/download/download-manager.ts Outdated
Comment thread src/main/services/download/download-manager.ts Outdated
Address review feedback:
- isRetryableDownloadError now returns false when an error is explicitly
  tagged retryable: false, so a permanent failure whose message happens to
  contain "connection"/"timeout"/etc. is no longer retried.
- Extract the preflight backoff base delay into a named constant.
@Moyasee

Moyasee commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

Addressed both points in 58432fe:

  • isRetryableDownloadError now early-returns false when an error is explicitly tagged retryable: false, so the message-fragment fallback can't override a permanent failure (added a unit test for it).
  • Pulled the preflight backoff delay out into PREFLIGHT_RETRY_BASE_DELAY_MS.

@greptile review

A tester hit a GoFile CDN that rate-limited every request (HTTP 429). The
transient-status retry then kept retrying for about 2.5 minutes while the
UI still showed an active download, which looked like a stall.

Give transient HTTP statuses (429/503/etc.) a short dedicated retry budget
separate from the network/stall budget, so a rate-limited or temporarily
unavailable server fails within a few seconds with a clear message and the
queue can move on instead of sitting on a dead link. Honor the Retry-After
header (delta-seconds or HTTP-date, capped) when the server sends one.
@Moyasee

Moyasee commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

Investigated koddy's tester logs (reported stall).

Root cause: it was not a mid-stream stall. He repeatedly hit GoFile, which was failing for him (official 403, alternate CDN gofilecdn.eu.cc returning 400/429). On the 429s, the new transient-status retry kept retrying for ~2.5 minutes (1s -> 15s backoff x10) while the download still showed as active, so it looked frozen. His other downloads (Celeste, The Forest on other hosts) completed cleanly, no actual stall.

Fix in 1d705bf: transient HTTP statuses (429/503/etc.) now get a short dedicated retry budget separate from network/stall retries, so a hard rate-limited or unavailable link fails within a few seconds with a clear "rate-limiting or temporarily unavailable" message and the queue moves on, instead of masquerading as a stalled download. Also honors Retry-After (seconds or HTTP-date, capped). Added unit + integration coverage (persistent-429 now fails bounded/fast; 429-then-200 still recovers; 403 still fails fast).

Follow-up worth considering separately (not in this PR, it is in the GoFile resolver, not the downloader): when the alternate GoFile CDN returns 4xx/429, fall back to the official GoFile source instead of dead-ending on the bypass CDN.

@Doobies

Doobies commented Jun 24, 2026

Copy link
Copy Markdown

Investigated koddy's tester logs (reported stall).

Root cause: it was not a mid-stream stall. He repeatedly hit GoFile, which was failing for him

i did nothing! I either clicked on the black space on the Hydra app because I clicked another app or alt-tabbed back into Hydra. When i did hit the gofile was to make sure TorBox caught the hook. This didn't happen because I went back into it and redownloaded the same file. The doubling of the GBs happened while I was watching it and didn't have any peripherals in my hands

@Moyasee

Moyasee commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

Investigated koddy's tester logs (reported stall).

Root cause: it was not a mid-stream stall. He repeatedly hit GoFile, which was failing for him

i did nothing! I either clicked on the black space on the Hydra app because I clicked another app or alt-tabbed back into Hydra. When i did hit the gofile was to make sure TorBox caught the hook. This didn't happen because I went back into it and redownloaded the same file. The doubling of the GBs happened while I was watching it and didn't have any peripherals in my hands

Hey! This text you are quoting is not related to your issue, no worries
But the issue that you reported was also fixed in this PR

Moyasee added 4 commits June 26, 2026 17:35
Detect connectivity changes from the renderer (online/offline plus
NetworkInformation change events) and from powerMonitor wake, and route
them through the download orchestrator. On a network switch or reconnect
the active HTTP download aborts the half-open socket and resumes from the
on-disk offset instead of waiting out the 30s stall timeout.

A real disconnect surfaces a Reconnecting indicator in the bottom bar,
downloads page and big picture, and after a 15s grace window with no
connection the download is paused with its partial file kept.
Extract the retry-budget reset and transient-status handling out of the
JS downloader error handler, and split preflight validation into header
building and a per-attempt helper, keeping each function under the
cognitive-complexity limit. Drop a negated ternary and use globalThis in
the preload network listeners.
A network change usually surfaces as a dropped socket or a stall, both
of which recover through the generic retry branch. That branch resumed
without marking the download as reconnecting, so the indicator only
appeared on the connectivity-event path and users saw the download jump
straight to recalculating the ETA. Mark the download reconnecting (and
zero the speed) on any network retry, and clear it on a terminal error.
@sonarqubecloud

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Downloads adding gigs to game

2 participants