fix(formdata): multipart parser corrupts parts located past 4 GiB#30609
fix(formdata): multipart parser corrupts parts located past 4 GiB#30609robobun wants to merge 6 commits into
Conversation
WalkthroughFormData multipart parsing switches from buffer-offset ChangesFormData multipart parsing with raw slices and large-offset support
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/runtime/webcore/FormData.zig`:
- Line 332: The parsing loop in FormData.zig currently exits early when both
filename and name are found (condition using header_chunk.len > 0 and (filename
== null or name.len == 0)), which skips subsequent headers like Content-Type and
leaves field.content_type empty; update the loop(s) (the header parsing while
using header_chunk, filename, name and the similar blocks around the other
occurrences) to continue scanning all header lines until header_chunk.len == 0
(i.e., remove the early-exit dependency on filename/name), and ensure that when
a "Content-Type" header is encountered you set field.content_type appropriately
so file parts retain their content type (apply same change to the other similar
blocks referenced).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 605edc8f-a6a8-4e88-9564-83e94b6b122e
📒 Files selected for processing (3)
src/jsc/array_buffer.zigsrc/runtime/webcore/FormData.zigtest/js/web/html/FormData.test.ts
|
This PR may be a duplicate of:
🤖 Generated with Claude Code |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/js/web/html/FormData.test.ts`:
- Around line 2-3: The test currently uses os.totalmem() to decide skipping
which checks installed RAM; change the skip gating to use the effective memory
ceiling instead (e.g., use v8.getHeapStatistics().heap_size_limit or process
resource limits or a new harness helper) so the FormData.test.ts skip is
deterministic under cgroup/job memory caps; locate the check that imports
totalmem and replace the logic that references totalmem with a call to
v8.getHeapStatistics().heap_size_limit (or the harness helper) and base the
skip/skipIf on that value (update any references to totalmem in the test file
accordingly).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 1c8fdad6-7d2c-4275-b642-19cc32e4e870
📒 Files selected for processing (1)
test/js/web/html/FormData.test.ts
There was a problem hiding this comment.
LGTM — mechanical bun.Semver.String → []const u8 widening that mirrors what Field.value already went through, with a gated regression test.
Extended reasoning...
Overview
This PR touches three files: src/runtime/webcore/FormData.zig (multipart parser), src/jsc/array_buffer.zig (one-line cast removal), and test/js/web/html/FormData.test.ts (new regression test). The core change converts Field.filename, Field.content_type, and the local name from bun.Semver.String (which packs offset/length into u32 with a tag bit) to raw []const u8 slices, matching what was already done for Field.value in #27483. The subslicer, the buf callback parameter, and the unused string type alias are dropped accordingly. The ArrayBuffer.fromBytes change removes a vestigial @as(u32, @intCast(...)) whose targets (.len, .byte_len) are already usize.
Security risks
The multipart parser handles untrusted request bodies, but this change is strictly a type widening — slices that previously round-tripped through a lossy u32 offset/length encoding now stay as native fat pointers into the same input buffer. No new parsing logic, no new indexing arithmetic, no allocation changes. The fromBytes change removes a debug-build panic on >4 GiB inputs without altering release behavior. I see no injection, OOB, or auth surface introduced.
Level of scrutiny
Low–medium. The transformation is mechanical (.isEmpty() → .len == 0, .len() → .len, .slice(buf) → direct slice, subslicer.sub(x).value() → x) and follows the exact pattern already applied to Field.value. Grep confirms forEachMultipartEntry and FormData.Field are only referenced within FormData.zig, so the callback signature change is self-contained.
Other factors
The bug-hunting system found no issues. All inline review threads are resolved: the os.totalmem() gating concern was addressed in 8de77d5 with process.constrainedMemory(), and both CodeRabbit and my own note about the pre-existing Content-Type early-exit on line 332 were acknowledged as out of scope (the diff there is purely .len() → .len). The new test is appropriately gated (skipped on Windows / <16 GiB) and runs in a subprocess. The musl build failures in CI appear to be infrastructure-related rather than caused by this diff. No CODEOWNERS apply.
CI statusThe diff is green. Every red lane is fleet-wide flake or infra, none intersecting the multipart parser. Build #60818 (8500af0, current HEAD — rebased onto main after #31783)65 lanes pass, cargo clippy passes. Red lanes:
Previous builds (#57552: 71 pass / 0 test failures; earlier ones documented in edit history) show the same pattern: the diff's own tests pass everywhere they run. The PR touches Ready for merge — needs a maintainer to retry the flaky lanes or override. |
|
Rebased onto main (post-Rust-rewrite) and ported:
Gate verified: without the (Unrelated: |
1dc0c33 to
31d5a69
Compare
There was a problem hiding this comment.
No issues found and all prior feedback addressed — the fix is a clean mechanical swap from bun.Semver.String to native slices following the pattern Field.value already uses. Deferring to a maintainer only because this touches the multipart body parser (untrusted-input path) across both the Zig reference and the freshly-ported Rust implementation, and the latter adds new unsafe raw-pointer derefs (same contract as the pre-existing field.value, but worth a human glance).
Extended reasoning...
Overview
This PR fixes #21490: req.formData() corrupts part names/filenames for any multipart part whose header sits past byte 2³² in the body. Root cause is that forEachMultipartEntry stored name/filename/content_type as bun.Semver.String, which packs offset/length into a u32/u31 pair and silently wraps. The fix swaps these to raw []const u8 (Zig) / *const [u8] (Rust) slices into the input buffer — exactly the pattern Field.value already uses since #27483. It also drops a leftover u32 round-trip cast in ArrayBuffer.fromBytes/from_bytes whose target fields are already usize. A new memory-gated, Windows-skipped regression test in FormData.test.ts constructs a 4 GiB + 256 B body and asserts the trailing parts round-trip.
Files: src/runtime/webcore/FormData.{zig,rs}, src/jsc/array_buffer.{zig,rs}, test/js/web/html/FormData.test.ts.
Security risks
The multipart parser handles untrusted HTTP request bodies, so this is a security-adjacent code path. However, the change does not alter parsing logic, boundary handling, or header tokenization — it only changes how already-parsed subslices are stored (native fat pointer instead of u32-packed offset). The fix strictly removes a silent integer truncation; it cannot introduce new OOB reads because the slices are direct subslices of input produced by the same indexing as before. The Rust side's new unsafe { &*field.content_type } / &*field.filename derefs follow the identical SAFETY contract already in place for field.value (always either the b"" static or a subslice of the caller-owned input, valid for the iterator call). No auth/crypto/permissions code is touched.
Level of scrutiny
Moderate. The transformation is mechanical and the diff is small, but it spans (a) a parser on the untrusted-input path and (b) two parallel implementations — the Zig reference and the recently-landed Rust port. The Rust change adds three unsafe blocks. While each mirrors the pre-existing field.value: *const [u8] pattern and the TODO(port): lifetime comment explains why &'a [u8] isn't used yet, I'd prefer a maintainer confirm the Rust lifetime story rather than rubber-stamp it.
Other factors
forEachMultipartEntry/for_each_multipart_entryhave no external callers (grep confirms both are file-local), so the iterator signature change is self-contained.- No CODEOWNERS apply to the touched paths.
- All prior review feedback (CodeRabbit's cgroup-memory gate, my
require("os")→ static import nit, the pre-existing Content-Type early-exit) has been addressed or explicitly scoped out, and all inline threads are resolved. - CI was green on the previous push except for an unrelated fleet-wide Windows
test-http-should-emit-close-when-connection-is-abortedflake; build #54342 for the latest commit is in progress. - The bug-hunting system found no issues on the current revision.
d11a1e1 to
fdb922c
Compare
There was a problem hiding this comment.
LGTM — mechanical Semver.String → raw-slice swap that mirrors what #27483 already did for Field.value, plus a dead u32 cast removal; all prior nits addressed.
Extended reasoning...
Overview
Touches src/runtime/webcore/FormData.{zig,rs} (multipart parser), src/jsc/array_buffer.{zig,rs}, and test/js/web/html/FormData.test.ts. The core change converts Field.{filename,content_type} and the per-part name from bun.Semver.String / bun_semver::String (which packs offset/len as u32 with bit 63 stolen as a tag) to raw []const u8 / *const [u8] slices into the input buffer. The iterator callback signature drops the now-unused buf back-reference and SlicedString subslicer. The array_buffer change removes a vestigial @as(u32, @intCast(...)) / u32::try_from().expect() round-trip on fields that were already usize. A memory-gated, subprocess-isolated regression test exercises a 4 GiB+ multipart body.
Security risks
None introduced. The multipart parser does process untrusted HTTP bodies, but this change doesn't alter parsing logic or bounds computation — it only changes how already-computed subslices are stored (native fat pointer instead of a packed u32 offset/len that silently wrapped at 4 GiB). The slices point into the same caller-owned buffer with identical bounds; the only behavioral delta is that offsets past 2³² no longer truncate. The Rust *const [u8] + unsafe { &* } pattern is the same idiom Field.value already uses (ported in #27483), so no new unsafe surface is being introduced — two more fields adopt an existing, documented invariant.
Level of scrutiny
Low–medium. The transformation is mechanical and follows an established precedent (Field.value underwent the identical conversion in #27483). I verified via grep that forEachMultipartEntry / for_each_multipart_entry and the Field struct have no consumers outside these two files, so the signature change is fully self-contained. The array_buffer hunk is a strict simplification (the target fields are usize; the cast was a debug-build panic waiting to happen and a no-op in release). No CODEOWNERS cover these paths.
Other factors
All review threads are resolved: my pre-existing Content-Type early-exit note was acknowledged as out of scope; CodeRabbit's cgroup-memory gate suggestion was applied (process.constrainedMemory?.() || totalmem()); my require("os") → static-import nit was fixed in fdb922c. The bug-hunting system found nothing this round. CI on the prior HEAD was green modulo a fleet-wide fetch-tcp-keepalive flake unrelated to this diff, and the new regression test passed on every lane where it ran.
fdb922c to
28f675b
Compare
There was a problem hiding this comment.
LGTM — mechanical Semver.String → raw slice swap matching the existing Field.value pattern; all prior feedback addressed.
Extended reasoning...
Overview
This PR fixes #21490 by changing how the multipart FormData parser stores part metadata. forEachMultipartEntry previously stored name, filename, and content_type as bun.Semver.String (which packs offset/length into u32 fields, wrapping at 4 GiB), and now stores them as raw []const u8 / *const [u8] slices into the input buffer — exactly the same representation Field.value already uses since #27483. The buf callback parameter and SlicedString subslicer become unnecessary and are dropped. A separate one-line change in array_buffer.{zig,rs} removes a vestigial u32 round-trip on fields that are already usize. A new regression test in FormData.test.ts constructs a >4 GiB multipart body in a subprocess and asserts trailing parts round-trip correctly; it's gated on process.constrainedMemory()/totalmem() ≥ 16 GiB and skipped on Windows.
Security risks
None introduced. The multipart parser handles untrusted HTTP bodies, but this change strictly removes a lossy u32 packing step — the same subslices of the same caller-owned input buffer flow through the same logic, just without offset truncation. No new dereferences of attacker-controlled offsets; the Rust unsafe { &*field.filename } pattern is identical to the pre-existing unsafe { &*field.value } pattern with the same SAFETY invariant (points into the caller-owned input or b"").
Level of scrutiny
Low-to-medium. The diff is a mechanical type substitution (.slice(buf) → direct use, .isEmpty() → .len > 0, .len() → .len) following an already-landed pattern from #27483. forEachMultipartEntry / for_each_multipart_entry and FormData.Field have no callers outside the two modified files, so the signature change is fully self-contained. The array_buffer change is a strict simplification (removing @as(u32, @intCast(x)) / u32::try_from(x).expect() as usize where x is already usize and the target field is usize).
Other factors
All four review threads are resolved: the CodeRabbit loop-condition comment and my own pre-existing-bug note were correctly scoped out (the .len() → .len change is purely mechanical, the early-exit behavior is unchanged from main); the cgroup-memory gate was addressed with process.constrainedMemory?.(); and my require("os") nit was fixed with a static import. The author's CI summary reports the diff green across 65+ lanes with only a fleet-wide fetch-tcp-keepalive flake unrelated to this change. No CODEOWNERS cover these paths. The bug-hunting system found no issues.
28f675b to
65dce7d
Compare
There was a problem hiding this comment.
No issues found and both earlier nits are addressed — deferring only because this touches the multipart body parser (untrusted input) across both the Zig and Rust implementations, which I'd rather a human sign off on.
Extended reasoning...
Overview
This PR fixes #21490 by changing how forEachMultipartEntry stores part metadata (name, filename, content_type): from bun.Semver.String / bun_semver::String — which packs offset/length into 32-bit fields and silently wraps past 4 GiB — to native []const u8 / *const [u8] slices. It applies the same change in parallel to src/runtime/webcore/FormData.zig and src/runtime/webcore/FormData.rs, drops the now-unused subslicer/buf plumbing, removes a redundant u32 round-trip in ArrayBuffer.fromBytes / ArrayBuffer::from_bytes (target fields are already usize), and adds a memory-gated regression test in test/js/web/html/FormData.test.ts.
The transformation is mechanical and follows the exact pattern already applied to Field.value in #27483: subslicer.sub(x).value() → x, .len() → .len, .isEmpty() → .len == 0 / .is_empty(), .slice(buf) → direct use. The Rust side adds the required per-block SAFETY: comments for the *const [u8] derefs, and the Default impl keeps them initialised to b"" so they are always valid fat pointers. The control-character Content-Type guard from the recent hardening PR is preserved.
Security risks
Multipart parsing operates on untrusted HTTP request bodies, so it is security-adjacent. However, this PR does not introduce or alter any parsing logic, bounds checks, or allocations — it only widens the storage representation of already-computed subslices from u32-packed offsets to native fat pointers. The array_buffer change removes a debug-build panic on >4 GiB buffers without changing release behaviour. I see no new injection, OOB, or lifetime exposure; the Rust *const [u8] fields borrow into the caller-owned input for exactly the duration of the iterator callback, matching the existing Field.value contract.
Level of scrutiny
Medium-high. The change itself is locally verifiable and semantically equivalent modulo the bug being fixed, but it spans the production multipart parser in two coordinated language implementations and changes the public forEachMultipartEntry iterator signature. That combination is past my threshold for auto-approval of code on an untrusted-input path.
Other factors
The bug-hunting system found nothing. Both of my earlier inline comments (the pre-existing Content-Type early-exit, scoped out as orthogonal; and the require("os") → static import nit) are resolved, as is CodeRabbit's cgroup-aware memory-gate suggestion. CI is reported green across 71 lanes with only expired darwin agents outstanding. The new test is sensibly gated (process.constrainedMemory() with totalmem() fallback, skip on Windows) and runs in a subprocess. No CODEOWNERS match the touched paths.
The multipart parser stored part name, filename, and content-type as bun.Semver.String, which packs offset and length into 32-bit fields (with bit 31 of length stolen as a tag). For any part whose header sat past 4 GiB in the request body, the offset wrapped and the parser read garbage — field names came back as bytes from the middle of the preceding file body and the part was unreachable by name. This is the remaining half of the bug behind #21490. The file body itself (Field.value) was already switched to a raw slice in #27483, which fixed the 2 GiB length truncation, but name/filename/content_type were still u32-indexed. Switch all Field slices to raw []const u8 and drop the subslicer. Also remove a leftover @as(u32, @intcast(bytes.len)) in ArrayBuffer.fromBytes whose target fields are usize — it panicked on >4 GiB buffers in debug builds on the client-side FormData serialization path. Fixes #21490
os.totalmem() reports physical RAM, so a containerized runner with a lower cgroup cap would still try to run the ~10 GiB test and get OOM-killed. Use process.constrainedMemory() (cgroup-aware) with a fallback to totalmem().
Per test/CLAUDE.md, only use require() when testing require behavior.
65dce7d to
8500af0
Compare
There was a problem hiding this comment.
LGTM — mechanical type swap (Semver.String → raw slice) mirroring the existing Field.value pattern, plus a gated regression test; all prior review threads resolved and CI green.
Extended reasoning...
Overview
This PR fixes #21490 by changing FormData.Field.filename/content_type and the multipart iterator's name parameter from bun.Semver.String / bun_semver::String (which packs offset/length as u32 and silently wraps past 4 GiB) to native []const u8 / &[u8] slices. The same fix was already applied to Field.value in #27483; this completes it for the remaining three fields. It also drops a vestigial u32 round-trip in ArrayBuffer.fromBytes / ArrayBuffer::from_bytes where the target fields are already usize. A memory-gated subprocess regression test in FormData.test.ts verifies parts past the 4 GiB mark round-trip correctly.
Security risks
The multipart parser handles untrusted HTTP bodies, but this change adds no new parsing logic — it only changes how subslices of the input buffer are represented (native fat pointer instead of u32 offset+len packed into a u64). All slices still point into the same caller-owned input buffer with the same bounds; in the Rust port the borrow checker enforces this via the 'a lifetime on Field. The pre-existing printable-ASCII guard on content_type (header-injection hardening) is preserved verbatim. Net effect is a correctness improvement: the old code could hand back bytes from the wrong offset within the buffer.
Level of scrutiny
Moderate. The transformation is mechanical and follows the exact pattern already established for Field.value in the same struct: subslicer.sub(x).value() → x, .slice(buf) → direct use, .len()/.isEmpty() → .len/.is_empty(). forEachMultipartEntry/for_each_multipart_entry is only called from within the same file (verified via grep), so the iterator signature change has no external callers to break. The array_buffer change is a no-op simplification (removing u32::try_from(x).expect() as usize where x: usize).
Other factors
All prior review threads (CodeRabbit's Content-Type early-exit note, the cgroup-aware memory gate, and the static-import nit) are resolved — the early-exit is correctly identified as pre-existing and out of scope, and the test gate now uses process.constrainedMemory() with a totalmem() fallback. The bug-hunting pass found nothing. The robobun CI summary reports zero test failures on the current HEAD with only infra-expired darwin agents red. No CODEOWNERS cover these paths.
Problem
req.formData()on a multipart body larger than 4 GiB returns corrupted field names/filenames for any part whose header sits past the 4 GiB mark. The part is then unreachable by name:This is the remaining half of #21490. The original symptom there — a 3.5 GB file coming through as 1.5 GB (exactly 2³¹ bytes lost) — was caused by
Field.valuebeing stored asbun.Semver.String, and was fixed as a side effect of #27483 (landed in 1.3.11) whenvaluewas switched to a raw[]const u8slice.Root cause
forEachMultipartEntrystill storedname,filename, andcontent_typeasbun.Semver.String(Zig) /bun_semver::String(Rust), which packs a slice asPointer { off: u32, len: u32 }bitcast into au64with bit 63 used as a tag. That gives:len: effectivelyu31(bit 31 is the tag) → silent truncation at 2 GiBoff:u32via@truncate→ silent wrap at 4 GiBSo once a part's
Content-Dispositionheader sat past byte 2³², the offset wrapped andslice(buf)returned bytes from the middle of the preceding file body.Fix
Store
name,filename, andcontent_typeas plain slices into the input buffer, the same wayvaluealready is, in both the Zig reference implementation and the Rust port:FormData.zig:[]const u8FormData.rs:&'a [u8]on the existing lifetime-parameterizedField<'a>(no unsafe needed)The
subslicerand thebufcallback parameter are no longer needed.Also drop a leftover
u32cast inArrayBuffer.fromBytes/ArrayBuffer::from_bytes— the target fields are alreadyusize, so the cast just panicked on >4 GiB buffers in debug builds (hit on the client side when serializing a FormData containing a >4 GiBBun.file()).Rebase notes
FieldtoField<'a>withvalue: &'a [u8]on main, sofilename/content_typefollow the same shape and the interim raw-pointer +SAFETY:comments from earlier revisions are gone.FormData.test.ts(control-character Content-Type, duplicate numeric field names).Verification
New test in
FormData.test.tsconstructs a 4 GiB + 256 byte multipart body with a text field and a second file part after the big file, and asserts the trailing parts round-trip correctly. Spawned in a subprocess and skipped on Windows / machines with < 16 GiB effective memory (process.constrainedMemory()aware).["big_upload", "", ""]["big_upload", "description_field", "second_attachment"]get("description_field")null"this part lives past the 4 GiB mark"undefined"also_past_4gb.txt"Gate: test fails with
src/stashed, passes with the fix.cargo check/cargo clippy/cargo fmt --checkclean on the touched crates; the adjacent FormData tests (control characters, duplicate numeric names) pass.Fixes #21490
Related: #27483, #27443