Sparse file support#2727
Conversation
- Add ChunkType::SPAR constant for sparse file support - Create SparseMap and DataRegion structs with from_bytes/to_bytes - Add sparse_map field to NormalEntry struct - Parse SPAR chunk in NormalEntry::TryFrom, reject duplicates - Add sparse_map() accessor method - Update all NormalEntry From implementations
- Emit SPAR chunk in into_chunks() and chunks_write_in() - Add sparse_map field and set_sparse_map() to EntryBuilder - Add validation: data_size must match written bytes - Update Key Types documentation in lib.rs - Add round-trip and validation tests
Add --sparse/--no-sparse options to create command for detecting and preserving sparse files. Sparse files are stored efficiently by only archiving data regions and recording hole positions in SPAR chunk. - Add detect_sparse_map() using SEEK_HOLE/SEEK_DATA on Unix - Add write_sparse_from_path() to write only data regions - Add restore_sparse_file() to recreate sparse files on extraction - Options are unstable and require --unstable flag
- Use saturating_add in DataRegion::end() to prevent integer overflow - Use chunked I/O (64KB) in write_sparse_from_path() to prevent memory exhaustion with large regions - Use u64 for remaining bytes in restore_sparse_file() to prevent truncation on 32-bit platforms - Check lseek return value when restoring file position after detection - Always validate SparseMap invariants in new(), not just debug builds
Clarify that raw_file_size represents data size (bytes in FDAT), not logical file size. For sparse files, use sparse_map.logical_size() to get the original file size.
- Add defensive check for hole_start < data_start in sparse detection - Use checked arithmetic in SparseMap::data_size() to detect overflow - Add data size validation with improved error messages in extract - Add integration tests for all-hole, multi-region, trailing/leading hole patterns - Fix invalid --sparse flag usage in extract test commands
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the
📝 WalkthroughWalkthroughThis pull request introduces comprehensive sparse file support to the PNA archive format, enabling detection, preservation, and restoration of sparse files. Changes span CLI flag parsing, archive format extensions (new SPAR chunk type), entry metadata structures, extraction logic, and cross-platform detection utilities. Changes
Sequence DiagramsequenceDiagram
participant User as User/CLI
participant Detect as detect_sparse_map()
participant Arch as Archive Builder
participant Entry as EntryBuilder
participant Serialize as Serialization
participant Extract as Extraction
participant File as Output File
User->>Detect: File to archive
Detect->>Detect: SEEK_DATA/SEEK_HOLE or st_blocks heuristic
Detect-->>Arch: SparseMap or None
alt Sparse Detected
Arch->>Entry: set_sparse_map(SparseMap)
Entry->>Serialize: sparse_map present
Serialize->>Serialize: Emit SPAR chunk
else Not Sparse
Arch->>Entry: sparse_map = None
Entry->>Serialize: Normal write path
end
Serialize-->>User: Archive with SPAR metadata
User->>Extract: Extract with sparse map
Extract->>File: restore_sparse_file()
File->>File: Write data regions at offsets
File->>File: Set logical size
File-->>User: Sparse file restored
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @ChanTsune, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a crucial feature for handling sparse files, enabling the archiving tool to store and restore them efficiently. By intelligently identifying and omitting large blocks of zeros, the tool can create smaller archives and improve performance, especially for files common in virtual machine images or databases. The implementation includes new command-line options, a dedicated chunk type for sparse metadata, and robust detection and restoration logic, initially focusing on Unix-like systems. Highlights
Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces comprehensive support for sparse files, a significant and well-implemented feature. The changes span from the core library, with the new SPAR chunk type and updated entry structures, to the CLI with new commands and OS-specific detection logic. The implementation is robust, and the addition of extensive integration tests covering various sparse file scenarios is particularly commendable. I have one suggestion to make the code more idiomatic.
| const CHUNK_SIZE: usize = 64 * 1024; | ||
| let mut buf = vec![0u8; CHUNK_SIZE]; | ||
| for region in sparse_map.regions() { | ||
| file.seek(io::SeekFrom::Start(region.offset()))?; | ||
| let mut remaining = region.size(); | ||
| while remaining > 0 { | ||
| let to_read = (remaining as usize).min(CHUNK_SIZE); | ||
| file.read_exact(&mut buf[..to_read])?; | ||
| entry.write_all(&buf[..to_read])?; | ||
| remaining -= to_read as u64; | ||
| } | ||
| } |
There was a problem hiding this comment.
The manual chunked reading loop for writing sparse file data regions can be simplified by using std::io::copy with a take adapter. This is more idiomatic and removes the need for manual buffer management, improving readability and maintainability. It's also important to check that the number of copied bytes matches the expected region size to handle cases where the source file might be modified during the operation.
| const CHUNK_SIZE: usize = 64 * 1024; | |
| let mut buf = vec![0u8; CHUNK_SIZE]; | |
| for region in sparse_map.regions() { | |
| file.seek(io::SeekFrom::Start(region.offset()))?; | |
| let mut remaining = region.size(); | |
| while remaining > 0 { | |
| let to_read = (remaining as usize).min(CHUNK_SIZE); | |
| file.read_exact(&mut buf[..to_read])?; | |
| entry.write_all(&buf[..to_read])?; | |
| remaining -= to_read as u64; | |
| } | |
| } | |
| for region in sparse_map.regions() { | |
| file.seek(io::SeekFrom::Start(region.offset()))?; | |
| let mut limited_reader = (&mut file).take(region.size()); | |
| let copied = io::copy(&mut limited_reader, entry)?; | |
| if copied != region.size() { | |
| return Err(io::Error::new( | |
| io::ErrorKind::UnexpectedEof, | |
| format!( | |
| "failed to read whole sparse region at offset {}: expected {} bytes, got {}", | |
| region.offset(), | |
| region.size(), | |
| copied | |
| ), | |
| )); | |
| } | |
| } |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@cli/src/command/core.rs`:
- Around line 814-847: The loop in write_sparse_from_path casts remaining (u64)
to usize causing truncation on 32-bit targets; change the to_read calculation to
compute a u64-sized min like let to_read_u64 = remaining.min(CHUNK_SIZE as u64)
and then convert to usize (to_read = to_read_u64 as usize) before slicing buf
and calling file.read_exact and entry.write_all, ensuring you never cast the
full remaining directly and that buf[..to_read] uses the safely converted usize;
update references in write_sparse_from_path (CHUNK_SIZE, remaining, buf,
file.read_exact, entry.write_all) accordingly.
In `@cli/src/command/extract.rs`:
- Around line 919-924: The sparse-file extraction branch skips removing an
existing path before creating the file, which can fail for symlinks or other
edge cases when overwrite/remove_existing is enabled; update the sparse branch
(where sparse_map is Some and you call utils::fs::file_create, item.reader,
restore_sparse_file, restore_timestamps) to mirror the non-sparse branch by
checking remove_existing and calling the same removal helper used elsewhere
(e.g., the utils::fs::remove_* path removal function your codebase uses) on path
prior to calling utils::fs::file_create, then proceed to open the reader and
call restore_sparse_file and restore_timestamps as before.
- Around line 1348-1368: The loop currently converts remaining to usize with
(remaining as usize) which can truncate on 32-bit targets and cause an infinite
loop; change how to compute to_read by comparing remaining with CHUNK_SIZE in
u64 form (e.g. let to_read = std::cmp::min(remaining, CHUNK_SIZE as u64) as
usize) so the cast is safe, then use &mut buf[..to_read] in reader.read_exact
and file.write_all as before; update the loop around remaining/total_read (in
the function using region.size(), region.offset(), CHUNK_SIZE, buf,
reader.read_exact, and file.write_all) accordingly to avoid any u64→usize
truncation.
In `@cli/tests/cli/create/sparse.rs`:
- Around line 6-11: The test module file sparse.rs is present but not declared
in the parent test module, causing it to be excluded; either add a Unix-gated
module declaration in the parent (e.g., add #[cfg(unix)] mod sparse; to
cli/tests/cli/create.rs to include the file that uses
std::os::unix::fs::MetadataExt) or delete sparse.rs if the test is
obsolete—update the parent file to reference the module name sparse so the test
is compiled only on Unix.
🧹 Nitpick comments (7)
cli/src/command/stdio.rs (1)
954-954: Sparse is hardcoded tofalseacross all stdio paths — consider future--sparseflag.The
sparse: falseinitialization is consistent acrossrun_create_archive(Line 954),run_append(Line 1262), andrun_update(Line 1417). However,StdioCommandhas no--sparseCLI argument, so users of the stdio/bsdtar-compatible interface cannot opt into sparse support at all.If this is intentional for incremental rollout, a brief comment explaining why stdio mode doesn't support
--sparseyet would help future contributors.lib/src/chunk/types.rs (1)
333-342: Test comment is slightly misleading on the "Reserved" property.The comment says
Reserved (A=uppercase)which could be misread as "the reserved bit IS set". In reality, uppercase 3rd byte means the reserved bit is clear (valid). The assertion!is_set_reserved()is correct.📝 Suggested comment clarification
fn spar_chunk_properties() { - // SPAR: Critical (S=uppercase), Public (P=uppercase), - // Reserved (A=uppercase), Unsafe-to-copy (R=uppercase) + // SPAR: Critical (S=uppercase), Public (P=uppercase), + // Reserved-bit-clear (A=uppercase), Unsafe-to-copy (R=uppercase) assert!(ChunkType::SPAR.is_critical());cli/tests/cli/create/sparse.rs (1)
14-14: Prefer&Pathover&PathBufin function signatures.All helper functions take
&PathBufinstead of the more idiomatic&Path. Clippy'sclippy::ptr_arglint flags this because&PathBufauto-derefs to&Path, and accepting the broader type is more flexible.Example fix for one function (apply same pattern to all five)
-fn create_sparse_file(path: &PathBuf) -> bool { +fn create_sparse_file(path: &Path) -> bool {Also applies to: 104-104, 206-206, 346-346, 439-439
cli/src/command/extract.rs (1)
1371-1376: Consider promotingdebug_assert_eq!to a hard check.The
debug_assert_eq!on total bytes read vs. expected data size is stripped in release builds. Since a mismatch here indicates archive corruption or a bug in the read loop, a hard error would prevent silent data corruption in production.Proposed fix
- debug_assert_eq!( - total_read, expected_data_size, - "Sparse data size mismatch: read {}, expected {}", - total_read, expected_data_size - ); + if total_read != expected_data_size { + return Err(io::Error::new( + io::ErrorKind::InvalidData, + format!( + "Sparse data size mismatch: read {} bytes, expected {}", + total_read, expected_data_size + ), + )); + }cli/src/command/core.rs (1)
824-846: Non-sparse files are opened twice when--sparseis enabled.
detect_sparse_mapopens and inspects the file, then the fallback at Line 845 callswrite_from_path, which opens the same file again. This is minor (one extraopensyscall per non-sparse file), but could be avoided by reusing the already-open file handle.lib/src/entry/sparse.rs (1)
42-49:saturating_addinend()silently caps atu64::MAXon overflow.If
offset + sizewould exceedu64::MAX,end()returnsu64::MAXinstead of the true value. This meansvalidate_regions's checklast.end() <= logical_sizecould pass for a region that genuinely exceeds bounds (e.g.,offset = u64::MAX - 5, size = 10, logical_size = u64::MAX). In practice this is extremely unlikely, but you may want to add a checked-add validation invalidate_regionsto reject such edge cases.cli/src/utils/sparse.rs (1)
119-125: Consider also matchingENOSYSas unsupported.FUSE filesystems return
ENOSYSwhen they don't supportlseekoperations forSEEK_HOLE/SEEK_DATA. Without this match, the error would propagate instead of falling back to thest_blocksheuristic. This is a real scenario with filesystem implementations like sshfs and rclone.Proposed fix
fn is_seek_hole_unsupported(err: &io::Error) -> bool { matches!( err.raw_os_error(), - Some(libc::EOPNOTSUPP) | Some(libc::EINVAL) + Some(libc::EOPNOTSUPP) | Some(libc::EINVAL) | Some(libc::ENOSYS) ) }
| /// Writes file data from a path, detecting and preserving sparse regions. | ||
| /// | ||
| /// If the file is sparse, only data regions are written and the sparse map is set on the entry. | ||
| /// If the file is not sparse, falls back to normal write behavior. | ||
| pub(crate) fn write_sparse_from_path( | ||
| entry: &mut EntryBuilder, | ||
| path: impl AsRef<Path>, | ||
| ) -> io::Result<()> { | ||
| use io::Seek; | ||
|
|
||
| let path = path.as_ref(); | ||
| let mut file = fs::File::open(path)?; | ||
|
|
||
| if let Some(sparse_map) = detect_sparse_map(&file)? { | ||
| // Write only data regions using chunked I/O to avoid memory exhaustion | ||
| const CHUNK_SIZE: usize = 64 * 1024; | ||
| let mut buf = vec![0u8; CHUNK_SIZE]; | ||
| for region in sparse_map.regions() { | ||
| file.seek(io::SeekFrom::Start(region.offset()))?; | ||
| let mut remaining = region.size(); | ||
| while remaining > 0 { | ||
| let to_read = (remaining as usize).min(CHUNK_SIZE); | ||
| file.read_exact(&mut buf[..to_read])?; | ||
| entry.write_all(&buf[..to_read])?; | ||
| remaining -= to_read as u64; | ||
| } | ||
| } | ||
| entry.set_sparse_map(sparse_map); | ||
| Ok(()) | ||
| } else { | ||
| // Not sparse, use normal write | ||
| write_from_path(entry, path) | ||
| } | ||
| } |
There was a problem hiding this comment.
Same u64 → usize truncation risk as in restore_sparse_file.
Line 835 has the same (remaining as usize).min(CHUNK_SIZE) pattern that can loop infinitely on 32-bit targets when remaining > u32::MAX. Apply the same fix here.
Proposed fix
- let to_read = (remaining as usize).min(CHUNK_SIZE);
+ let to_read = usize::try_from(remaining).unwrap_or(CHUNK_SIZE).min(CHUNK_SIZE);🤖 Prompt for AI Agents
In `@cli/src/command/core.rs` around lines 814 - 847, The loop in
write_sparse_from_path casts remaining (u64) to usize causing truncation on
32-bit targets; change the to_read calculation to compute a u64-sized min like
let to_read_u64 = remaining.min(CHUNK_SIZE as u64) and then convert to usize
(to_read = to_read_u64 as usize) before slicing buf and calling file.read_exact
and entry.write_all, ensuring you never cast the full remaining directly and
that buf[..to_read] uses the safely converted usize; update references in
write_sparse_from_path (CHUNK_SIZE, remaining, buf, file.read_exact,
entry.write_all) accordingly.
| } else if let Some(sparse_map) = sparse_map { | ||
| // Sparse file restoration - write data regions at correct offsets | ||
| let mut file = utils::fs::file_create(&path, remove_existing)?; | ||
| let mut reader = item.reader(ReadOptions::with_password(password))?; | ||
| restore_sparse_file(&mut file, &mut reader, sparse_map)?; | ||
| restore_timestamps(&mut file, item.metadata(), keep_options)?; |
There was a problem hiding this comment.
Missing explicit file removal before creation in the sparse path.
The non-sparse, non-safe-writes branch (Line 926-929) explicitly removes the existing path before calling file_create when remove_existing is true. The sparse branch here skips that step. If file_create doesn't handle all edge cases (e.g., existing symlinks), this could cause extraction failures under --overwrite.
Proposed fix
} else if let Some(sparse_map) = sparse_map {
// Sparse file restoration - write data regions at correct offsets
+ if remove_existing {
+ utils::io::ignore_not_found(utils::fs::remove_path(&path))?;
+ }
let mut file = utils::fs::file_create(&path, remove_existing)?;📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| } else if let Some(sparse_map) = sparse_map { | |
| // Sparse file restoration - write data regions at correct offsets | |
| let mut file = utils::fs::file_create(&path, remove_existing)?; | |
| let mut reader = item.reader(ReadOptions::with_password(password))?; | |
| restore_sparse_file(&mut file, &mut reader, sparse_map)?; | |
| restore_timestamps(&mut file, item.metadata(), keep_options)?; | |
| } else if let Some(sparse_map) = sparse_map { | |
| // Sparse file restoration - write data regions at correct offsets | |
| if remove_existing { | |
| utils::io::ignore_not_found(utils::fs::remove_path(&path))?; | |
| } | |
| let mut file = utils::fs::file_create(&path, remove_existing)?; | |
| let mut reader = item.reader(ReadOptions::with_password(password))?; | |
| restore_sparse_file(&mut file, &mut reader, sparse_map)?; | |
| restore_timestamps(&mut file, item.metadata(), keep_options)?; |
🤖 Prompt for AI Agents
In `@cli/src/command/extract.rs` around lines 919 - 924, The sparse-file
extraction branch skips removing an existing path before creating the file,
which can fail for symlinks or other edge cases when overwrite/remove_existing
is enabled; update the sparse branch (where sparse_map is Some and you call
utils::fs::file_create, item.reader, restore_sparse_file, restore_timestamps) to
mirror the non-sparse branch by checking remove_existing and calling the same
removal helper used elsewhere (e.g., the utils::fs::remove_* path removal
function your codebase uses) on path prior to calling utils::fs::file_create,
then proceed to open the reader and call restore_sparse_file and
restore_timestamps as before.
| file.seek(io::SeekFrom::Start(region.offset()))?; | ||
| let mut remaining = region.size(); | ||
| while remaining > 0 { | ||
| let to_read = (remaining as usize).min(CHUNK_SIZE); | ||
| reader.read_exact(&mut buf[..to_read]).map_err(|e| { | ||
| io::Error::new( | ||
| e.kind(), | ||
| format!( | ||
| "Failed to read sparse data at offset {}: {} \ | ||
| (expected {} bytes total, read {} so far)", | ||
| region.offset(), | ||
| e, | ||
| expected_data_size, | ||
| total_read | ||
| ), | ||
| ) | ||
| })?; | ||
| file.write_all(&buf[..to_read])?; | ||
| remaining -= to_read as u64; | ||
| total_read += to_read as u64; | ||
| } |
There was a problem hiding this comment.
Potential infinite loop on 32-bit platforms due to u64 → usize truncation.
On a 32-bit target, remaining as usize (Line 1351) silently truncates values above u32::MAX. If the truncated result is 0, the loop will never decrease remaining, spinning forever. While 32-bit is unlikely for large sparse regions, this is a correctness gap.
Proposed fix using safe conversion
while remaining > 0 {
- let to_read = (remaining as usize).min(CHUNK_SIZE);
+ let to_read = usize::try_from(remaining).unwrap_or(CHUNK_SIZE).min(CHUNK_SIZE);🤖 Prompt for AI Agents
In `@cli/src/command/extract.rs` around lines 1348 - 1368, The loop currently
converts remaining to usize with (remaining as usize) which can truncate on
32-bit targets and cause an infinite loop; change how to compute to_read by
comparing remaining with CHUNK_SIZE in u64 form (e.g. let to_read =
std::cmp::min(remaining, CHUNK_SIZE as u64) as usize) so the cast is safe, then
use &mut buf[..to_read] in reader.read_exact and file.write_all as before;
update the loop around remaining/total_read (in the function using
region.size(), region.offset(), CHUNK_SIZE, buf, reader.read_exact, and
file.write_all) accordingly to avoid any u64→usize truncation.
| use std::{ | ||
| fs::{self, File}, | ||
| io::{Read, Seek, SeekFrom, Write}, | ||
| os::unix::fs::MetadataExt, | ||
| path::PathBuf, | ||
| }; |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check for mod.rs in cli/tests/cli/create to see if sparse is gated
fd -t f 'mod.rs' 'cli/tests/cli/create' --exec cat {}
echo "---"
# Search for mod sparse declarations with context
rg -n 'mod sparse' --type rust 'cli/tests'Repository: ChanTsune/Portable-Network-Archive
Length of output: 81
🏁 Script executed:
#!/bin/bash
# List files in the directory containing sparse.rs
ls -la cli/tests/cli/create/
echo "---"
# Search for sparse.rs location
fd sparse.rs
echo "---"
# Search for mod sparse declarations more broadly
rg 'mod sparse' --type rust
echo "---"
# Check if sparse.rs exists and its parent structure
fd sparse.rs --exec head -20 {}Repository: ChanTsune/Portable-Network-Archive
Length of output: 4531
🏁 Script executed:
#!/bin/bash
# Look for mod.rs in parent directories
fd -t f 'mod.rs' 'cli/tests'
echo "---"
# Check if there's a parent file that includes these tests
rg -n 'sparse' 'cli/tests/cli.rs' --type rust
echo "---"
# Search for common module declaration patterns for the tests
rg -n 'mod (atime|ctime|sparse)' 'cli/tests' --type rust -B2 -A2Repository: ChanTsune/Portable-Network-Archive
Length of output: 497
🏁 Script executed:
#!/bin/bash
# Read cli/tests/cli/create.rs to find sparse module declaration
cat cli/tests/cli/create.rs | head -50Repository: ChanTsune/Portable-Network-Archive
Length of output: 854
🏁 Script executed:
#!/bin/bash
# Search for sparse module declaration specifically
rg -n 'mod sparse' cli/tests/cli/create.rs --type rust -B1 -A1Repository: ChanTsune/Portable-Network-Archive
Length of output: 60
🏁 Script executed:
#!/bin/bash
# Get the full list of modules in cli/tests/cli/create.rs
cat cli/tests/cli/create.rsRepository: ChanTsune/Portable-Network-Archive
Length of output: 854
Remove or add the sparse test module to the parent.
The sparse.rs file exists in cli/tests/cli/create/ but is not declared in cli/tests/cli/create.rs. Either add #[cfg(unix)] mod sparse; to include it (since it requires Unix-only APIs like std::os::unix::fs::MetadataExt), or remove the file if it's no longer needed.
🤖 Prompt for AI Agents
In `@cli/tests/cli/create/sparse.rs` around lines 6 - 11, The test module file
sparse.rs is present but not declared in the parent test module, causing it to
be excluded; either add a Unix-gated module declaration in the parent (e.g., add
#[cfg(unix)] mod sparse; to cli/tests/cli/create.rs to include the file that
uses std::os::unix::fs::MetadataExt) or delete sparse.rs if the test is
obsolete—update the parent file to reference the module name sparse so the test
is compiled only on Unix.
The sparse.rs test file was never compiled or executed because `mod sparse;` was missing from the parent module create.rs.
Summary by CodeRabbit
Release Notes
New Features
--sparseand--no-sparsecommand-line flags (unstable)Bug Fixes