sort: deduplicate file descriptors in merge mode#11961

Open
nonontb wants to merge 16 commits into
uutils:mainfrom
nonontb:feature/sort-input-fd-optimization

@nonontb
Contributor

@nonontb nonontb commented Apr 23, 2026

What This Does

This PR makes sort -m (merge mode) open fewer files, ideally the minimum number needed.

The Problem

Before:

If you ran sort -m file.txt file.txt file.txt, the program eagerly opened file.txt three times, once for each time it appeared on the command line.
With many duplicates or a tight system limit on open files, this could fail.

If you merged a file that was also the output file, the program had to create a temporary copy behind the scenes, which used one more file.

The GNU version has no problem running the test in #5714.

The Fix

Now the program opens each unique file only once, and lazily. It uses a memory map (via the memmap2 crate, which requires unsafe) so that a single FD serves all duplicates of an input file, including the case where the output file is reused as an input.
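
The deduplication idea can be illustrated with a minimal sketch. This is not the PR's code: the `UniqueInputs` type and its methods are hypothetical names, and the sketch holds a plain `std::fs::File` per unique path instead of a memmap2 mapping so that it stays standard-library only. It assumes every argument is a path to an existing file (no `-` for stdin).

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper (not from the PR): maps every command-line input to a
/// slot in a table of unique files, keyed by canonical path.
pub struct UniqueInputs {
    /// One entry per distinct file; the handle stays `None` until first use.
    files: Vec<(PathBuf, Option<File>)>,
    /// For each command-line argument, the index of its entry in `files`.
    slots: Vec<usize>,
}

impl UniqueInputs {
    pub fn new<P: AsRef<Path>>(args: &[P]) -> io::Result<Self> {
        let mut seen: HashMap<PathBuf, usize> = HashMap::new();
        let mut files = Vec::new();
        let mut slots = Vec::with_capacity(args.len());
        for arg in args {
            // canonicalize() resolves symlinks and relative components, so
            // `a.txt` and `./a.txt` end up with the same key.
            let key = arg.as_ref().canonicalize()?;
            let next = files.len();
            let idx = *seen.entry(key.clone()).or_insert(next);
            if idx == next {
                // First time this file is seen: reserve a slot, do not open yet.
                files.push((key, None));
            }
            slots.push(idx);
        }
        Ok(Self { files, slots })
    }

    /// Number of distinct files among the arguments.
    pub fn unique_count(&self) -> usize {
        self.files.len()
    }

    /// Number of files that have actually been opened so far.
    pub fn open_count(&self) -> usize {
        self.files.iter().filter(|(_, f)| f.is_some()).count()
    }

    /// Lazily open the file behind the `arg_index`-th argument.
    pub fn file(&mut self, arg_index: usize) -> io::Result<&File> {
        let (path, handle) = &mut self.files[self.slots[arg_index]];
        if handle.is_none() {
            *handle = Some(File::open(&*path)?);
        }
        Ok(handle.as_ref().expect("handle was just opened"))
    }
}
```

With arguments `a.txt a.txt b.txt a.txt`, this sketch would report two unique files and open each of them at most once, whichever argument position is read first.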

Result

Fixes #5714

@github-actions

github-actions Bot commented Apr 23, 2026

GNU testsuite comparison:

Skip an intermittent issue tests/tail/symlink (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/pr/bounded-memory (passes in this run but fails in the 'main' branch)
Note: The gnu test tests/misc/write-errors was skipped on 'main' but is now failing.

@codspeed-hq

codspeed-hq Bot commented Apr 23, 2026

Merging this PR will not alter performance

✅ 319 untouched benchmarks
⏩ 46 skipped benchmarks [1]


Comparing nonontb:feature/sort-input-fd-optimization (59e06a5) with main (7f00530)

Footnotes

  1. 46 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@@ -0,0 +1,3 @@
1
Contributor

please generate the files on the fly

Contributor Author

Done

@nonontb nonontb force-pushed the feature/sort-input-fd-optimization branch 5 times, most recently from a9f249e to 3517780 Compare April 26, 2026 08:14
@@ -0,0 +1,6 @@
1
Contributor

please generate this one on the fly too

Contributor Author

@nonontb nonontb Apr 26, 2026

Yes, I forgot to delete it, but the test already generates its input on the fly:

#[test]
fn test_merge_mixed_stdin_and_files() {
    let (at, mut ucmd) = at_and_ucmd!();
    at.write("merge_duplicates_1.txt", "1\n3\n5\n");
    // Verify that sort -m allows mixing stdin with files (GNU Coreutils compatible)
    ucmd.arg("-m")
        .arg("-")
        .arg("merge_duplicates_1.txt")
        .pipe_in("apricot\nelderberry\nkiwi\n")
        .succeeds()
        .stdout_is("1\n3\n5\napricot\nelderberry\nkiwi\n");
}

Comment thread src/uu/sort/src/merge.rs Outdated
// it gets opened for writing. This allows reading the original content
// via memory-map while writing to the same file, without needing a temp copy.
let output_as_input = if let Some(name) = output.as_output_name() {
let output_path = Path::new(name).canonicalize()?;
Contributor

maybe move this into a function?

Contributor Author

Done
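
One way the check discussed in this thread could look once factored out is sketched below. The function name `output_is_also_input` and its signature are assumptions for illustration, not the PR's actual code. Unlike the quoted fragment, the sketch treats a missing output file as "not an input" instead of returning an error, and it assumes all inputs are paths to existing files (no `-` for stdin).

```rust
use std::io;
use std::path::Path;

/// Hypothetical helper (name and signature are assumptions): returns true if
/// `output` refers to the same file as any of `inputs`.
pub fn output_is_also_input<P: AsRef<Path>>(output: &Path, inputs: &[P]) -> io::Result<bool> {
    // An output file that does not exist yet cannot be one of the inputs.
    let output = match output.canonicalize() {
        Ok(path) => path,
        Err(e) if e.kind() == io::ErrorKind::NotFound => return Ok(false),
        Err(e) => return Err(e),
    };
    for input in inputs {
        // Compare canonical paths so `./a.txt` and `a.txt` count as the same file.
        if input.as_ref().canonicalize()? == output {
            return Ok(true);
        }
    }
    Ok(false)
}
```

Comparing canonical paths does not detect hard links to the same inode; a stricter check would compare device and inode numbers, which is platform-specific and left out of this sketch.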

@sylvestre
Contributor

did you look if we have a benchmark covering this? thanks

@nonontb
Contributor Author

nonontb commented Apr 26, 2026

did you look if we have a benchmark covering this? thanks

I just did, and there seems to be no relevant benchmark in src/uu/sort/benches/sort_bench.rs, even though src/uu/sort/BENCHMARKING.md documents how to write one. Or am I missing something?

I suppose it would be better to add this benchmark in a separate issue, so that we have reference numbers before benchmarking this PR?

@nonontb
Contributor Author

nonontb commented Apr 26, 2026

did you look if we have a benchmark covering this? thanks

I just did, and there seems to be no relevant benchmark in src/uu/sort/benches/sort_bench.rs, even though src/uu/sort/BENCHMARKING.md documents how to write one. Or am I missing something?

I suppose it would be better to add this benchmark in a separate issue, so that we have reference numbers before benchmarking this PR?

PR #12021

@nonontb nonontb force-pushed the feature/sort-input-fd-optimization branch from 563ddcf to 8c39166 Compare April 27, 2026 17:24
@nonontb
Contributor Author

nonontb commented Apr 27, 2026

I ran the benchmark:

  • main:
Timer precision: 26 ns
sort_bench_merge           fastest       │ slowest       │ median        │ mean          │ samples │ iters
╰─ merge_pre_sorted_files  9.209 ms      │ 102.4 ms      │ 9.962 ms      │ 10.91 ms      │ 100     │ 100
  • This branch:
Timer precision: 25 ns
sort_bench_merge           fastest       │ slowest       │ median        │ mean          │ samples │ iters
╰─ merge_pre_sorted_files  9.243 ms      │ 99.39 ms      │ 9.981 ms      │ 11.18 ms      │ 100     │ 100

More or less the same

@nonontb nonontb force-pushed the feature/sort-input-fd-optimization branch from 3b51fcf to ff325ce Compare April 29, 2026 09:26
@nonontb
Contributor Author

nonontb commented Apr 30, 2026

The last failing test is unrelated to sort and to the code touched by this PR.

@nonontb nonontb force-pushed the feature/sort-input-fd-optimization branch from ff325ce to 0f0942d Compare May 1, 2026 17:29
@nonontb
Contributor Author

nonontb commented May 15, 2026

Hello @sylvestre,

Is this PR ready to merge on your side?
It will fix the last failing test for the sort utility: sort-merge-fdlimit.

Please tell me if you need me to do more work on this.

@nonontb nonontb requested a review from sylvestre May 15, 2026 11:03
@sylvestre
Contributor

sorry for the latency

Comment thread src/uu/sort/src/sort_input.rs Outdated
Comment thread tests/by-util/test_sort.rs Outdated
Comment thread src/uu/sort/src/merge.rs Outdated
error,
})?;
// SAFETY: We keep the read_fd open for the lifetime of the memory-map,
// and we only read from it. The file is not modified while the
Contributor

Are you sure? Another process could modify the file between the mmap and when it's read. This is a known mmap hazard - it won't cause UB in practice on Linux (you'd just read stale/partial data), but the safety comments should acknowledge this rather than claiming the file isn't modified.

Contributor Author

No, you're right! The comment is misleading.
What I meant is that the file won't be modified by the current sort process, because writing to the output file doesn't happen while this process is still reading.
I'll amend the comments; tell me what you think.

@sylvestre
Contributor

jobs are still failing

@@ -0,0 +1,621 @@
// This file is part of the uutils coreutils package.
Contributor

this large file is hard to review :(

Contributor Author

Yes!
I will go through it again and try to make it smaller.

@nonontb nonontb force-pushed the feature/sort-input-fd-optimization branch from ab02b32 to 0980e3e Compare May 28, 2026 16:53
@nonontb nonontb closed this May 28, 2026
@nonontb nonontb force-pushed the feature/sort-input-fd-optimization branch from 0980e3e to 55549fa Compare May 28, 2026 17:36
@nonontb nonontb reopened this May 28, 2026

Development

Successfully merging this pull request may close these issues.

sort opens too many files

2 participants