sort: deduplicate file descriptors in merge mode#11961
Conversation
|
GNU testsuite comparison: |
Merging this PR will not alter performance
Comparing Footnotes
|
| @@ -0,0 +1,3 @@ | |||
| 1 | |||
There was a problem hiding this comment.
please generate the files on the fly
a9f249e to
3517780
Compare
| @@ -0,0 +1,6 @@ | |||
| 1 | |||
There was a problem hiding this comment.
please generate this one the fly too
There was a problem hiding this comment.
Yes, I forgot to delete it but the test is "on the fly"
#[test]
fn test_merge_mixed_stdin_and_files() {
let (at, mut ucmd) = at_and_ucmd!();
at.write("merge_duplicates_1.txt", "1\n3\n5\n");
// Verify that sort -m allows mixing stdin with files (GNU Coreutils compatible)
ucmd.arg("-m")
.arg("-")
.arg("merge_duplicates_1.txt")
.pipe_in("apricot\nelderberry\nkiwi\n")
.succeeds()
.stdout_is("1\n3\n5\napricot\nelderberry\nkiwi\n");
}
| // it gets opened for writing. This allows reading the original content | ||
| // via memory-map while writing to the same file, without needing a temp copy. | ||
| let output_as_input = if let Some(name) = output.as_output_name() { | ||
| let output_path = Path::new(name).canonicalize()?; |
There was a problem hiding this comment.
maybe move this into a function?
|
did you look if we have a benchmark covering this? thanks |
I just did and it seems there not relevant benchmark test in I suppose it would be better to add this bench test in another issue to have some reference numbers before benchmarking this PR ? |
PR #12021 |
563ddcf to
8c39166
Compare
|
Run the benchmark:
More or less the same |
3b51fcf to
ff325ce
Compare
|
The last failing test is not related to |
ff325ce to
0f0942d
Compare
|
Hello @sylvestre, Is this PR ready to merge on your side ? Plez, tell me if you need me to do more work on this ? |
|
sorry for the latency |
| error, | ||
| })?; | ||
| // SAFETY: We keep the read_fd open for the lifetime of the memory-map, | ||
| // and we only read from it. The file is not modified while the |
There was a problem hiding this comment.
Are you sure? Another process could modify the file between the mmap and when it's read. This is a known mmap hazard - it won't cause UB in practice on Linux (you'd just read stale/partial data), but the safety comments should acknowledge this rather than claiming the file isn't modified.
There was a problem hiding this comment.
No, you're right! the comment is misleading.
What I meant is the file won't be modified by the current sort process as writing to the output file won't happen during the read part of the current process.
I'll amend the comments => Tell me what you think
|
jobs are still failing |
| @@ -0,0 +1,621 @@ | |||
| // This file is part of the uutils coreutils package. | |||
There was a problem hiding this comment.
this large file is hard to review :(
There was a problem hiding this comment.
Yes !
I will go through it again and try to reduce it
ab02b32 to
0980e3e
Compare
0980e3e to
55549fa
Compare
New tests are added for merging duplicate files.
… alias Mmap as MemoryMap
What This Does
This PR makes sort -m (merge mode) use less (minimum?) opened files.
The Problem
Before:
If you ran sort -m file.txt file.txt file.txt, the program opened file.txt three times eagerly — once for every time it appeared on the command line.
With lots of duplicates or a tight system limit on open files, this could fail.
If you tried to merge a file that was also your output file, the program had to create a temporary copy behind the scenes, using one more file.
GNU version has no issue running the test in #5714
The Fix
Now the program opens each unique file only once and Lazily and use Mmap (memmap2 - unsafe) to manage one FD for all input file duplicates including re-use of output file as inputs.
Result
Fix #5714