Fix atomic slide processing locks #213
Open
BWAAEEEK wants to merge 1 commit into
Summary
Fix lock handling for parallel slide processing jobs.
This PR fixes a verified race condition where multiple workers could acquire the same slide lock and proceed with duplicate work for segmentation, patch coordinate generation, patch feature extraction, or slide feature extraction.
The issue was reproduced against the previous lock implementation: concurrent processes could all observe that no lock existed, then each call `create_lock()`. Since the old implementation used `open(lock_file, "w")`, the lock file could be overwritten instead of failing, allowing more than one worker to proceed.
This PR also fixes failure cleanup behavior so failed jobs do not leave stale lock files or partial target outputs that could cause future runs to incorrectly skip work as already completed.
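The difference between the two behaviors described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the JSON metadata layout and the `run_locked` wrapper are assumptions made for the example, and only `create_lock()`, `os.O_CREAT | os.O_EXCL`, and the `pid`/`hostname`/`created_at` fields come from the description.

```python
import json
import os
import socket
import time


def create_lock(lock_file: str) -> bool:
    """Try to claim lock_file atomically; return True only for the winner."""
    flags = os.O_CREAT | os.O_EXCL | os.O_WRONLY
    try:
        # O_EXCL makes the call fail if the file already exists, so the
        # existence check and the creation happen as a single step.
        # open(lock_file, "w") would instead silently overwrite the file.
        fd = os.open(lock_file, flags, 0o644)
    except FileExistsError:
        return False  # another worker already owns the lock
    with os.fdopen(fd, "w") as handle:
        # Assumed metadata format (JSON); the PR only names the fields.
        json.dump(
            {
                "pid": os.getpid(),
                "hostname": socket.gethostname(),
                "created_at": time.time(),
            },
            handle,
        )
    return True


def run_locked(target: str, job) -> bool:
    """Hypothetical wrapper: run job(target) while holding target + '.lock'."""
    lock_file = target + ".lock"
    if not create_lock(lock_file):
        return False  # skipped: someone else is working on this target
    succeeded = False
    try:
        job(target)
        succeeded = True
    finally:
        # Always release the lock; drop partial output if the job failed.
        if os.path.exists(lock_file):
            os.remove(lock_file)
        if not succeeded and os.path.exists(target):
            os.remove(target)
    return True
```

With this shape, a job that raises still propagates its exception to the caller, but neither the lock file nor a half-written target is left behind for the next run to misread.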
What Changed
- Made `create_lock()` atomic using `os.O_CREAT | os.O_EXCL`.
- Changed `create_lock()` to return `True` when acquired and `False` when another worker already owns the lock.
- Added lock metadata: `pid`, `hostname`, and `created_at`.
- Added `finally` cleanup so acquired locks are removed when processing fails.
- A `.h5` with an active `.lock` is treated as locked, not as completed.
Verified Issue
Before the fix, concurrent lock creation was reproduced and showed that multiple workers could proceed against the same target because the lock file was overwritten instead of atomically claimed.
After the fix, the same process-level concurrency test showed only one worker acquiring the lock.
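A process-level check of this kind could look roughly like the sketch below. It is not the test from this PR: the worker script, the `count_winners` helper, and the lock file name are hypothetical, and the worker calls `os.open` directly as a stand-in for the repository's `create_lock()`.

```python
import os
import subprocess
import sys
import tempfile

# Each worker runs in its own Python process and tries to claim the lock.
WORKER = """
import os, sys
try:
    fd = os.open(sys.argv[1], os.O_CREAT | os.O_EXCL | os.O_WRONLY)
except FileExistsError:
    print("busy")
else:
    os.close(fd)
    print("acquired")
"""


def count_winners(num_workers: int = 8) -> int:
    """Start num_workers processes racing for one lock file; count winners."""
    with tempfile.TemporaryDirectory() as tmp:
        lock_file = os.path.join(tmp, "slide.h5.lock")
        # Launch all workers before waiting on any of them so they overlap.
        procs = [
            subprocess.Popen(
                [sys.executable, "-c", WORKER, lock_file],
                stdout=subprocess.PIPE,
                text=True,
            )
            for _ in range(num_workers)
        ]
        results = [proc.communicate()[0].strip() for proc in procs]
    return results.count("acquired")
```

Because exclusive creation either succeeds or raises `FileExistsError`, exactly one of the racing processes should report that it acquired the lock, regardless of how the operating system schedules them.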
Failure cleanup was also verified by reproducing a job failure after partial output creation. After the fix:
- The `.lock` file is removed
Tests
- `python -m compileall -q trident tests`
- `MPLCONFIGDIR=/tmp/trident-mpl .venv/bin/python -m unittest tests.test_wsi_core_behaviors.TestIOLocks tests.test_processor_lifecycle.TestProcessorLifecycle -v`
- `MPLCONFIGDIR=/tmp/trident-mpl .venv/bin/python -m pytest -q`
  - 69 passed, 49 skipped
- `MPLCONFIGDIR=/tmp/trident-mpl .venv/bin/python -m unittest discover -s tests -p 'test_*.py' -v`
  - 116 tests OK, skipped=49
- `PATH="$PWD/.venv/bin:$PATH" MPLCONFIGDIR=/tmp/trident-mpl .venv/bin/sphinx-build -b html docs docs/_build/html`
Skipped tests are existing integration/GPU/heavy dependency tests gated by repo environment flags.