Fix atomic slide processing locks by BWAAEEEK · Pull Request #213 · mahmoodlab/TRIDENT

BWAAEEEK · 2026-05-13T05:38:29Z

Summary

Fix lock handling for parallel slide processing jobs.

This PR fixes a verified race condition where multiple workers could acquire the same slide lock and proceed with duplicate work for segmentation, patch coordinate generation, patch feature extraction, or slide feature extraction.

The issue was reproduced against the previous lock implementation: concurrent processes could all observe that no lock existed, then each call create_lock(). Since the old implementation used open(lock_file, \"w\"), the lock file could be overwritten instead of failing, allowing more than one worker to proceed.

This PR also fixes failure cleanup behavior so failed jobs do not leave stale lock files or partial target outputs that could cause future runs to incorrectly skip work as already completed.

What Changed

Made create_lock() atomic using os.O_CREAT | os.O_EXCL.
Changed create_lock() to return True when acquired and False when another worker already owns the lock.
Preserved lock metadata: pid, hostname, and created_at.
Added cleanup if lock metadata writing fails after the lock file is created.
Updated Processor lock handling for segmentation, patch coordinate generation, patch feature extraction, and slide feature extraction.
Added finally cleanup so acquired locks are removed when processing fails.
Removed incomplete target output files when a locked job fails after partially writing output.
Added post-lock output checks to avoid duplicate work if another worker finishes between the initial preflight check and lock acquisition.
Fixed coords handling so an existing .h5 with an active .lock is treated as locked, not as completed.

Verified Issue

Before the fix, concurrent lock creation was reproduced and showed that multiple workers could proceed against the same target because the lock file was overwritten instead of atomically claimed.

After the fix, the same process-level concurrency test showed only one worker acquiring the lock:

{'acquired': 1, 'results': [False, True, False, False, False, False, False, False]}

Failure cleanup was also verified by reproducing a job failure after partial output creation. After the fix:

the .lock file is removed
the partial target output is removed
the next run retries the work instead of incorrectly skipping it as completed

Tests

python -m compileall -q trident tests
MPLCONFIGDIR=/tmp/trident-mpl .venv/bin/python -m unittest tests.test_wsi_core_behaviors.TestIOLocks tests.test_processor_lifecycle.TestProcessorLifecycle -v
MPLCONFIGDIR=/tmp/trident-mpl .venv/bin/python -m pytest -q
- 69 passed, 49 skipped
MPLCONFIGDIR=/tmp/trident-mpl .venv/bin/python -m unittest discover -s tests -p 'test_*.py' -v
- 116 tests OK, skipped=49
PATH=\"$PWD/.venv/bin:$PATH\" MPLCONFIGDIR=/tmp/trident-mpl .venv/bin/sphinx-build -b html docs docs/_build/html

Skipped tests are existing integration/GPU/heavy dependency tests gated by repo environment flags.

Fix atomic slide processing locks

3920464

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix atomic slide processing locks#213

Fix atomic slide processing locks#213
BWAAEEEK wants to merge 1 commit into
mahmoodlab:mainfrom
BWAAEEEK:fix/atomic-slide-locks

BWAAEEEK commented May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

BWAAEEEK commented May 13, 2026

Summary

What Changed

Verified Issue

Tests

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant