Skip to content

Fix atomic slide processing locks#213

Open
BWAAEEEK wants to merge 1 commit into
mahmoodlab:mainfrom
BWAAEEEK:fix/atomic-slide-locks
Open

Fix atomic slide processing locks#213
BWAAEEEK wants to merge 1 commit into
mahmoodlab:mainfrom
BWAAEEEK:fix/atomic-slide-locks

Conversation

@BWAAEEEK

Copy link
Copy Markdown

Summary

Fix lock handling for parallel slide processing jobs.

This PR fixes a verified race condition where multiple workers could acquire the same slide lock and proceed with duplicate work for segmentation, patch coordinate generation, patch feature extraction, or slide feature extraction.

The issue was reproduced against the previous lock implementation: concurrent processes could all observe that no lock existed, then each call create_lock(). Since the old implementation used open(lock_file, \"w\"), the lock file could be overwritten instead of failing, allowing more than one worker to proceed.

This PR also fixes failure cleanup behavior so failed jobs do not leave stale lock files or partial target outputs that could cause future runs to incorrectly skip work as already completed.

What Changed

  • Made create_lock() atomic using os.O_CREAT | os.O_EXCL.
  • Changed create_lock() to return True when acquired and False when another worker already owns the lock.
  • Preserved lock metadata: pid, hostname, and created_at.
  • Added cleanup if lock metadata writing fails after the lock file is created.
  • Updated Processor lock handling for segmentation, patch coordinate generation, patch feature extraction, and slide feature extraction.
  • Added finally cleanup so acquired locks are removed when processing fails.
  • Removed incomplete target output files when a locked job fails after partially writing output.
  • Added post-lock output checks to avoid duplicate work if another worker finishes between the initial preflight check and lock acquisition.
  • Fixed coords handling so an existing .h5 with an active .lock is treated as locked, not as completed.

Verified Issue

Before the fix, concurrent lock creation was reproduced and showed that multiple workers could proceed against the same target because the lock file was overwritten instead of atomically claimed.

After the fix, the same process-level concurrency test showed only one worker acquiring the lock:

{'acquired': 1, 'results': [False, True, False, False, False, False, False, False]}

Failure cleanup was also verified by reproducing a job failure after partial output creation. After the fix:

  • the .lock file is removed
  • the partial target output is removed
  • the next run retries the work instead of incorrectly skipping it as completed

Tests

  • python -m compileall -q trident tests
  • MPLCONFIGDIR=/tmp/trident-mpl .venv/bin/python -m unittest tests.test_wsi_core_behaviors.TestIOLocks tests.test_processor_lifecycle.TestProcessorLifecycle -v
  • MPLCONFIGDIR=/tmp/trident-mpl .venv/bin/python -m pytest -q
    • 69 passed, 49 skipped
  • MPLCONFIGDIR=/tmp/trident-mpl .venv/bin/python -m unittest discover -s tests -p 'test_*.py' -v
    • 116 tests OK, skipped=49
  • PATH=\"$PWD/.venv/bin:$PATH\" MPLCONFIGDIR=/tmp/trident-mpl .venv/bin/sphinx-build -b html docs docs/_build/html

Skipped tests are existing integration/GPU/heavy dependency tests gated by repo environment flags.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant