feat(sanitize): central non-contiguous label-ID sanitation#89
Open
timtreis wants to merge 8 commits into
Open
Conversation
Relabel arbitrary or non-contiguous mask label IDs (e.g. {1,5,17}) to
contiguous 1..N as early as possible in the pipeline, so every measurement
function downstream can assume clean labels and focus on the math. The caller's
array is never mutated and results are reported against the original IDs.
- new cp_measure._sanitize: sanitize_masks() built on
skimage.segmentation.relabel_sequential (already a dependency), plus a thin
@sanitize_labels decorator that finds the label argument by name
- featurizer sanitizes once at the top of the per-mask loop and emits original
IDs as row labels (was range(1, n+1))
- @sanitize_labels applied to every core/correlation/numba get_* entry so
direct callers get the same guarantee (cheap idempotent guard)
- raises ValueError on negative / non-integer labels
- tests: gapped {1,5,17} == contiguous baseline per function, featurizer
end-to-end with original-ID rows, no-mutation, validation, 3D, registry
decoration meta-test (partial-aware, multimask excluded)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Benchmark —
|
| size \ objects | 16 | 64 | 256 |
|---|---|---|---|
| 256 | 0.98× | 1.00× | 1.01× |
| 512 | 0.96× | 0.98× | 1.02× |
| 1024 | 0.93× | 0.96× | 0.99× |
feret
| size \ objects | 16 | 64 | 256 |
|---|---|---|---|
| 256 | 0.98× | 0.97× | 0.96× |
| 512 | 0.95× | 0.96× | 0.97× |
| 1024 | 0.93× | 0.97× | 0.93× |
manders_fold
| size \ objects | 16 | 64 | 256 |
|---|---|---|---|
| 256 | 0.82× | 0.93× | 0.98× |
| 512 | 0.78× | 0.82× | 0.96× |
| 1024 | 0.77× | 0.81× | 0.84× |
pearson
| size \ objects | 16 | 64 | 256 |
|---|---|---|---|
| 256 | 0.94× | 0.99× | 0.99× |
| 512 | 0.91× | 0.94× | 1.01× |
| 1024 | 0.88× | 0.91× | 0.96× |
rwc
| size \ objects | 16 | 64 | 256 |
|---|---|---|---|
| 256 | 0.96× | 0.97× | 0.98× |
| 512 | 0.94× | 0.96× | 0.99× |
| 1024 | 0.94× | 0.95× | 0.95× |
texture
| size \ objects | 16 | 64 | 256 |
|---|---|---|---|
| 256 | 1.00× | 0.99× | 0.98× |
| 512 | 1.01× | 0.99× | 0.99× |
| 1024 | 1.11× | 1.03× | 1.00× |
zernike
| size \ objects | 16 | 64 | 256 |
|---|---|---|---|
| 256 | 1.00× | 0.99× | 0.97× |
| 512 | 0.94× | 0.95× | 0.99× |
| 1024 | 1.00× | 1.01× | 0.98× |
Review-driven cleanup of the sanitation layer: - collapse sanitize_masks to a single numpy.unique pass: derive the contiguity test, negative-value guard and original IDs from one sorted array; drop the separate _is_contiguous helper and the redundant max()/min()/second-unique passes - accept boolean masks (treated as one object) instead of raising — restores the pre-existing behaviour of the raw measurement functions - decorate get_correlation_overlap (the one public correlation entry that was missed), so direct callers get correct results on gapped labels - trim the over-long module/function docstrings and the featurizer comment - tests: parametrize the policy/edge cases, exercise the rank remap with non-monotonic labels, assert real multimask functions stay unsanitized, add a bool-mask case (net leaner, 109 passing) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The decorator ran numpy.unique over the whole image on every call (~10 ms on 1024²), a fixed cost that dominated cheap functions and large/few-object images (manders_fold fell to 0.40×). Replace it with a max() gate plus a bincount presence check (~4x faster), falling back to unique only for pathologically large labels. relabel_sequential still handles the rare gapped path. Measured on 1024²: manders_fold 0.74→0.91×, intensity 0.97×, feret 0.99×; expensive functions (zernike/sizeshape/texture) were already ~1.0×. Negatives still raise; bit-identical results. 109 tests pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
featurize() sanitized each mask up front and then called the @sanitize_labels- decorated functions, so every feature re-ran the contiguity scan on the already-clean mask (~12x per object-type). Unwrap the decorator (via the __wrapped__ the wrapper already sets, partial-aware for the legacy path) so the featurizer pays the sanitation cost exactly once per mask. Direct callers still get the decorator's guarantee. Test asserts sanitize_masks is called once per object-type mask. 110 pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Collaborator
Author
|
of note: the faster the actual runtime of a given function, the more the overhead of this hits, even when short-circuiting, f.e. in |
The per-function equivalence tests parametrized over all ~13 measurement functions were redundant: the decorator is uniform, so the coverage meta-test (every public function wrapped) + one representative direct call (zernike, which crashes on gaps without sanitation) + the featurizer end-to-end test cover the same ground. Also merged the mutation/copy cases, dropped the separate numba meta-test, and removed a dead second monkeypatch. 178 -> 102 lines. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
With sanitize_masks guaranteeing contiguous 1..N at every entry, the gap-tolerant label bookkeeping inside the feature functions is dead weight. Remove it: - texture/granularity/radial_distribution/radial_zernikes: drop the internal `numpy.unique(masks)` enumeration (a full-image sort, redundant after the decorator's pass). Use len(props) / range_.size / arange(1, max+1) instead. - numba intensity: labels are 1..N, so the label->index map is just `label - 1`. Drop `lut` from flatten_numba (and the edge path) and compute it inline; nobjects = masks.max(). - delete primitives/segment.py: label_to_idx_lut had no remaining consumers. All outputs verified bit-identical on contiguous input (numpy + numba), numba gapped input maps correctly, backend-correctness + full suite (91) pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ac9950c to
b21be05
Compare
…tring flatten_numba (in this module) builds the flat arrays now; segment.py is gone. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sanitation relabels arbitrary IDs to 1..N; drop the relabel_sequential caveat. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Relabel non-contiguous / arbitrary mask IDs (e.g.
{1,5,17}) to1..Nso measurement code can assume clean labels. Input is never mutated; results map back to original IDs.How
cp_measure._sanitize.sanitize_masks(masks) -> (clean, ids)viaskimage.segmentation.relabel_sequential(existing dep). Singleuniquepass; no-copy fast path when already contiguous; bool masks = one object; raises on negative/non-integer.@sanitize_labelson every core/correlation/numbaget_*for direct callers.