feat(accelerator): numba costes colocalization #62
Draft
timtreis wants to merge 4 commits into
Per-object port of get_correlation_costes + bisection_costes/linear_costes into core/numba/_costes.py::costes_per_object, reusing the #60 grouped layout (labels_to_offsets + flatten_pairs_grouped), so no sort is needed.

All three fast_costes modes are covered: M_FASTER (bisection) and M_FAST/M_ACCURATE (linear), with the reference control flow bit-reproduced (window math, num_true recompute cache, > vs >= threshold asymmetry). The reference's dead calculate_threshold call is skipped; thr is accepted for parity but unused.

Pearson-on-subset matches scipy.stats.pearsonr's order of operations (centre, normalise each vector, accumulate, clamp). With error_model="numpy", a constant subset yields NaN instead of raising ZeroDivisionError, matching scipy's ConstantInputWarning -> nan.

Results are exact vs numpy on float pixels (scale=1); integer dtypes diverge by design, because the reference overflows z = fi + si in uint8/uint16. bzyx handling goes via to_bzyx twice, like the other coloc features.

Speedup: 41.8x (1080^2, 144 obj, float).

Tests: kernel control flow vs the real reference search at scale=255 (exercises the multi-iteration path), pearson vs scipy, regression vs reference, end-to-end golden 2D/3D/batch x 3 modes. Full suite 145 passed, lint clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
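The linear modes walk the threshold down the regression line, and the num_true cache means Pearson is recomputed only when the subset count changes. Below is a minimal NumPy sketch of that idea. `linear_costes_sketch` and `pearson_or_nan` are hypothetical names, the slope `a` and intercept `b` are taken as given, and the subset predicate `combt = (fi < t) | (si < a*t + b)` with `a > 0` is an assumption. It uses a fixed step and does not reproduce the reference's adaptive step sizes or window math.

```python
import math

import numpy as np


def pearson_or_nan(x, y):
    """Pearson r, or NaN when undefined (too few pixels or a constant vector)."""
    if x.size <= 2 or np.all(x == x[0]) or np.all(y == y[0]):
        return math.nan
    return float(np.corrcoef(x, y)[0, 1])


def linear_costes_sketch(fi, si, a, b, step, use_cache=True):
    """Hypothetical fixed-step walk of the fi threshold t down the line si = a*t + b.

    Stops at the first t where r over combt = (fi < t) | (si < a*t + b) is <= 0.
    Assumes a > 0, so combt only shrinks as t falls and an unchanged count
    means an unchanged subset (which is what makes the cache exact).
    Returns ((thr_fi, thr_si), number_of_pearson_evaluations).
    """
    t = step * (math.floor(max(fi.max(), si.max()) / step) + 1)  # start above the max
    thr = (t, a * t + b)
    num_true = None  # subset size the cached r belongs to
    r = math.nan
    n_pearson = 0
    while t > step:
        thr = (t, a * t + b)
        combt = (fi < thr[0]) | (si < thr[1])
        cnt = int(np.count_nonzero(combt))
        if not use_cache or cnt != num_true:  # recompute only if the subset changed
            r = pearson_or_nan(fi[combt], si[combt])
            n_pearson += 1
            num_true = cnt
        if r <= 0:  # NaN compares False, so an undefined r keeps walking
            break
        t -= step
    return thr, n_pearson
```

The `use_cache=False` path exists only to check that caching changes the number of Pearson evaluations, not the threshold that is found.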
- Extract _flatten_image() (mask-contiguity + labels_to_offsets + flatten_pairs_grouped), shared by _run and the costes runner; this removes the duplicated per-image prep chain.
- Drop the dead any_fi/any_si flags in costes_per_object: tot_* is read only when n_comb > 0, which already guarantees a pixel strictly above each threshold, so the reference's any(>thr) guard is always true there.

Behaviour-preserving; 57 coloc/costes tests green, lint clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
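One way to read the grouped layout is as a counting sort keyed on the object label: count pixels per label, prefix-sum the counts into offsets, then scatter each pixel into its object's slice in raster order. That is a single O(N) pass with no comparison sort. The sketch below is illustrative only: these function bodies are assumptions, not the actual labels_to_offsets/flatten_pairs_grouped, and label 0 is assumed to be background.

```python
import numpy as np


def labels_to_offsets(labels: np.ndarray, n_objects: int) -> np.ndarray:
    """Hypothetical: start offset of each object's slice (labels 1..n_objects).

    The last entry is the total number of foreground pixels.
    """
    counts = np.bincount(labels.ravel(), minlength=n_objects + 1)[1 : n_objects + 1]
    return np.concatenate(([0], np.cumsum(counts))).astype(np.int64)


def flatten_pairs_grouped(fi, si, labels, offsets):
    """Hypothetical: scatter (fi, si) pixel pairs into per-object slices.

    Raster order is preserved inside each object; the loop is the part a
    numba kernel would compile.
    """
    n_objects = len(offsets) - 1
    out_fi = np.empty(offsets[-1], dtype=np.float64)
    out_si = np.empty(offsets[-1], dtype=np.float64)
    cursor = offsets[:-1].copy()  # next free slot for each object
    flat_labels, flat_fi, flat_si = labels.ravel(), fi.ravel(), si.ravel()
    for i in range(flat_labels.size):  # single O(N) pass
        lab = flat_labels[i]
        if lab == 0 or lab > n_objects:
            continue  # background (or an out-of-range label) is skipped
        pos = cursor[lab - 1]
        out_fi[pos] = flat_fi[i]
        out_si[pos] = flat_si[i]
        cursor[lab - 1] += 1
    return out_fi, out_si
```

Object k's pixel pairs are then the contiguous slices `out_fi[offsets[k-1]:offsets[k]]` and `out_si[offsets[k-1]:offsets[k]]`, which a per-object kernel can walk without any indirection.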
bisection_costes called _count_combt then _pearson_combt for each visited threshold, but _pearson_combt's first pass recomputes the exact same subset count. Fuse them into _count_pearson_combt -> (cnt, r): one count pass, with the Pearson passes running only when cnt > 2 (otherwise r is nan and unused).

Bit-identical to the previous kernel (same predicate, same accumulation order; verified array_equal across correlated AND anti-correlated objects, i.e. both search directions). ~9% faster (23.0 -> 20.9 ms, 1080^2/144 obj). 31 costes tests green. (The linear/accurate modes keep _count_combt + the num_true cache.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
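A plain-Python sketch of the fusion follows. It assumes the subset predicate is combt = (fi < thr_fi) | (si < thr_si) and that the Pearson's first pass is the one accumulating the sums for the means, so the count can ride along with it. `count_pearson_combt` is a hypothetical stand-in for `_count_pearson_combt`. The later passes follow the order described for the port (centre, normalise each vector, accumulate, clamp), and a constant subset yields NaN instead of raising.

```python
import math

import numpy as np


def count_pearson_combt(fi, si, thr_fi, thr_si):
    """Hypothetical fused helper: (cnt, r) for combt = (fi < thr_fi) | (si < thr_si).

    The subset count is taken in the same pass that accumulates the sums for
    the means. The Pearson passes only run when cnt > 2; otherwise r is NaN.
    """
    cnt = 0
    sx = sy = 0.0
    x_lo = y_lo = math.inf
    x_hi = y_hi = -math.inf
    # Pass 1: count + sums (+ min/max for an exact constant-input check).
    for x, y in zip(fi, si):
        if x < thr_fi or y < thr_si:
            cnt += 1
            sx += x
            sy += y
            x_lo, x_hi = min(x_lo, x), max(x_hi, x)
            y_lo, y_hi = min(y_lo, y), max(y_hi, y)
    if cnt <= 2 or x_lo == x_hi or y_lo == y_hi:
        return cnt, math.nan  # too few pixels or a constant vector
    mx, my = sx / cnt, sy / cnt
    # Pass 2: norms of the centred vectors.
    sxx = syy = 0.0
    for x, y in zip(fi, si):
        if x < thr_fi or y < thr_si:
            sxx += (x - mx) ** 2
            syy += (y - my) ** 2
    nx, ny = math.sqrt(sxx), math.sqrt(syy)
    # Pass 3: accumulate products of the normalised vectors, then clamp to [-1, 1].
    r = 0.0
    for x, y in zip(fi, si):
        if x < thr_fi or y < thr_si:
            r += ((x - mx) / nx) * ((y - my) / ny)
    return cnt, max(-1.0, min(1.0, r))
```

Each visited threshold in a bisection then costs one call to this helper instead of a count pass followed by a Pearson call that counts again.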
Follow-up perf (commit 8b4ee9e): the bisection search called
…y features

The five public get_correlation_* functions each re-ran the whole fused coloc_per_object kernel (+ flatten), so computing several coloc features paid ~5x redundant work, and the numba backend lost to merged numpy on manders/overlap/rwc at the large size.

Add get_correlation_all(p1, p2, masks, features=None): one flatten + one kernel pass returns the requested coloc groups (None = all). The cheap block (Pearson+slope, Manders, Overlap, K) is one pass; RWC's rank sort and the Costes kernel are gated to the requested set. It is stateless: fusion happens by requesting the set in ONE call, not via any cache. It's the efficient entry point for any caller (not only featurize).

The five single-feature functions become thin gated wrappers over it (single source; each now computes only its tier, so even one direct call is minimal). The numba correlation registry KEEPS per-group keys so the featurizer's per-group selection (_collect_correlation_features) keeps working; an earlier single-entry registry broke featurize with KeyError 'pearson'.

Large (1080^2, 142 obj): all-coloc 38 ms (fused) vs 103 ms (5 separate) vs 140 ms (numpy) = 2.7x / 3.7x.

Tests: subset/gating, bit-identity vs the wrappers, empty/batch/3D, unknown-group error, and featurize-runs-under-numba.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
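The gating can be sketched as follows. Only the signature get_correlation_all(p1, p2, masks, features=None), the thin-wrapper pattern, and the per-group registry keys come from the commit message. The group names, the placeholder kernels, the call counter, and raising ValueError for an unknown group are assumptions, chosen so the control flow can be observed.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Optional

import numpy as np

CALLS: Counter = Counter()  # instrumentation so the gating is observable

CHEAP_GROUPS = ("pearson", "manders", "overlap")  # hypothetical group names
GATED_GROUPS = ("rwc", "costes")
ALL_GROUPS = CHEAP_GROUPS + GATED_GROUPS


# --- placeholder stand-ins for the real flatten and kernels (hypothetical) ---
def _flatten(p1, p2, masks):
    CALLS["flatten"] += 1
    return p1[masks > 0].astype(np.float64), p2[masks > 0].astype(np.float64)


def _cheap_kernel(flat):
    CALLS["cheap"] += 1
    return {g: float(len(flat[0])) for g in CHEAP_GROUPS}  # placeholder values


def _rwc_kernel(flat):
    CALLS["rwc"] += 1  # stands in for the rank sort
    return 0.0


def _costes_kernel(flat):
    CALLS["costes"] += 1
    return 0.0


def get_correlation_all(p1, p2, masks, features: Optional[Iterable[str]] = None) -> Dict[str, float]:
    requested = set(ALL_GROUPS if features is None else features)
    unknown = requested - set(ALL_GROUPS)
    if unknown:
        raise ValueError(f"unknown coloc group(s): {sorted(unknown)}")
    flat = _flatten(p1, p2, masks)  # one flatten shared by every group
    out: Dict[str, float] = {}
    if requested & set(CHEAP_GROUPS):
        cheap = _cheap_kernel(flat)  # one pass yields the whole cheap block
        out.update({g: cheap[g] for g in CHEAP_GROUPS if g in requested})
    if "rwc" in requested:  # expensive tiers run only when asked for
        out["rwc"] = _rwc_kernel(flat)
    if "costes" in requested:
        out["costes"] = _costes_kernel(flat)
    return out


def _single(group: str) -> Callable:
    """Thin gated wrapper: requests exactly one group."""
    def wrapper(p1, p2, masks):
        return get_correlation_all(p1, p2, masks, features=[group])[group]
    wrapper.__name__ = f"get_correlation_{group}"
    return wrapper


# Per-group keys, so a caller can still select features group by group.
CORRELATION_REGISTRY: Dict[str, Callable] = {g: _single(g) for g in ALL_GROUPS}
```

Because nothing is cached between calls, the saving only appears when the whole set is requested in one call; a caller that invokes the wrappers one by one still pays one flatten per call.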
29dbe9e to b0fe736
Stacked on #60. The 5th and final colocalization feature.
Per-object port of `get_correlation_costes` + `bisection_costes`/`linear_costes` into `core/numba/_costes.py::costes_per_object`, reusing #60's grouped layout (`labels_to_offsets` + `flatten_pairs_grouped`), with no sort. All three `fast_costes` modes (`M_FASTER` bisection, `M_FAST`/`M_ACCURATE` linear) are supported, with the control flow bit-reproduced. The reference's dead `calculate_threshold` is skipped; `thr` is accepted for parity but unused.

Pearson-on-subset matches `scipy.stats.pearsonr`'s op order; `error_model="numpy"` so a constant subset → NaN (not `ZeroDivisionError`), matching scipy.

Exact vs numpy on float pixels (scale=1); integer dtypes diverge by design (the reference overflows `z = fi + si`). bzyx via `to_bzyx` twice. Speedup 41.8× (1080², 144 obj).

Tests: kernel control-flow vs the real reference search at scale=255 (exercises the multi-iteration path), pearson vs scipy, regression vs reference, end-to-end golden 2D/3D/batch × 3 modes. Full suite 145 passed, lint clean.

Stack: #59 → #60 → this.
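For the `M_FASTER` mode, a bisection looks for the point where the Pearson correlation of the below-threshold pixels stops being positive. The sketch below is a simplified illustration, not a port of `bisection_costes`. It fits the orthogonal (total least squares) line `si = a*fi + b`, assumes a positive slope, starts from an empty subset at the bottom of the range, and bisects on the `fi` threshold with a fixed tolerance. The helper names are hypothetical.

```python
import math

import numpy as np


def orthogonal_fit(fi, si):
    """Slope/intercept of the orthogonal (total least squares) line si = a*fi + b.

    Assumes a non-zero covariance; a positive covariance gives a > 0.
    """
    xvar, yvar = np.var(fi, ddof=1), np.var(si, ddof=1)
    covar = np.cov(fi, si, ddof=1)[0, 1]
    d = yvar - xvar
    a = (d + math.sqrt(d * d + 4.0 * covar * covar)) / (2.0 * covar)
    b = si.mean() - a * fi.mean()
    return float(a), float(b)


def pearson_below(fi, si, t, a, b):
    """Pearson r over combt = (fi < t) | (si < a*t + b); NaN if undefined."""
    combt = (fi < t) | (si < a * t + b)
    x, y = fi[combt], si[combt]
    if x.size <= 2 or np.all(x == x[0]) or np.all(y == y[0]):
        return math.nan
    return float(np.corrcoef(x, y)[0, 1])


def costes_bisection(fi, si, tol=1e-6):
    """Hypothetical bisection for the fi threshold where r(combt) stops being > 0."""
    a, b = orthogonal_fit(fi, si)
    # lo: both thresholds at or below every pixel -> empty subset -> NaN (not > 0).
    lo = min(fi.min(), (si.min() - b) / a)
    # hi: fi threshold above every pixel -> the subset is the whole image.
    hi = fi.max() + tol
    if not pearson_below(fi, si, hi, a, b) > 0:
        return float(hi), a * hi + b  # already non-positive at the top
    # Invariant: r(hi) > 0 and not r(lo) > 0.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if pearson_below(fi, si, mid, a, b) > 0:
            hi = mid  # still positively correlated: move the threshold down
        else:
            lo = mid  # r <= 0 or NaN: the crossing is above mid
    return float(lo), a * lo + b
```

Bisection needs O(log(range / tol)) Pearson evaluations instead of one per step, at the cost of assuming the sign of r changes only once along the line; the linear modes do not rely on that assumption.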