perf(sizeshape): vectorize get_zernike on foreground pixels (~8x) #74
timtreis wants to merge 4 commits into
6c4ce25 to acd96a6
`get_zernike` delegated to `centrosome.zernike.zernike`, which scatters the per-pixel Zernike basis into a full `(H, W, K)` complex array (~560 MB at 1080^2, K~30) and scores via ~60 full-image `scipy.ndimage.sum` calls. Both costs scale with image area rather than object pixels, so most work lands on background.

Replace it with a pure-numpy `_zernike_scores` helper that keeps the basis on the masked foreground vectors and segment-sums each moment by label with a single `numpy.bincount`. The Horner basis evaluation is copied verbatim from `centrosome.construct_zernike_polynomials` (same lookup-table coefficients, `r**2 > 1` cutoff and `z = y + i*x` convention), so results track the installed centrosome to floating-point round-off (bit-identical in practice).

The helper lives in `cp_measure.utils` (alongside `masks_to_ijv`) and takes an optional pixel `weight` so the sibling `get_radial_zernikes` can reuse it for intensity-weighted moments in a follow-up PR.

Measured: 8.6x at typical density (782->98 ms large tier), 3.7-44x depending on foreground fraction. No new deps.

Also switches from `range(1, n+1)` to the actual unique labels (identical on contiguous masks, correct for non-contiguous). Adds golden + edge tests (empty, single-pixel r=0, non-contiguous labels, edge-touching, non-default zernike_numbers) asserting parity with centrosome.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
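
The segment-sum idea in this commit message can be illustrated with a minimal sketch. The toy mask and the per-pixel `values` array are hypothetical stand-ins for one basis term; this is not the PR's `_zernike_scores` code, only the `numpy.bincount` pattern it describes.

```python
import numpy

# Toy label image: 0 is background, 1..n are objects (hypothetical data).
masks = numpy.array(
    [
        [0, 1, 1, 0],
        [0, 1, 0, 2],
        [0, 0, 2, 2],
    ]
)
# Hypothetical per-pixel values standing in for one basis term.
values = numpy.arange(masks.size, dtype=float).reshape(masks.shape)

# Gather the foreground once; later work touches object pixels only.
rows, cols = numpy.nonzero(masks)
seg = masks[rows, cols] - 1  # 0-based segment id per foreground pixel
n = int(masks.max())

# One bincount call sums the values of every object at once.
sums = numpy.bincount(seg, weights=values[rows, cols], minlength=n)
counts = numpy.bincount(seg, minlength=n)
```

Here `sums` holds one total per object and `counts` the number of pixels per object, without ever allocating an image-sized array per moment.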
acd96a6 to d75d3f0
Reuse primitives.segment.label_to_idx_lut for the label->row map (correct sizing, find_objects-based) instead of a hand-rolled reverse map keyed on masks.max(); derive labels internally so get_zernike no longer needs its own unique() pass.

Single foreground gather, skip the identically-zero imaginary segment-sum for m==0 moments, and precompute the azimuthal powers once.

Return (real_sums, imag_sums, radii, counts): radii feeds get_zernike's pi*r**2 normalisation, counts the intensity-weighted radial Zernikes (PR #75), which reuse this via the restored `weight` arg. Add weighted + count golden tests vs centrosome so no path ships untested.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ke_scores

Single-pixel objects have an enclosing-circle radius of 0, so the unit-disk coordinate division is 0/0 -> NaN (discarded later by the r**2 > 1 cutoff, matching centrosome). Wrap it in numpy.errstate so the expected RuntimeWarning isn't emitted from this shared helper (it would otherwise crash callers running under -W error::RuntimeWarning).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
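
A minimal sketch of the warning suppression described above, using made-up `radii` and `offsets`. The keep-mask written as `unit**2 <= 1` is an assumption about how the cutoff might be expressed, not the PR's code; it is shown only because comparisons against NaN evaluate to False.

```python
import warnings

import numpy

radii = numpy.array([0.0, 2.0])    # first object is a single pixel: radius 0
offsets = numpy.array([0.0, 1.0])  # pixel offset from each object's centre

with warnings.catch_warnings():
    # Mimic a caller running under -W error::RuntimeWarning.
    warnings.simplefilter("error", RuntimeWarning)
    with numpy.errstate(divide="ignore", invalid="ignore"):
        unit = offsets / radii  # 0/0 -> NaN, with no warning raised

# Comparisons with NaN are False, so an inside-the-disk mask drops it.
keep = unit**2 <= 1
```

Without the `numpy.errstate` block, the division would raise under the `error` filter instead of returning NaN.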
Overall looks good, I think I got a decent notion of the implementation. I added some minor comments, and I think we should reevaluate the lookup table, since our contract is that labels are contiguous. I also added that we shouldn't overload the utils file unless _zernike_scores is used in multiple places. Finally, I suggested removing some tests that are redundant or test something that I plan to remove (e.g., individual function checking dimensionality and returning an empty dict).
Also, please always report how the speedup changes as both image and mask size increase. We should understand the tradeoffs before we fully commit to the new implementation.
    lut, n = label_to_idx_lut(masks)
    k = len(zernike_indexes)
    labels = numpy.flatnonzero(lut >= 0)
    centers, radii = centrosome.zernike.minimum_enclosing_circle(masks, labels)

centrosome is not explicitly imported (it's centrosome.zernike). This should be just zernike.XXX
    ym = (rows - centers[seg, 0]) / radii[seg]
    xm = (cols - centers[seg, 1]) / radii[seg]

    coeffs = centrosome.zernike.construct_zernike_lookuptable(zernike_indexes)

    """Old centrosome path, called with the real label values as indices."""
    uniq = numpy.unique(masks)
    uniq = uniq[uniq > 0]
    zidx = centrosome.zernike.get_zernike_indexes(zernike_numbers + 1)

ditto centrosome.zernike -> zernike to be consistent with the imports (avoiding hidden state). I'm still curious as to whether or not centrosome.zernike automagically imports centrosome itself.
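
On the open question of whether importing a submodule also imports its parent: under standard Python import semantics the parent package is always loaded, but whether the parent's name is bound in the importing module depends on the import form. The stdlib-only sketch below uses `xml.dom` purely as a stand-in for `centrosome.zernike`; it says nothing about how the module under review actually imports centrosome.

```python
import sys

# Isolated namespaces so each import form can be inspected separately.
dotted, aliased = {}, {}

exec("import xml.dom", dotted)           # plain dotted import
exec("import xml.dom as dom", aliased)   # aliased dotted import

# The parent package is loaded in both cases...
parent_loaded = "xml" in sys.modules
# ...but only the plain dotted form binds the top-level name.
dotted_binds_parent = "xml" in dotted
aliased_binds_parent = "xml" in aliased
```

So `import pkg.sub` makes `pkg.sub.f` resolvable, while `import pkg.sub as sub` or `from pkg import sub` leaves `pkg` unbound and only `sub.f` works.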
    import numpy
    from numpy.typing import NDArray

    from cp_measure.primitives.segment import label_to_idx_lut

So maybe I should have caught this in the previous PR that introduced label_to_idx_lut. Doesn't that function avoid assuming dense (i.e., contiguous) labels? Contiguous labels are a contract I decided on for cp_measure, and if assuming them can simplify the code and/or speed things up, then I think we should assume them.
Yeah, after re-reading the function and discussing the options with claude, I think we should not need the lookup table given that we assume a clean (contiguous) input.
I think we should be aware of the costs of this (especially for 3D images, where the find_objects function may be more costly)
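
The trade-off discussed in this thread can be sketched as follows. Both helpers are hypothetical illustrations (in particular, `rows_from_lut` is not the real `label_to_idx_lut`): with contiguous labels the row index is simply `label - 1`, whereas arbitrary labels need a table built from the labels that are present.

```python
import numpy


def rows_from_contiguous(labels):
    """Row index when labels are exactly 1..n: no table needed."""
    return labels - 1


def rows_from_lut(labels):
    """Hypothetical lookup table for arbitrary positive labels."""
    present = numpy.unique(labels)
    lut = numpy.full(int(present.max()) + 1, -1, dtype=numpy.intp)
    lut[present] = numpy.arange(present.size)  # absent labels stay at -1
    return lut[labels]


contiguous = numpy.array([1, 1, 2, 3, 3])
sparse = numpy.array([2, 2, 5, 9, 9])  # gaps: labels 1, 3, 4, ... are absent

dense_rows = rows_from_contiguous(contiguous)
lut_rows = rows_from_lut(sparse)
naive_rows = rows_from_contiguous(sparse)  # leaves empty rows for the gaps
```

If the contiguous-label contract holds, the subtraction gives the same rows as the table while skipping the extra pass over the labels.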
    )


    def _square_objects(size, n, gap_frac=0.75):

My personal preference is to prefix the test-case generators with generate_X e.g., _generate_square_objects.
        assert all(v.shape == (0,) for v in got.values())


    def test_zernike_3d_returns_empty():

We should not encode this here. Eventually we should have a layer that validates which measurements are for 2D, 3D, or both, and this will get in the way (normally it should raise an error; #35 is a band-aid). Whether we get 2D or 3D measurements should be determined before they are actually used (at one of the three entry points, not within the "math").
        return {f"Zernike_{n}_{m}": zf[:, i] for i, (n, m) in enumerate(zidx)}


    def _assert_matches(masks, zernike_numbers=9):

Add a fixture to test more than one zernike number (e.g., 5, 9, 14); uncommon, but worth checking.
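
To see what varying the zernike number changes, the sketch below counts the `(n, m)` index pairs per order. `zernike_index_pairs` is a hypothetical helper, and the enumeration rule (m has the same parity as n, with 0 <= m <= n) is an assumption about the index set, not a copy of centrosome's code. Under that assumption the default of 9 gives 30 pairs, which is consistent with the K≈30 quoted in the description.

```python
def zernike_index_pairs(zernike_numbers):
    """Hypothetical enumeration of (n, m) pairs up to order `zernike_numbers`.

    Assumes m runs over values with the same parity as n, 0 <= m <= n.
    """
    return [
        (n, m)
        for n in range(zernike_numbers + 1)
        for m in range(n % 2, n + 1, 2)
    ]


# Number of moments (K) for each order suggested in the review.
sizes = {z: len(zernike_index_pairs(z)) for z in (5, 9, 14)}
```

A parametrized test over these orders would therefore exercise noticeably different K values rather than three near-identical cases.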
        assert numpy.allclose(counts, ref_counts)


    def test_zernike_scores_unit_weight_equals_unweighted():

I don't love this test, it is essentially checking that the following two branches work:
    w = None if weight is None else weight[keep].astype(float)
    if w is not None:
        s *= w
So there is way more test code than the actual code that it's checking. Please remove this test
        _assert_matches(_square_objects(256, 4))


    def test_zernike_matches_centrosome_irregular():

It may seem silly, but I think this covers most of the previous two tests (single_object and multi_object). Please remove those two and keep this one.
`python -m cp_measure._bench.targets --base <ref> --head <ref>` resolves a PR diff to exactly the measurement functions it changes, for the benchmark action.

Resolution is SYMBOL-level, not file-level: it builds a static symbol-reference graph over the package (AST, resolving intra-package imports incl. submodule and relative imports) and selects a feature iff its call graph transitively reaches a changed symbol. So a shared-helper edit (e.g. utils._zernike_scores) selects only the features that actually use it. Verified on the real PRs: #74 -> {zernike}, #75 -> {radial_zernikes}, where file-closure would have over-selected the ~6 features whose modules merely import utils.

- Rooted at an explicit entry-point table (the get_* registry) so bulk.py's lazy numba/multimask imports can't cause an entry-point to be missed; a test cross-checks the table against the live registries by function identity.
- Reads everything from git refs (git show), diffing against the merge-base, so it matches CI and is correct for stacked PRs given the PR's real base.
- Three distinct states: benchmarked / skipped-unsupported (multimask, numba) / empty, so a multimask-only PR is never mistaken for "no measurement change".
- Tolerates the get_ferret->get_feret cross-branch rename via name candidates.

Hermetic tests build a throwaway git repo + mini package to prove symbol-level precision and the three states; a guarded test checks the real #74/#75 refs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* feat(synth): deterministic synthetic cell-image generator for benchmarking

  Add `cp_measure.synth.generate(image_size, n_objects, n_channels, seed)`, the shipped, importable generator for the PR-benchmark action (build step 1). Produces a cell-like contiguous label mask (organic star-shaped cells placed by gap-respecting dart-throwing, log-normal sizes, no degenerate ~1px objects) plus intensity channels built from a shared smooth envelope + shared/independent multi-scale Gaussian splats, so area, intensity, texture AND colocalisation features all carry real signal. Output is a pure function of the inputs (version stamped via `__version__`); placement is capacity-checked and raises loudly rather than silently under-placing.

  test/test_synth.py replaces the design's "eyeball the examples" gate with programmatic acceptance asserts at the matrix corners (min-size×max-count, max-size×min-count): determinism, contiguous exact count, no degenerate objects, shape/texture/intensity signal, and a controlled sub-unity channel correlation.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(synth): review hardening: single cell-extent, sturdier tests, determinism

  Apply the "fix now" set from the max-effort review of the generator (no behavioural bugs were found; these harden maintainability, the test net, and cross-version reproducibility):

  - Extract `_cell_extent(base_r, amps)` as the single definition of a cell's radial reach, used by both the packing radius (worst-case amps) and the rasterisation window (actual amps). Removes the reach-vs-bulge drift risk that could silently break the no-overlap guarantee if one formula were edited.
  - Strengthen the two toothless tests: texture now asserts median per-object std is well ABOVE the read-noise floor (a splat-removed regression collapses to ~noise and fails); organic-shape now asserts a boundary radial-roughness CV that plain disks fail (the old solidity<0.99 passed for pixelated disks). Both verified to fail on their intended regressions.
  - Determinism: stable sort for tied radii; replace rng.choice(p=...) with inverse-CDF sampling on rng.random (version-stable draw count) so two separately-installed envs can't diverge. Bump __version__ 0.1.0 -> 0.2.0.
  - Widen the brittle seed-averaged correlation band (0.4-0.7 -> 0.35-0.8) so a legitimate constant re-tune doesn't flip it.
  - Per decisions: keep realistic PSF splat bleed; drop the unimplemented "clusters" docstring claim.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(bench): symbol-level PR target mapper (build step 2)

  `python -m cp_measure._bench.targets --base <ref> --head <ref>` resolves a PR diff to exactly the measurement functions it changes, for the benchmark action.

  Resolution is SYMBOL-level, not file-level: it builds a static symbol-reference graph over the package (AST, resolving intra-package imports incl. submodule and relative imports) and selects a feature iff its call graph transitively reaches a changed symbol. So a shared-helper edit (e.g. utils._zernike_scores) selects only the features that actually use it. Verified on the real PRs: #74 -> {zernike}, #75 -> {radial_zernikes}, where file-closure would have over-selected the ~6 features whose modules merely import utils.

  - Rooted at an explicit entry-point table (the get_* registry) so bulk.py's lazy numba/multimask imports can't cause an entry-point to be missed; a test cross-checks the table against the live registries by function identity.
  - Reads everything from git refs (git show), diffing against the merge-base, so it matches CI and is correct for stacked PRs given the PR's real base.
  - Three distinct states: benchmarked / skipped-unsupported (multimask, numba) / empty, so a multimask-only PR is never mistaken for "no measurement change".
  - Tolerates the get_ferret->get_feret cross-branch rename via name candidates.

  Hermetic tests build a throwaway git repo + mini package to prove symbol-level precision and the three states; a guarded test checks the real #74/#75 refs.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* revert(bench): drop the change-detection mapper; benchmark all exposed functions

  Per design decision: the benchmark compares at the main exposed-function level. Run every public get_* feature base-vs-head and let the speedup table show what changed (~1.0x = untouched). This removes the static AST symbol-graph mapper (build step 2) entirely, along with its edge-case surface; benchmark cost is controlled by the matrix size / per-function budget instead of pre-selection.

  - Remove src/cp_measure/_bench/targets.py and test/test_targets.py (keep the _bench package for the upcoming runner).
  - Remove accidentally-committed __pycache__/*.pyc and add a .gitignore for Python bytecode (the repo had none).

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(bench): fixture/runner/comparator: benchmark all get_* head-vs-main

  Build step 2 (v3): the benchmark core, three composable pieces.

  - fixtures.py: build the (image_size x object_count x seed) matrix once from the pinned synth generator, serialise to .npz with a manifest + per-array sha256 (stamps synth.__version__). Both envs load identical, checksum-verified inputs.
  - run.py: `python -m cp_measure._bench.run` times EVERY public get_* function (core arity-1, correlation arity-2, plus a [legacy] variant where a `legacy` param exists) over the fixtures in one environment -> JSON. Channels normalised to [0,1] (the pipeline convention; get_texture requires it). Per-call warmup + reps (min), SIGALRM per-call timeout, thread-pinning set before numpy import. Functions enumerated from the live registry at HEAD; a function that errors on synth input is recorded, not fatal.
  - compare.py: `python -m cp_measure._bench.compare` diffs two run JSONs into a speedup table. speedup = main/head (>1 faster); per cell takes the min then the median across seeds; classifies faster/slower/within-noise/new/removed/no-data. Untouched functions land at ~1.0x, which is the "what changed" signal, no mapper needed.

  Validated end-to-end on a smoke matrix (all 12 functions time ok incl. texture; self-compare is 1.00x). The two-worktree/two-env orchestration is step 3 (workflow).

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* feat(bench): two-job benchmark workflow + orchestration driver (build step 3)

  Wires fixtures -> run(head) + run(main) -> compare -> sticky PR comment.

  - .github/workflows/benchmark.yml: triggered by the `benchmark` label (labels need write access, so the trigger is maintainer-gated) or workflow_dispatch. Two-job split: `build` runs untrusted PR code with `permissions: {}` (no token to steal, persist-credentials off); `report` holds pull-requests/issues:write but never checks out PR code. It only renders the artifact into a sticky `<!-- cp-bench -->` comment and removes the label. fetch-depth: 0 so `main` is present; concurrency cancels superseded runs.
  - .github/scripts/run_benchmark.sh: installs head + main in two isolated uv envs, VENDORS head's synth.py + _bench/ into the main worktree so the generator and tooling are identical across both runs (only cp_measure.core.* differs), builds the fixtures once, runs both, compares.
  - fixtures.py: add CI_MATRIX (bounded for hosted-runner limits, the workflow default; full DEFAULT via dispatch) + a `python -m cp_measure._bench.fixtures` build CLI.

  Validated locally: script bash-syntax, YAML structure (tokenless build, gated report), fixtures CLI, full test suite. End-to-end CI run is via workflow_dispatch.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* refactor(bench): review cleanup: leaner comments/docstrings + small fixes

  Elegance/LOC pass over the benchmark PR (net ~-50 lines, mostly verbose docstrings) plus the real findings from the review:

  Fixes:
  - workflow: `gh api --paginate | head` SIGPIPE under pipefail could abort the comment post on PRs with many comments. Use a single `?per_page=100` page + `--jq 'first'` instead. Add `if: always()` to upload-artifact so a failed run still surfaces partial output. Drop the redundant matrix default + useless cat.
  - run.py: build call-args INSIDE the guarded path so an input a function can't handle (e.g. a 1-channel fixture) is recorded per-cell, not fatal. Record the matrix + fixture count in meta so the comment shows which sweep ran; note the shared-fn JIT caveat for [legacy] variants.
  - compare.py: label the status column (was a blank header); guard head_t==0; surface the matrix scope in the header.
  - run_benchmark.sh: trap-based cleanup of the temp dir/worktree/venvs (was leaked).
  - .gitignore: ignore local benchmark artifacts (bench-out/, *.npz).

  Cleanup: trim the synth/bench/test module docstrings and synth's per-constant comments to their load-bearing facts; collapse generate()'s numpydoc block; drop the unused load_fixture(verify=...) flag; de-clever _norm01's constant-image path. Kept the _cell_extent single-source helper (an earlier review's no-overlap fix). 31 tests pass, ruff clean.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* test(bench): trim to a lean regression set

  Consolidate the over-exhaustive acceptance tests (356 -> 110 lines, 7 tests):

  - synth: one invariants test (shape/dtype/contiguous count, no degenerate objects, shape+size variety, intensity/texture/coloc signal) + determinism + edges, all at a single representative config instead of parametrising every check over both matrix corners. Drop the radial-roughness disk-vs-organic discriminator (eccentricity spread + size variety still catch a broken gen).
  - bench: merge the fixture build/load/determinism cases, fold enumerate into the run integration test, and collapse the compare classification/render cases into one. Drop the trivial _norm01 and standalone CLI tests.

* style: ruff-format with current ruff (88-col line wraps)

* chore: stop tracking scratch tasks/ and .claude/ (added by mistake)

* refactor(bench): report raw main/head timings, drop noise-band classification

  Per single-function resolution: show each function's main vs head time as mean (min-max) over reps x seeds plus the raw main/head ratio, and let the maintainer read their function directly. Removes the faster/slower/within-noise band (which a noisy/sequential run could mislabel) and any normalisation; run.py now stores just the rep times.

* Revert "Merge pull request #80 from afermg/feat/synth-bench-generator"

  This reverts commit 3809218, reversing changes made to 7f67606.

* feat(bench): synthetic-image PR performance benchmark action

  Single revertable unit; re-introduces the benchmark mechanic (reverted #80) with the harness-source fix folded in.

  - cp_measure/synth.py: deterministic synthetic cell-image generator.
  - cp_measure/_bench/{fixtures,run,compare}.py: build the (size x count x seed) fixture matrix, time every get_* head-vs-main, report raw mean (min-max) timings.
  - .github/workflows/benchmark.yml + scripts/run_benchmark.sh: label-triggered two-job workflow. The harness is checked out from main (not the PR head, which a perf PR does not carry); the PR head is fetched as a worktree, main's synth.py + _bench/ vendored in, and only cp_measure.core differs between the timed runs.

* Revert "feat(bench): synthetic-image PR performance benchmark action"

* demo: self-contained PR benchmark action (simplified)

  Everything lives on this branch (nothing on main). on: pull_request runs the workflow from the PR branch on every commit, times every public get_* on the PR head vs main, and posts a sticky comment with the timing table.

  - synth.py: minimal generator: n ellipses on a regular grid + a few random Gaussian blobs per channel.
  - _bench/{fixtures,run,compare}.py: build fixtures, time all get_* head-vs-main, raw timings table.
  - .github: single-job pull_request workflow (no label, no pull_request_target) + head-based driver that vendors the tooling into a main worktree.
  - includes the granularity speedup (#76) so the demo table shows a real delta.

* demo: move the whole benchmark into .github/scripts (no package module)

  Remove src/cp_measure/{synth.py,_bench/} and their tests. Everything now lives in .github/scripts/benchmark.py, a single self-contained script (generator + runner + comparator); each env regenerates the same seeded inputs, so nothing is shared or vendored. Table now references the commit and emits one grid per affected function (speedup >= 1.1x) with image size as rows and object count as columns.

* demo: extend benchmark matrix to 4 sizes x 2 counts (256–2048)

  Grid now spans image sizes 256/512/1024/2048 (rows) x object counts 16/64 (cols); bump the job timeout to 45m for the larger sizes.

* demo: median per cell, 3 seeds x 3 counts, dynamic affected-threshold caption

  - per-cell aggregate is now the median (over seeds x reps); speedup = median/median
  - matrix: sizes 256-2048 (rows) x counts 16/64/256 (cols) x 3 seeds = 36 cells
  - caption derives the cutoff from AFFECTED (≥1.1x) instead of hardcoding >1
  - job timeout 60m for the larger matrix

* demo: drop 256px image size (unrealistically small)

  Sizes now 512/1024/2048 x counts 16/64/256 x 3 seeds = 27 cells (3x3 grid).

* demo: shift matrix down to 256-1024 (drop slow 2048)

  Sizes 256/512/1024 x counts 16/64/256 x 3 seeds; 2048 was too slow per commit.

* demo: report regressions too: flag functions that moved >=1.05x either way

  Was speedup>=1.1x only (regression-blind: a slowdown reported 'no change'). Now a function is shown if any cell is >=1.05x faster OR <=1/1.05x slower; header notes >1 faster / <1 slower.

* demo: slim benchmark.py: drop unused bits

  - remove the n_channels param (always 2: ch0 for core, ch0+ch1 for coloc)
  - drop 'from __future__ import annotations' (unneeded on the 3.12 runner)
  - .gitignore: drop *.npz (no fixture files are written anymore)

* revert(granunlarity): it has an independent PR, was used as test

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Alán F. Muñoz <afer.mg@gmail.com>
What
`get_zernike` delegated to `centrosome.zernike.zernike`, which scatters the per-pixel Zernike basis into a full `(H, W, K)` complex array (~560 MB at 1080², K≈30) and scores via ~60 whole-image `scipy.ndimage.sum` calls, both scaling with image area, not object pixels.

This replaces it with a pure-numpy `_zernike_scores` helper that keeps the basis on the masked foreground vectors and segment-sums each moment by label with `numpy.bincount`. The Horner basis evaluation is copied verbatim from `centrosome.construct_zernike_polynomials` (same lookup table, `r² > 1` cutoff, `z = y + i·x` convention), so results track centrosome to round-off: bit-exact in nearly all cases, worst case ~1e-16.

Performance
Speedup scales with foreground fraction (object pixels ÷ image area):
Notes
`_zernike_scores(masks, zernike_indexes, weight=None)` returns `(real_sums, imag_sums, radii, counts)`, a shared primitive: `get_zernike` normalises by `π·radii²`, the intensity-weighted radial Zernikes (PR #75, "perf(radial): vectorize get_radial_zernikes via shared _zernike_scores (~2x)") by `counts` via the optional `weight`. Weighted + count paths are golden-tested against centrosome.
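
As an illustration of how a caller might consume the returned sums, here is a minimal sketch of the `π·radii²` normalisation. The `magnitudes` helper, the `(objects, moments)` array layout, and taking the modulus of the complex sum before dividing are all assumptions for this example; only the normalisation by `π·radii²` is stated above.

```python
import numpy


def magnitudes(real_sums, imag_sums, radii):
    """Per-object moment magnitude, normalised by the enclosing-circle area.

    Assumption: score = |sum| / (pi * r**2), with sums shaped (objects, moments).
    """
    area = numpy.pi * radii**2
    # Radius-0 objects would divide by zero; silence that expected warning.
    with numpy.errstate(divide="ignore", invalid="ignore"):
        return numpy.hypot(real_sums, imag_sums) / area[:, None]


# Hypothetical sums for 2 objects x 2 moments.
real_sums = numpy.array([[3.0, 0.0], [6.0, 8.0]])
imag_sums = numpy.array([[4.0, 0.0], [8.0, 6.0]])
radii = numpy.array([1.0, 2.0])
scores = magnitudes(real_sums, imag_sums, radii)
```

Keeping the normalisation outside the shared helper is what lets the weighted radial variant divide by `counts` instead.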