bench: arithmetic operation micro-benchmarks #805
An op-level tier alongside the whole-model builds: one benchmark per (operation, size profile), operands built outside the measured region so a run isolates a single op rather than a whole build. This attributes perf changes to a specific arithmetic path: a build benchmark says "kvl got heavier", an op benchmark says "expr+expr broadcast got heavier".

- benchmarks/ops.py: op registry (OpSpec) + size profiles (small 1D×2000; large 3D×3×4×1000; they differ in element count *and* dim count, and the asymmetric shape also catches dim-order bugs) + ~30 ops across scaling, var/expr arithmetic, quadratic, reductions and constraint construction. Binary labelled ops carry match/broadcast variants, the alignment-path axis where the interesting regressions live.
- benchmarks/drivers/test_ops.py: parametrized driver, one benchmark per (op, profile).
- conftest: add test_ops to CODSPEED_MODULES (tracked; memory advisory).

60 benchmarks, ~80s/run with memory. Signal validates: large ≈ 6× small, broadcast ≈ 5× match (the §9 cross-product).
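As a quick sanity check on the two size profiles, the element counts can be worked out directly. This is a minimal sketch that assumes "1D×2000" means a single dimension of length 2000; the `PROFILES` dict and the `element_count` helper are illustrative names, not taken from benchmarks/ops.py.

```python
from math import prod

# Shapes as described in the commit message. The dict layout is an
# assumption for illustration, not the structure used in benchmarks/ops.py.
PROFILES = {
    "small": (2000,),       # 1-D, 2000 elements
    "large": (3, 4, 1000),  # 3-D, asymmetric dims
}


def element_count(shape):
    """Total number of elements in an array of the given shape."""
    return prod(shape)


counts = {name: element_count(shape) for name, shape in PROFILES.items()}
ratio = counts["large"] / counts["small"]
print(counts, ratio)
```

Under that reading the large profile holds 6× the elements of the small one, which is consistent with the reported large ≈ 6× small timing ratio if op cost is roughly linear in element count.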
Merging this PR will not alter performance
Collapse to one 3-D profile (3×4×1000, ~12 K elements). CodSpeed records time *and* memory per benchmark, so a second size wasn't buying a separate signal; one multi-dim profile keeps broadcast/alignment coverage with MB-scale ops above the noise floor, and halves the matrix. Benchmark ids drop the size suffix.

Add three categories: absence/masking (expr.where / fillna / absence propagation; §4–§7, the semantics-heavy surface), groupby.sum, and an N-way merge (constraint-assembly cost).

35 ops, ~45 s/run with memory.
Note
The following content was generated by AI.

CodSpeed cost check. This PR adds 35 arithmetic micro-benchmarks (one
The bare-metal walltime job remains gated to |
I really want to make sure the v1 convention doesn't introduce regressions to linopy. And I think as soon as we decide on what to do with the multiindex stuff, there will be room for improvement. This should lay a nice foundation for that.
CodSpeed cost is really small, but coverage (and granularity) really improves.
Note
AI-assisted (Claude Code): implementation and this description; reviewed by me.
Adds an op-level benchmark tier to `benchmarks/`, alongside the whole-model build benchmarks. One benchmark per operation, with operands built outside the measured region so a run isolates a single arithmetic op.

Why. Whole-build benchmarks catch a regression but can't attribute it: a build says "kvl got heavier", an op benchmark says "`expr+expr` broadcast got heavier". (Motivated by a real regression hunt where attribution needed exactly this granularity.)

What.

- `benchmarks/ops.py`: op registry (`OpSpec`) + a single 3-D `grid` size profile (dims 3×4×1000; the asymmetric shape also catches dim-order/transpose bugs) + 35 ops across scaling, var/expr arithmetic, quadratic, reductions, masking, groupby, merge, and constraint construction. Binary labelled ops carry `match`/`broadcast` variants, the alignment-path axis where the interesting regressions live.
- `benchmarks/drivers/test_ops.py`: parametrized driver, one benchmark per op.
- `conftest.py`: `test_ops` added to `CODSPEED_MODULES` (tracked; memory advisory).

Cost. 35 benchmarks; the memory run stays ~2–2.5 min including the model builds that dominate the job, so it is cheap.

Signal validated. `broadcast ≈ 5× match` on the alignment axis (the §9 cross-product), well above the noise floor.

Memory is report-only to start (op-scale memory can be noisy); op-time is the natural gate candidate once the signal proves stable.
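The registry-plus-driver shape described above can be sketched as follows. This is a minimal stdlib-only illustration, not the code in `benchmarks/ops.py`: the `OpSpec` fields (`name`, `setup`, `run`), the `register` and `measure` helpers, and the two toy ops on plain lists are all assumptions made for the example. The point it shows is that operands are built once, outside the timed region, so only the op itself is measured.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class OpSpec:
    """Hypothetical registry entry; the real OpSpec fields may differ."""
    name: str
    setup: Callable[[], tuple]  # builds operands; never timed
    run: Callable[..., Any]     # the single op under measurement


REGISTRY: dict[str, OpSpec] = {}


def register(spec: OpSpec) -> OpSpec:
    """Add an op to the registry, rejecting duplicate names."""
    if spec.name in REGISTRY:
        raise ValueError(f"duplicate op name: {spec.name}")
    REGISTRY[spec.name] = spec
    return spec


def _vec(n: int) -> list:
    return [float(i) for i in range(n)]


# Toy stand-ins for real ops: plain lists instead of labelled arrays.
register(OpSpec("scale", lambda: (_vec(12_000), 2.0),
                lambda xs, k: [x * k for x in xs]))
register(OpSpec("add[match]", lambda: (_vec(12_000), _vec(12_000)),
                lambda a, b: [x + y for x, y in zip(a, b)]))


def measure(spec: OpSpec, repeats: int = 5) -> float:
    """Best-of-N wall time of spec.run, with operand construction excluded."""
    operands = spec.setup()  # outside the measured region
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        spec.run(*operands)  # measured region: only the op
        best = min(best, time.perf_counter() - start)
    return best


# One benchmark per registered op, keyed by its id.
benchmark_ids = sorted(REGISTRY)
timings = {name: measure(REGISTRY[name]) for name in benchmark_ids}
```

In a pytest driver, the same ids would presumably be fed to `pytest.mark.parametrize`, with the timed call handed to the benchmark fixture instead of the hand-rolled `measure` loop used here.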