perf: scatter groupby-sum terms directly instead of unstacking#793
perf: scatter groupby-sum terms directly instead of unstacking#793FBumann wants to merge 5 commits into
Conversation
The fast path of LinearExpression.groupby(...).sum() used ds.unstack(group_dim, fill_value=...) followed by a stack, which materializes 2-3 intermediate copies of the padded result (n_groups x max_group_size x nterm) and goes through pandas MultiIndex machinery sized by the number of elements. Instead, factorize the groups and scatter coeffs/vars directly into the preallocated padded result arrays; constants are group-summed with np.add.at. Peak memory drops to input + result (the minimum for the padded layout) and the grouping itself gets considerably faster. The result is unchanged: same dims, coords, term ordering and padding. The unstack-based implementation is kept as _sum_by_unstack and still used for chunked (dask-backed) data, which cannot be scattered into numpy arrays. NaN group labels now raise an informative ValueError instead of failing inside unstack. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Merging this PR will improve performance by ×2.1
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ⚡ | Memory | test_to_lp[nodal_balance-severity=100] |
17.9 MB | 6 MB | ×3 |
| ⚡ | Memory | test_to_lp[nodal_balance-severity=50] |
9.2 MB | 3.1 MB | ×3 |
| ⚡ | Memory | test_to_lp[nodal_balance-severity=0] |
385.3 KB | 135.3 KB | ×2.8 |
| ⚡ | Memory | test_build[nodal_balance-severity=100] |
32 MB | 12.8 MB | ×2.5 |
| ⚡ | Memory | test_build[nodal_balance-severity=50] |
16.8 MB | 7 MB | ×2.4 |
| ⚡ | Memory | test_to_solver[highs-nodal_balance-severity=100] |
24.9 MB | 13.3 MB | +87.47% |
| ⚡ | Memory | test_to_solver[gurobi-nodal_balance-severity=100] |
25.1 MB | 13.5 MB | +86.1% |
| ⚡ | Memory | test_to_solver[highs-nodal_balance-severity=50] |
12.9 MB | 7.1 MB | +81.68% |
| ⚡ | Memory | test_to_solver[gurobi-nodal_balance-severity=50] |
13.1 MB | 7.3 MB | +79.32% |
| ⚡ | Memory | test_build[nodal_balance-severity=0] |
1.4 MB | 1.2 MB | +19.65% |
Tip
Curious why this is faster? Comment @codspeedbot explain why this is faster on this PR, or directly use the CodSpeed MCP with your agent.
Comparing fluxopt:perf/groupby-sum-scatter-upstream (81d7c23) with master (fe798b1)
Footnotes
-
138 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩
Add a test for grouping over an empty group dimension, which the scatter fast path handles cleanly but the unstack fallback cannot. Trim comments that duplicated the helper docstrings.
d5d1c5a to
7598180
Compare
FabianHofmann
left a comment
There was a problem hiding this comment.
@FBumann this is super strong. I added release notes and cleaned up some comments. feel free to merge.
coroa
left a comment
There was a problem hiding this comment.
Cool, need to wrap my head around it
| scattered into preallocated numpy arrays) and no coordinates tied to | ||
| the grouped dimension besides its own index. Everything else falls |
There was a problem hiding this comment.
i don't understand the second condition. What means no coordinates tied to the grouped dimension besides its own index?
There was a problem hiding this comment.
ie. can we please improve the docstring
There was a problem hiding this comment.
My thinking is, that if the second condition does basically not happen, then we could wrap _can_sum_by_scatter into an xr.apply_ufunc, similar to https://github.com/PyPSA/atlite/blob/966380825089e5f913eb3dd45141abbff4fc075d/atlite/gis.py#L1077-L1089 , and get rid of the groupby path entirely.
|
Alternatively, we might think about deprecating the dask layer. |
|
As im working on the v1 convention and the multiindex stuff, this might even become obsolete indirection. I need to think about it |
Relax the groupby-sum scatter gate to a pure numpy/dask check: auxiliary coordinates on the grouped dimension no longer force the slow unstack path. Summing over groups collapses that dimension, so both kernels drop every coordinate tied to it — the scatter result is identical, just cheaper. The unstack kernel now serves only chunked (dask) data, and a debug log records when that fallback is taken. Inline the now-trivial predicate into the dispatch and consolidate the kernel tests into a TestGroupbySumScatterKernel class: a one-line case table over a shared fixture, with added coverage for combined structures, auxiliary coords, and a MultiIndex grouped dimension. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Note
The following content was generated by AI.
What this does
The fast path of
LinearExpression.groupby(...).sum()previously didds.unstack(group_dim, fill_value=...)followed by a stack. Thatmaterializes 2–3 intermediate copies of the padded result
(
n_groups × max_group_size × nterm) and routes through pandasMultiIndexmachinery sized by the number of elements.This change factorizes the groups and scatters
coeffs/varsdirectlyinto preallocated padded result arrays; constants are group-summed with
np.add.at. Peak memory drops to input + result (the minimum for thepadded layout), and the grouping itself gets considerably faster. The
result is unchanged: same dims, coords, term ordering and padding.
The unstack-based implementation is kept as
_sum_by_unstackand isstill used for chunked (dask-backed) data, which cannot be scattered into
numpy arrays.
NaNgroup labels now raise an informativeValueErrorinstead of failing inside
unstack.Notes
linopy/expressions.pyand adds tests intest/test_linear_expression.py(124 new lines)._sum_by_unstackretains the currentmasternames_to_droplogic(drop every coordinate aligned to
group_dim), so the slow path keepsthe existing behavior.
Verification
pytest test/test_linear_expression.py→ 309 passed-k "group or sum or scatter or unstack") → 73 passedruff check,ruff format --check,mypy linopy/expressions.py→ clean