Skip to content

feat(lcao): auto-recommend nb2d from ScaLAPACK 2D block-cyclic load balance#7480

Open
A-006 wants to merge 8 commits into
deepmodeling:developfrom
A-006:feature/auto-recommend-nb2d
Open

feat(lcao): auto-recommend nb2d from ScaLAPACK 2D block-cyclic load balance#7480
A-006 wants to merge 8 commits into
deepmodeling:developfrom
A-006:feature/auto-recommend-nb2d

Conversation

@A-006

@A-006 A-006 commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator

What

Automatically recommend (and, when nb2d is auto, apply) a healthy nb2d
(ScaLAPACK 2D block-cyclic block size) for LCAO + ks_solver=scalapack_gvx,
derived from the per-pool process grid and the matrix dimension.

Why

The time-vs-nb2d curve is U-shaped: large blocks hit a load-imbalance cliff,
while nb2d=1 over-scatters the matrix. Users currently have to tune nb2d by
hand. This derives a sensible window automatically.

How

From the per-pool process grid long edge q and matrix dim N = nlocal:

  • nb_hi = floor(N / (2q)) (keep >= 2 blocks per process)
  • nb_lo = min(16, nb_hi)
  • recommended nb2d = min(64, nb_hi), constrained to [nb_lo, nb_hi]

Behavior:

  • user-set nb2d (!= 0): value is kept; only a warning is emitted if it
    is outside the window.
  • auto nb2d (== 0): pv is re-distributed to the recommended value
    (pv.nb feeds both the kpar==1 path and the per-pool Parallel_K2D).

Scope / why only scalapack_gvx

nb2d sets the H/S block-cyclic storage layout for all LCAO solvers, but the
load-balance cliff is specific to ScaLAPACK's dense p?sygvx. genelpa/elpa
also diagonalize on the 2D grid but are robust to the block size; lapack /
cusolver / pexsi do not run this distributed dense diagonalization. So the
auto-adjust is restricted to scalapack_gvx; the pre-existing nb2d autoset
(the 500/1000 thresholds) is left untouched and still serves every solver.

Correctness fixes (after initial CI)

  • Validate & revert: the recommended nb_opt is validated like the initial
    distribution (set_nloc_wfc_Eij); if it is incompatible with the band/grid
    layout (ceil(nbands/nb_opt) < grid width), revert to the previously
    validated value instead of leaving pv half-updated (fixes gamma-only
    segfaults, e.g. 02_NAO_Gamma/scf_bsse).
  • Per-pool grid: derive the per-pool size from the real pool communicator
    (NPROC/kpar), not pv.dim0*pv.dim1/kpar.
  • Noncollinear nspin==4: snap the whole window to a multiple of 2 so the
    2-component spinor blocking is never broken by an odd nb2d (fixes
    03_NAO_multik spin4 segfaults).

Impact

  • Performance-only. Energy/results are unaffected (bit-identical).
  • No change to default behavior for non-scalapack_gvx solvers, nor when the
    user sets nb2d explicitly.

Verified

Built in the abacus-gnu image; 02_NAO_Gamma, 03_NAO_multik, 05_rtTDDFT
integration suites pass 100%.

Files

  • source/source_lcao/LCAO_init_basis.cpp

…alance

Derive a healthy nb2d window from the per-pool process grid (q) and matrix dim N=nlocal: nb_lo=min(16,N/(2q)) <= nb2d <= nb_hi=floor(N/(2q)), recommended nb2d=min(64,nb_hi). Avoids both the large-block load-imbalance cliff and the nb=1 over-scatter end of the U-shaped time-vs-nb2d curve. Prints the recommendation in the run-info banner; energy is unaffected (performance-only).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 17, 2026 04:43

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds runtime guidance to improve parallel performance for LCAO calculations by detecting problematic ScaLAPACK 2D block sizes (nb2d) and recommending an optimal kpar value for k-point pooling.

Changes:

  • Add a ScaLAPACK nb2d load-balance check (with optional auto-adjust when nb2d==0) during LCAO basis initialization.
  • Print an advisory recommendation for kpar based on (NPROC, nks) to better parallelize the k-point loop.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
source/source_lcao/LCAO_init_basis.cpp Adds nb2d “cliff” detection and optional auto-tuning for ScaLAPACK GVX diagonalization.
source/source_io/module_output/print_info.cpp Adds a kpar recommendation message based on divisors of NPROC bounded by nks.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread source/source_lcao/LCAO_init_basis.cpp
Comment thread source/source_lcao/LCAO_init_basis.cpp Outdated
Comment thread source/source_io/module_output/print_info.cpp Outdated
Comment thread source/source_io/module_output/print_info.cpp Outdated
- LCAO_init_basis: skip the nb2d load-balance heuristic when NPROC is not
  divisible by kpar (pool size ill-defined) instead of using a truncated
  np_total/kpar that yields a wrong process-grid estimate.
- print_info: start the kpar divisor scan at min(nks, NPROC) so out-of-range
  candidates are skipped rather than scanning all of [1, NPROC].
- print_info: drop the ModuleBase::WARNING for a non-optimal kpar -- it is
  advisory only and the truly problematic cases (nks % kpar != 0, kpar > nks)
  are already warned about above. Keep the running-log advisory line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@A-006

A-006 commented Jun 17, 2026

Copy link
Copy Markdown
Collaborator Author

Addressed the Copilot review (commit 92e2b1e):

  • np_pool integer truncation — the nb2d load-balance heuristic is now skipped when NPROC % kpar != 0 (pool size ill-defined) instead of estimating the grid from a truncated np_total / kpar.
  • O(NPROC) divisor scan — the kpar scan now starts at min(nks, NPROC) and walks down, skipping the out-of-range candidates.
  • Noisy WARNING — downgraded to the running-log advisory line only; the genuinely problematic cases (nks % kpar != 0, kpar > nks) are already warned about above.
  • try_nb capture — replied inline: false positive (try_nb is not read after that block; pv is updated in place).

liutao and others added 6 commits June 17, 2026 13:30
The nb2d auto-adjust ignored the return value of set_nloc_wfc_Eij. When the
recommended nb_opt (derived from matrix dim + grid only) makes
ceil(nbands/nb_opt) smaller than the process-grid width, set_nloc_wfc_Eij
returns early without updating ncol_bands/nrow_bands/nloc_wfc, leaving pv
half-updated (new nb/nrow from pv.set, stale band sizes). This segfaulted the
gamma-only scalapack_gvx cases (02_NAO_Gamma: scf_bsse, scf_force_stress,
scf_out_hk, scf_out_hxc; N=26, nbands=6, 2x2 grid -> nb_opt=6 -> block=1<dim1).

Validate the new block size like the initial distribution and revert to the
previously validated nb2d on failure.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…pv grid

The nb2d load-balance check estimated the per-pool process count as
pv.dim0 * pv.dim1 / kpar. But pv is built on DIAG_WORLD (size = diago_proc),
whereas hsolver's parakSolve re-splits MPI_COMM_WORLD into kpar pools of
NPROC/kpar ranks each (Parallel_K2D -> divide_mpi_groups(NPROC, kpar)),
independent of diago_proc. So whenever diago_proc < NPROC the old expression
gave diago_proc/kpar != NPROC/kpar -- a wrong (p_row, p_col) grid and hence a
wrong nb_hi/nb_lo window and auto-adjust decision.

Take the pool size from the actual source instead:
  - kpar == 1: the diagonalization runs on this ParaV grid -> pv.dim0 * pv.dim1.
  - kpar  > 1: NPROC / kpar (the parakSolve pool size); skip when NPROC % kpar
               != 0 since the pools are then uneven and have no single size.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
For nspin==4 the basis carries 2-component spinors that must stay paired
within a ScaLAPACK block, so nb2d must be a multiple of 2 (the autoset and
fallback already use 2, not 1, for this reason). The nb2d recommendation
nb_opt = min(64, floor(N/2q)) could be odd (e.g. N=108, q=2 -> 27), which
broke the spinor blocking and segfaulted the nspin==4 diagonalization
(03_NAO_multik: scf_u_spin4, scf_out_dos_spin4, scf_out_mul_spin4).

Snap the whole nb2d window (nb_lo/nb_hi/nb_opt) to the nspin granularity.

Verified locally in the abacus-gnu Docker image: 02_NAO_Gamma, 03_NAO_multik
and 05_rtTDDFT integration suites all pass 100%.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Trim the verbose explanatory comments in the nb2d block to the essential
rationale; no code change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Remove the kpar recommendation added earlier; this PR now scopes to the
nb2d auto-recommendation in LCAO_init_basis.cpp only.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Match the merge-base so this PR no longer touches print_info.cpp at all;
develop's own later edits to this file are left intact for the merge.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants