feat(lcao): auto-recommend nb2d from ScaLAPACK 2D block-cyclic load balance#7480
Open
A-006 wants to merge 8 commits into
Open
feat(lcao): auto-recommend nb2d from ScaLAPACK 2D block-cyclic load balance#7480A-006 wants to merge 8 commits into
A-006 wants to merge 8 commits into
Conversation
…alance Derive a healthy nb2d window from the per-pool process grid (q) and matrix dim N=nlocal: nb_lo=min(16,N/(2q)) <= nb2d <= nb_hi=floor(N/(2q)), recommended nb2d=min(64,nb_hi). Avoids both the large-block load-imbalance cliff and the nb=1 over-scatter end of the U-shaped time-vs-nb2d curve. Prints the recommendation in the run-info banner; energy is unaffected (performance-only). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR adds runtime guidance to improve parallel performance for LCAO calculations by detecting problematic ScaLAPACK 2D block sizes (nb2d) and recommending an optimal kpar value for k-point pooling.
Changes:
- Add a ScaLAPACK
nb2dload-balance check (with optional auto-adjust whennb2d==0) during LCAO basis initialization. - Print an advisory recommendation for
kparbased on(NPROC, nks)to better parallelize the k-point loop.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| source/source_lcao/LCAO_init_basis.cpp | Adds nb2d “cliff” detection and optional auto-tuning for ScaLAPACK GVX diagonalization. |
| source/source_io/module_output/print_info.cpp | Adds a kpar recommendation message based on divisors of NPROC bounded by nks. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- LCAO_init_basis: skip the nb2d load-balance heuristic when NPROC is not divisible by kpar (pool size ill-defined) instead of using a truncated np_total/kpar that yields a wrong process-grid estimate. - print_info: start the kpar divisor scan at min(nks, NPROC) so out-of-range candidates are skipped rather than scanning all of [1, NPROC]. - print_info: drop the ModuleBase::WARNING for a non-optimal kpar -- it is advisory only and the truly problematic cases (nks % kpar != 0, kpar > nks) are already warned about above. Keep the running-log advisory line. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Collaborator
Author
|
Addressed the Copilot review (commit 92e2b1e):
|
The nb2d auto-adjust ignored the return value of set_nloc_wfc_Eij. When the recommended nb_opt (derived from matrix dim + grid only) makes ceil(nbands/nb_opt) smaller than the process-grid width, set_nloc_wfc_Eij returns early without updating ncol_bands/nrow_bands/nloc_wfc, leaving pv half-updated (new nb/nrow from pv.set, stale band sizes). This segfaulted the gamma-only scalapack_gvx cases (02_NAO_Gamma: scf_bsse, scf_force_stress, scf_out_hk, scf_out_hxc; N=26, nbands=6, 2x2 grid -> nb_opt=6 -> block=1<dim1). Validate the new block size like the initial distribution and revert to the previously validated nb2d on failure. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…pv grid
The nb2d load-balance check estimated the per-pool process count as
pv.dim0 * pv.dim1 / kpar. But pv is built on DIAG_WORLD (size = diago_proc),
whereas hsolver's parakSolve re-splits MPI_COMM_WORLD into kpar pools of
NPROC/kpar ranks each (Parallel_K2D -> divide_mpi_groups(NPROC, kpar)),
independent of diago_proc. So whenever diago_proc < NPROC the old expression
gave diago_proc/kpar != NPROC/kpar -- a wrong (p_row, p_col) grid and hence a
wrong nb_hi/nb_lo window and auto-adjust decision.
Take the pool size from the actual source instead:
- kpar == 1: the diagonalization runs on this ParaV grid -> pv.dim0 * pv.dim1.
- kpar > 1: NPROC / kpar (the parakSolve pool size); skip when NPROC % kpar
!= 0 since the pools are then uneven and have no single size.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
For nspin==4 the basis carries 2-component spinors that must stay paired within a ScaLAPACK block, so nb2d must be a multiple of 2 (the autoset and fallback already use 2, not 1, for this reason). The nb2d recommendation nb_opt = min(64, floor(N/2q)) could be odd (e.g. N=108, q=2 -> 27), which broke the spinor blocking and segfaulted the nspin==4 diagonalization (03_NAO_multik: scf_u_spin4, scf_out_dos_spin4, scf_out_mul_spin4). Snap the whole nb2d window (nb_lo/nb_hi/nb_opt) to the nspin granularity. Verified locally in the abacus-gnu Docker image: 02_NAO_Gamma, 03_NAO_multik and 05_rtTDDFT integration suites all pass 100%. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Trim the verbose explanatory comments in the nb2d block to the essential rationale; no code change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Remove the kpar recommendation added earlier; this PR now scopes to the nb2d auto-recommendation in LCAO_init_basis.cpp only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Match the merge-base so this PR no longer touches print_info.cpp at all; develop's own later edits to this file are left intact for the merge. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Automatically recommend (and, when
nb2dis auto, apply) a healthynb2d(ScaLAPACK 2D block-cyclic block size) for LCAO +
ks_solver=scalapack_gvx,derived from the per-pool process grid and the matrix dimension.
Why
The time-vs-
nb2dcurve is U-shaped: large blocks hit a load-imbalance cliff,while
nb2d=1over-scatters the matrix. Users currently have to tunenb2dbyhand. This derives a sensible window automatically.
How
From the per-pool process grid long edge
qand matrix dimN = nlocal:nb_hi = floor(N / (2q))(keep >= 2 blocks per process)nb_lo = min(16, nb_hi)nb2d = min(64, nb_hi), constrained to[nb_lo, nb_hi]Behavior:
nb2d(!= 0): value is kept; only a warning is emitted if itis outside the window.
nb2d(== 0):pvis re-distributed to the recommended value(
pv.nbfeeds both thekpar==1path and the per-poolParallel_K2D).Scope / why only
scalapack_gvxnb2dsets the H/S block-cyclic storage layout for all LCAO solvers, but theload-balance cliff is specific to ScaLAPACK's dense
p?sygvx. genelpa/elpaalso diagonalize on the 2D grid but are robust to the block size; lapack /
cusolver / pexsi do not run this distributed dense diagonalization. So the
auto-adjust is restricted to
scalapack_gvx; the pre-existingnb2dautoset(the
500/1000thresholds) is left untouched and still serves every solver.Correctness fixes (after initial CI)
nb_optis validated like the initialdistribution (
set_nloc_wfc_Eij); if it is incompatible with the band/gridlayout (
ceil(nbands/nb_opt) < grid width), revert to the previouslyvalidated value instead of leaving
pvhalf-updated (fixes gamma-onlysegfaults, e.g.
02_NAO_Gamma/scf_bsse).(
NPROC/kpar), notpv.dim0*pv.dim1/kpar.nspin==4: snap the whole window to a multiple of 2 so the2-component spinor blocking is never broken by an odd
nb2d(fixes03_NAO_multikspin4 segfaults).Impact
scalapack_gvxsolvers, nor when theuser sets
nb2dexplicitly.Verified
Built in the
abacus-gnuimage;02_NAO_Gamma,03_NAO_multik,05_rtTDDFTintegration suites pass 100%.
Files
source/source_lcao/LCAO_init_basis.cpp