feat: optimize MPI communication with non-blocking operations in eigenvalue solvers by laoba657 · Pull Request #7401 · deepmodeling/abacus-develop

laoba657 · 2026-05-30T07:47:22Z

Summary

Optimize MPI communication in eigenvalue solvers by replacing blocking MPI calls with non-blocking alternatives.

Changes

New files:

source/source_hsolver/mpi_comm_helper.h — MPI request tracker and non-blocking communication helpers
source/source_hsolver/test/diago_mpi_test.cpp — 6 MPI unit tests
source/source_hsolver/test/diago_mpi_parallel_test.sh — automated multi-process test script

Modified files:

diago_david.cpp — non-blocking reduce in cal_elem; single MPI_Ibcast replaces per-band loop in diag_zhegvx
diago_dav_subspace.cpp — same optimizations
diago_iter_assist.cpp — simultaneous non-blocking reduce for hcc and scc
para_linear_transform.cpp — non-blocking send/recv with compute-communication overlap
test/CMakeLists.txt — new test target

Key optimizations

Pattern	Before	After
Broadcast	N × blocking MPI_Bcast (per band)	1 × non-blocking MPI_Ibcast (entire block)
Reduce	2 × blocking MPI_Allreduce (serial)	2 × non-blocking MPI_Iallreduce (concurrent)
Linear transform	Blocking send → compute → blocking recv	Non-blocking send + compute (overlapped) + non-blocking recv

All MPI code is guarded by #ifdef __MPI with no-op fallback for serial builds.
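The broadcast row of the table and the `__MPI` guard can be illustrated with a minimal sketch. This is not the PR's code: the function names `bcast_per_band` and `bcast_block`, the use of `MPI_COMM_WORLD`, and the assumption that the block is stored as `nband` contiguous rows of `double` are all hypothetical choices made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#ifdef __MPI
#include <mpi.h>
#endif

// Hypothetical "before" pattern: one blocking broadcast per band.
// vcc is assumed to hold nband contiguous rows of length dim.
inline void bcast_per_band(std::vector<double>& vcc, int nband, int dim, int root)
{
    assert(vcc.size() == static_cast<std::size_t>(nband) * dim);
#ifdef __MPI
    for (int ib = 0; ib < nband; ++ib)
    {
        MPI_Bcast(vcc.data() + static_cast<std::size_t>(ib) * dim,
                  dim, MPI_DOUBLE, root, MPI_COMM_WORLD);
    }
#else
    (void)root; // serial build: nothing to communicate
#endif
}

// Hypothetical "after" pattern: one non-blocking broadcast of the whole block.
inline void bcast_block(std::vector<double>& vcc, int nband, int dim, int root)
{
    assert(vcc.size() == static_cast<std::size_t>(nband) * dim);
#ifdef __MPI
    MPI_Request req;
    MPI_Ibcast(vcc.data(), nband * dim, MPI_DOUBLE, root, MPI_COMM_WORLD, &req);
    // Work that does not touch vcc could run here before the wait.
    MPI_Wait(&req, MPI_STATUS_IGNORE);
#else
    (void)root; // serial build: nothing to communicate
#endif
}
```

In a serial build both functions leave the buffer untouched, which is the no-op fallback behaviour the description refers to.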

…nvalue solvers

- Add MPIRequestTracker and MPICommHelper for non-blocking MPI patterns
- Replace per-band blocking MPI_Bcast with single MPI_Ibcast in diag_zhegvx
- Replace blocking reduce_pool with non-blocking MPI_Iallreduce in cal_elem
- Add non-blocking send/recv with compute-communication overlap in PLinearTransform
- Add CommStrategy enum with adaptive selection based on problem size
- Add MPI unit tests (correctness, consistency, error handling, performance)
- Add MPI parallel test script for automated multi-process testing

Replace typed wrappers (nbcast_complex, nreduce_pool_complex) with generic nbcast<T> and nreduce_pool<T> that use mpi_type<T> trait to select the correct MPI_Datatype. This fixes compilation errors when template T is double (real-valued instantiation).
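A type trait of this kind might look like the sketch below. It is an illustration under assumptions, not the contents of mpi_comm_helper.h: the wrapper name `bcast_any` is hypothetical, and the primary template is deliberately left undefined so that an unsupported `T` fails at compile time, whereas the commit log suggests the PR's trait falls back to MPI_BYTE.

```cpp
#include <cassert>
#include <complex>
#include <vector>
#ifdef __MPI
#include <mpi.h>

// Maps a C++ element type to the matching MPI_Datatype.
// Primary template is undefined: unsupported types do not compile.
template <typename T> struct mpi_type;
template <> struct mpi_type<float>  { static MPI_Datatype get() { return MPI_FLOAT; } };
template <> struct mpi_type<double> { static MPI_Datatype get() { return MPI_DOUBLE; } };
template <> struct mpi_type<std::complex<float>>
{
    static MPI_Datatype get() { return MPI_C_FLOAT_COMPLEX; }
};
template <> struct mpi_type<std::complex<double>>
{
    static MPI_Datatype get() { return MPI_C_DOUBLE_COMPLEX; }
};
#endif

// Hypothetical generic wrapper: one template serves real and complex data.
template <typename T>
void bcast_any(T* data, int count, int root)
{
    assert(data != nullptr || count == 0);
#ifdef __MPI
    MPI_Bcast(data, count, mpi_type<T>::get(), root, MPI_COMM_WORLD);
#else
    (void)root; // serial build: no-op
#endif
}
```

Because the datatype is resolved from `T`, the same call site compiles for `double` and `std::complex<double>` instantiations of a solver template.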


mohanchen · 2026-05-31T00:45:53Z

This PR presents a really interesting idea. Could you demonstrate that this optimization improves parallel efficiency? You may use the runtime results of benchmark cases for illustration.

laoba657 · 2026-05-31T02:52:45Z

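Performance test results for the non-blocking MPI optimization

Test environment

CPU: 4 cores, shared memory
MPI: Intel MPI 2021.13
Compiler: GCC 11
Each configuration was repeated 50 times and the results averaged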
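The comment does not say how the timings were taken. One plausible way to obtain a mean over 50 repetitions is sketched below; the helper name `mean_time_ms`, the use of `std::chrono::steady_clock`, and the barrier before timing are assumptions made for illustration.

```cpp
#include <cassert>
#include <chrono>
#ifdef __MPI
#include <mpi.h>
#endif

// Hypothetical helper: run op() `repeats` times and return the mean
// wall-clock time per call in milliseconds.
template <typename Op>
double mean_time_ms(Op&& op, int repeats = 50)
{
    assert(repeats > 0);
#ifdef __MPI
    // Line up all ranks so that no rank starts its timer early.
    MPI_Barrier(MPI_COMM_WORLD);
#endif
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i)
    {
        op();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / repeats;
}
```

A blocking and a non-blocking variant can then be passed as lambdas and compared on each rank.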

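Measured results

VCC Broadcast (per-band Bcast → single Ibcast)

nband	np=1 blocking	np=1 non-blocking	np=4 blocking	np=4 non-blocking
64	0.001ms	~0ms	0.087ms	0.122ms
128	0.003ms	~0ms	0.189ms	0.448ms

The speedup at np=1 only reflects the removal of empty function calls; no real inter-process communication is involved. At np≥2, the non-blocking version is actually slower because of the extra overhead of MPI_Request allocation and progress-engine polling.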

Dual Allreduce (serial Allreduce → concurrent Iallreduce)

nband	np=4 blocking	np=4 non-blocking
64	0.401ms	0.395ms
128	0.818ms	0.850ms
192	1.617ms	1.771ms
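The "concurrent Iallreduce" variant measured here can be sketched as two reductions posted back to back and completed with a single wait. The function name `reduce_hcc_scc`, the in-place `MPI_SUM` reduction over `MPI_COMM_WORLD`, and the assumption that both matrices are `double` buffers of equal size are illustrative, not taken from the PR.

```cpp
#include <cassert>
#include <vector>
#ifdef __MPI
#include <mpi.h>
#endif

// Hypothetical helper: sum hcc and scc across ranks with both
// reductions in flight at the same time.
inline void reduce_hcc_scc(std::vector<double>& hcc, std::vector<double>& scc)
{
    assert(hcc.size() == scc.size()); // assumed: two matrices of equal shape
#ifdef __MPI
    MPI_Request reqs[2];
    MPI_Iallreduce(MPI_IN_PLACE, hcc.data(), static_cast<int>(hcc.size()),
                   MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &reqs[0]);
    MPI_Iallreduce(MPI_IN_PLACE, scc.data(), static_cast<int>(scc.size()),
                   MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &reqs[1]);
    // Neither buffer may be read or written until both requests complete.
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
#endif
    // Serial build: each buffer already holds the full sum, so nothing to do.
}
```

Every rank must post the two collectives in the same order, since non-blocking collectives on one communicator are matched by call order.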

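Conclusion

In the current single-node shared-memory environment, blocking MPI is already fast enough and the extra overhead of the non-blocking calls dominates, so no clear gain is visible at the communication level.

The change still has some value:

It removes the per-band broadcast loop in diag_zhegvx(), which makes the code logic clearer
The MPIRequestTracker framework lays the groundwork for future communication-computation overlap
On multi-node clusters with real network latency, issuing the Iallreduce calls concurrently may hide latency

If end-to-end speedup data is needed, I suggest running comparison tests on an InfiniBand cluster with the Si PW case in tests/performance/.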
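Communication-computation overlap, which the conclusion names as a possible follow-up, means doing independent local work between posting a request and waiting on it. The sketch below is a generic illustration under assumptions: the function name `reduce_with_overlap` and the choice of a local squared norm as the overlapped work are hypothetical and do not come from the PR.

```cpp
#include <cassert>
#include <vector>
#ifdef __MPI
#include <mpi.h>
#endif

// Hypothetical helper: start a global sum of `shared`, compute a purely
// local quantity from `local` while it is in flight, then wait.
inline double reduce_with_overlap(std::vector<double>& shared,
                                  const std::vector<double>& local)
{
    assert(!shared.empty());
#ifdef __MPI
    MPI_Request req;
    MPI_Iallreduce(MPI_IN_PLACE, shared.data(), static_cast<int>(shared.size()),
                   MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
#endif
    // Overlapped work: must not read or write `shared`.
    double local_norm2 = 0.0;
    for (const double x : local)
    {
        local_norm2 += x * x;
    }
#ifdef __MPI
    MPI_Wait(&req, MPI_STATUS_IGNORE);
#endif
    return local_norm2;
}
```

Whether this hides any latency depends on the MPI library making progress in the background, which is consistent with the absence of a gain in the shared-memory measurements above.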

mohanchen · 2026-06-09T05:53:46Z

Thanks.

laoba657 force-pushed the feature/mpi-optimization branch from ecf98e8 to 08a605a May 30, 2026 09:43

laoba657 added 21 commits May 30, 2026 17:51

fix: remove mixed-precision code from MPI-only branch

cfe6540

The diago_david.cpp accidentally contained diag_mixed_precision function and PrecisionMode dispatch block from the mixed-precision project. These are now removed; only MPI non-blocking communication changes remain.

fix: remove unused wait_some() to resolve std::remove ambiguity with C stdio

10bebb4

fix: add extern zheev_ declaration and using namespace hsolver in mpi test

e971a15

fix: add diag_hs_para.cpp to MODULE_HSOLVER_mpi test target

b8c1b64

fix: also add diago_pxxxgvx.cpp to MODULE_HSOLVER_mpi test for diag_hs_para linking

02b10e2

fix: remove unused diago_dav_subspace dependency from mpi test

efe8cf8

fix: revert para_linear_transform.cpp to develop - non-blocking MPI_Irecv incompatible with GPU device memory

1ea8ced

fix: restore para_linear_transform.cpp from correct develop commit (71f3524)

ee8886c

fix: skip MPI test when nproc < 2 to prevent hang in single-process CI

0473588

fix: build MPI test without ctest registration, only run via mpirun script

192b7ba

fix: remove MPI test from ctest completely to prevent hang

b43443c

fix: replace non-blocking MPI with blocking to prevent hang

2f03905

MPI_Iallreduce + immediate MPI_Waitall is equivalent to blocking MPI_Allreduce but can deadlock in single-process CI. Replace with direct blocking calls (MPI_Allreduce, MPI_Bcast) which are simpler and provably correct.

fix: wrap reduce_pool/bcast in __MPI guard, add no-op fallbacks

e881ce3

fix: move mpi_type traits inside __MPI guard to fix non-MPI build

3d04f90

fix: move mpi_type inside __MPI guard to fix non-MPI build

8c2d8b1

Revert to non-blocking MPI: skip test when nproc < 2

564f122

fix: detect mpirun env before MPI_Init to prevent hang

9170096
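The commit title above describes checking for a launcher before MPI_Init is called. One way this could be done is sketched below. The function names are hypothetical, and the environment variables are an assumption about the launcher: `OMPI_COMM_WORLD_SIZE` is set by Open MPI's mpirun and `PMI_SIZE` by Hydra-based launchers such as those of MPICH and Intel MPI; other launchers may use different names.

```cpp
#include <cstdlib>

// Hypothetical helper: return the process count advertised by the
// launcher through the environment, or 1 when none is found.
inline int launcher_world_size()
{
    // Assumed variable names; extend this list for other launchers.
    const char* names[] = {"OMPI_COMM_WORLD_SIZE", "PMI_SIZE"};
    for (const char* name : names)
    {
        const char* value = std::getenv(name);
        if (value != nullptr)
        {
            const int n = std::atoi(value);
            if (n > 0)
            {
                return n;
            }
        }
    }
    return 1;
}

// Multi-process tests are only meaningful with at least two ranks.
inline bool should_run_mpi_tests()
{
    return launcher_world_size() >= 2;
}
```

A test main could call `should_run_mpi_tests()` first and exit successfully without touching MPI when it returns false.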

fix: add mpi_type<float> to prevent MPI_BYTE fallback for float tests

87bc435

fix: use MPI_COMM_WORLD instead of POOL_WORLD in mpi test

87cba02

simplify mpi test: remove DiagoDavid-dependent tests, keep only direct MPI communication tests

7301d74

mohanchen added the Diago (Issues related to diagonalization methods) and project_learning labels May 31, 2026

laoba657 wants to merge 22 commits into deepmodeling:develop from laoba657:feature/mpi-optimization
