feat: optimize MPI communication with non-blocking operations in eigenvalue solvers #7401

Open
laoba657 wants to merge 22 commits into deepmodeling:develop from laoba657:feature/mpi-optimization

@laoba657

Summary

Optimize MPI communication in eigenvalue solvers by replacing blocking MPI calls with non-blocking alternatives.

Changes

New files:

  • source/source_hsolver/mpi_comm_helper.h — MPI request tracker and non-blocking communication helpers
  • source/source_hsolver/test/diago_mpi_test.cpp — 6 MPI unit tests
  • source/source_hsolver/test/diago_mpi_parallel_test.sh — automated multi-process test script

Modified files:

  • diago_david.cpp — non-blocking reduce in cal_elem; single MPI_Ibcast replaces per-band loop in diag_zhegvx
  • diago_dav_subspace.cpp — same optimizations
  • diago_iter_assist.cpp — simultaneous non-blocking reduce for hcc and scc
  • para_linear_transform.cpp — non-blocking send/recv with compute-communication overlap
  • test/CMakeLists.txt — new test target

Key optimizations

| Pattern | Before | After |
|---|---|---|
| Broadcast | N × blocking MPI_Bcast (per band) | 1 × non-blocking MPI_Ibcast (entire block) |
| Reduce | 2 × blocking MPI_Allreduce (serial) | 2 × non-blocking MPI_Iallreduce (concurrent) |
| Linear transform | Blocking send → compute → blocking recv | Non-blocking send + compute (overlapped) + non-blocking recv |

All MPI code is guarded by #ifdef __MPI with no-op fallback for serial builds.
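
The broadcast row of the table and the `#ifdef __MPI` guard can be illustrated together. The following is a minimal sketch, not the code from this PR: the helper name `bcast_block`, its parameters, the contiguous `nbase * nband` layout, and the use of `MPI_COMM_WORLD` are all assumptions made for illustration.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>
#ifdef __MPI
#include <mpi.h>
#endif

// Hypothetical helper (not from the PR): broadcast nband eigenvectors of
// length nbase from rank `root`. The block is assumed to be contiguous in
// memory, which is what lets a single call cover all bands.
inline void bcast_block(std::vector<std::complex<double>>& vcc,
                        int nbase, int nband, int root = 0)
{
    assert(nbase >= 0 && nband >= 0);
    assert(vcc.size() >= static_cast<std::size_t>(nbase) * static_cast<std::size_t>(nband));
#ifdef __MPI
    // Before: one blocking call per band, e.g.
    //   for (int ib = 0; ib < nband; ++ib)
    //       MPI_Bcast(&vcc[ib * nbase], nbase, MPI_CXX_DOUBLE_COMPLEX, root, comm);
    // After: one non-blocking broadcast over the whole block.
    MPI_Request req;
    MPI_Ibcast(vcc.data(), nbase * nband, MPI_CXX_DOUBLE_COMPLEX,
               root, MPI_COMM_WORLD, &req);
    // Independent local work could run here; vcc must not be touched
    // until the wait below has returned.
    MPI_Wait(&req, MPI_STATUS_IGNORE);
#else
    // Serial build: there is nothing to communicate, so this is a no-op.
    (void)vcc;
    (void)root;
#endif
}
```

In a serial build the call compiles to a no-op and leaves `vcc` unchanged, so call sites need no guards of their own.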

…nvalue solvers

- Add MPIRequestTracker and MPICommHelper for non-blocking MPI patterns
- Replace per-band blocking MPI_Bcast with single MPI_Ibcast in diag_zhegvx
- Replace blocking reduce_pool with non-blocking MPI_Iallreduce in cal_elem
- Add non-blocking send/recv with compute-communication overlap in PLinearTransform
- Add CommStrategy enum with adaptive selection based on problem size
- Add MPI unit tests (correctness, consistency, error handling, performance)
- Add MPI parallel test script for automated multi-process testing
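
The commit message mentions a `CommStrategy` enum with adaptive selection based on problem size, but its definition is not shown in this thread. As a purely illustrative sketch, the enumerators, the function `choose_strategy`, and the 64 KiB cutoff below are assumptions, not values taken from the PR.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical enumerators; the PR's actual CommStrategy may differ.
enum class CommStrategy { Blocking, NonBlocking };

// Assumed cutoff, not a measured value: below this payload size the fixed
// cost of creating and completing a request is taken to outweigh any overlap.
constexpr std::size_t kNonBlockingMinBytes = 64 * 1024;

// Pick a strategy from the payload size and the number of processes.
inline CommStrategy choose_strategy(std::size_t n_elements,
                                    std::size_t elem_bytes,
                                    int nproc)
{
    assert(nproc >= 1);
    // With a single process there is no communication to overlap.
    if (nproc <= 1)
    {
        return CommStrategy::Blocking;
    }
    const std::size_t bytes = n_elements * elem_bytes;
    return bytes >= kNonBlockingMinBytes ? CommStrategy::NonBlocking
                                         : CommStrategy::Blocking;
}
```

A real cutoff would have to come from measurements on the target machine rather than from a fixed constant.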
laoba657 force-pushed the feature/mpi-optimization branch from ecf98e8 to 08a605a on May 30, 2026 09:43
laoba657 added 21 commits May 30, 2026 17:51
Replace typed wrappers (nbcast_complex, nreduce_pool_complex) with
generic nbcast<T> and nreduce_pool<T> that use the mpi_type<T> trait
to select the correct MPI_Datatype. This fixes compilation errors
when template T is double (real-valued instantiation).

diago_david.cpp accidentally contained the diag_mixed_precision
function and the PrecisionMode dispatch block from the mixed-precision
project. These are now removed; only the MPI non-blocking communication
changes remain.

MPI_Iallreduce + immediate MPI_Waitall is equivalent to blocking
MPI_Allreduce but can deadlock in single-process CI. Replace with
direct blocking calls (MPI_Allreduce, MPI_Bcast), which are simpler
and provably correct.
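
The `mpi_type<T>` trait named in the commit messages above is not shown in this thread. One common way to write such a trait is sketched below; the `get()` member, the wrapper name `nbcast_sketch`, its signature, and the use of `MPI_COMM_WORLD` are assumptions and may not match the PR's `nbcast<T>`.

```cpp
#include <cassert>
#include <complex>
#ifdef __MPI
#include <mpi.h>

// Primary template is left undefined on purpose: instantiating the trait
// with an unsupported element type fails at compile time.
template <typename T>
struct mpi_type;

template <>
struct mpi_type<double>
{
    static MPI_Datatype get() { return MPI_DOUBLE; }
};

template <>
struct mpi_type<std::complex<double>>
{
    static MPI_Datatype get() { return MPI_CXX_DOUBLE_COMPLEX; }
};
#endif

// Hypothetical generic broadcast: one template serves both the real and
// the complex instantiation instead of separate typed wrappers.
template <typename T>
inline void nbcast_sketch(T* data, int count, int root)
{
    assert(data != nullptr && count >= 0);
#ifdef __MPI
    MPI_Request req;
    MPI_Ibcast(data, count, mpi_type<T>::get(), root, MPI_COMM_WORLD, &req);
    MPI_Wait(&req, MPI_STATUS_IGNORE);
#else
    // Serial build: no-op fallback.
    (void)data;
    (void)count;
    (void)root;
#endif
}
```

Because the datatype is looked up from `T`, the `double` instantiation no longer passes a complex datatype by mistake, which is the kind of error the commit describes.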
@mohanchen

Collaborator

This PR presents a really interesting idea. Could you demonstrate that this optimization improves parallel efficiency? You may use the runtime results of benchmark cases for illustration.

mohanchen added the Diago (Issues related to diagonalization methods) and project_learning labels May 31, 2026
@laoba657

laoba657 commented May 31, 2026

Author

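Performance results for the non-blocking MPI optimization

Test environment

  • CPU: 4 cores, shared memory
  • MPI: Intel MPI 2021.13
  • Compiler: GCC 11
  • Each configuration was run 50 times and the results averaged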
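
The timing harness itself is not included in this comment. A minimal sketch of the "run 50 times and average" procedure, assuming a hypothetical helper `mean_time_ms` built on `std::chrono`, could look like this:

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper (not from the PR): call `op` `repeats` times and
// return the mean wall-clock time per call in milliseconds.
inline double mean_time_ms(const std::function<void()>& op, int repeats = 50)
{
    assert(repeats > 0);
    using clock = std::chrono::steady_clock;
    double total_ms = 0.0;
    for (int i = 0; i < repeats; ++i)
    {
        const auto t0 = clock::now();
        op();
        const auto t1 = clock::now();
        total_ms += std::chrono::duration<double, std::milli>(t1 - t0).count();
    }
    return total_ms / repeats;
}
```

In an MPI run it is usual to call `MPI_Barrier` before each timed region so that all ranks start together; that step is omitted here to keep the sketch free of MPI dependencies.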

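Measured results

VCC broadcast (per-band Bcast → single Ibcast)

| nband | np=1 blocking | np=1 non-blocking | np=4 blocking | np=4 non-blocking |
|---|---|---|---|---|
| 64 | 0.001 ms | ~0 ms | 0.087 ms | 0.122 ms |
| 128 | 0.003 ms | ~0 ms | 0.189 ms | 0.448 ms |

The speedup at np=1 only reflects the removal of empty function calls; no real inter-process communication is involved. For np≥2 the non-blocking version is actually slower, because of the extra overhead of MPI_Request allocation and progress-engine polling.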

Dual Allreduce (serial Allreduce → concurrent Iallreduce)

| nband | np=4 blocking | np=4 non-blocking |
|---|---|---|
| 64 | 0.401 ms | 0.395 ms |
| 128 | 0.818 ms | 0.850 ms |
| 192 | 1.617 ms | 1.771 ms |
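
For reference, the "concurrent Iallreduce" variant measured above can be sketched as follows. This is an assumed shape, not the PR's code: the helper name `reduce_pair`, the real-valued `double` buffers, the equal-size assumption, and `MPI_COMM_WORLD` are all illustrative choices.

```cpp
#include <cassert>
#include <vector>
#ifdef __MPI
#include <mpi.h>
#endif

// Hypothetical helper: sum hcc and scc across all ranks in place, with
// both reductions in flight at the same time.
inline void reduce_pair(std::vector<double>& hcc, std::vector<double>& scc)
{
    // Assumption: both subspace matrices have the same number of elements.
    assert(hcc.size() == scc.size());
#ifdef __MPI
    MPI_Request reqs[2];
    // Serial variant would be two blocking MPI_Allreduce calls, one after
    // the other. Here both are started before either is completed.
    MPI_Iallreduce(MPI_IN_PLACE, hcc.data(), static_cast<int>(hcc.size()),
                   MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &reqs[0]);
    MPI_Iallreduce(MPI_IN_PLACE, scc.data(), static_cast<int>(scc.size()),
                   MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &reqs[1]);
    // Neither buffer may be read or written until both requests complete.
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
#else
    // Serial build: each rank already holds the full sum.
    (void)hcc;
    (void)scc;
#endif
}
```

Whether the two reductions actually progress concurrently depends on the MPI implementation, which is consistent with the mixed timings in the table.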

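Conclusion

In the current single-node shared-memory environment, blocking MPI is already fast enough and the extra overhead of the non-blocking calls dominates, so no clear gain was observed at the communication level.

The change still has some value:

  1. It removes the per-band broadcast loop in diag_zhegvx(), which makes the code logic clearer
  2. The MPIRequestTracker framework provides a basis for implementing compute-communication overlap later
  3. On multi-node clusters with real network latency, issuing the Iallreduce calls concurrently may hide latency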
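
The compute-communication overlap mentioned in the list above follows a start, compute, wait pattern. The sketch below is a hypothetical ring exchange written for illustration only; `overlapped_sum`, the equal block length on every rank, and `MPI_COMM_WORLD` are assumptions, and it does not use the PR's MPIRequestTracker.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>
#ifdef __MPI
#include <mpi.h>
#endif

// Hypothetical example: send `local` to the next rank, receive a block from
// the previous rank, and sum the local block while the transfer is pending.
// Returns the local sum plus the sum of the received block.
inline double overlapped_sum(const std::vector<double>& local,
                             std::vector<double>& recv_buf)
{
    // MPI element counts are ints, so the block must fit in one.
    assert(local.size() <= static_cast<std::size_t>(std::numeric_limits<int>::max()));
#ifdef __MPI
    int rank = 0;
    int size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    const int next = (rank + 1) % size;
    const int prev = (rank + size - 1) % size;
    // Assumption: every rank holds a block of the same length.
    recv_buf.resize(local.size());
    const int count = static_cast<int>(local.size());
    MPI_Request reqs[2];
    MPI_Irecv(recv_buf.data(), count, MPI_DOUBLE, prev, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(local.data(), count, MPI_DOUBLE, next, 0, MPI_COMM_WORLD, &reqs[1]);
#else
    // Serial build: there is no neighbour, so nothing is received.
    recv_buf.clear();
#endif
    // Local work that only reads `local`; it can run while the transfer
    // is in flight because recv_buf is not touched yet.
    const double local_sum = std::accumulate(local.begin(), local.end(), 0.0);
#ifdef __MPI
    // recv_buf is only valid after both requests have completed.
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
#endif
    return local_sum + std::accumulate(recv_buf.begin(), recv_buf.end(), 0.0);
}
```

The benefit depends on how much independent work fits between the start and the wait, and on whether the MPI library makes progress in the background, so a gain is not guaranteed.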

If end-to-end speedup data is needed, I suggest running a comparison on an InfiniBand cluster using the Si PW case in tests/performance/.

@mohanchen

Collaborator

Thanks.
