Refine complex buffer copies and add round-trip tests for module_pw by Aunixt · Pull Request #7412 · deepmodeling/abacus-develop

Aunixt · 2026-05-30T16:54:14Z

Reminder

Have you linked an issue with this pull request?
Have you added adequate unit tests and/or case tests for your pull request?
Have you noticed possible changes of behavior below or in the linked issue?
Have you explained the changes of codes in core modules of ESolver, HSolver, ElecState, Hamilt, Operator or Psi? (ignore if not applicable)

Linked Issue

Fix #...

Unit Tests and/or Case Tests for my changes

A unit test is added

What's changed?

This pull request refactors the copying of complex buffers in the plane-wave basis module to use new helper functions that improve performance and code clarity, especially with respect to vectorization and parallelization. It also adds comprehensive round-trip tests for complex-to-complex plane-wave transforms to ensure correctness. The most important changes are summarized below.

Performance and Code Quality Improvements

Introduced detail::copy_complex_buffer and detail::copy_complex_buffer_parallel helper functions in pw_gatherscatter.h to efficiently copy complex buffers, enabling better compiler vectorization and OpenMP parallelization. All manual loops copying complex arrays have been replaced with calls to these helpers.
Refactored all relevant methods in pw_gatherscatter.h, pw_transform.cpp, and pw_transform_k.cpp to use the new helper functions for copying complex buffers, replacing explicit for-loops and OpenMP pragmas with the new, more maintainable approach.
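
For readers without the diff at hand, here is a minimal sketch of what such a helper could look like. It is not the code from this PR: the namespace `sketch` and the exact signature are assumptions, and the body uses `std::copy_n` over the interleaved real/imag scalar stream, which is the direction a later commit in this thread describes.

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <cstddef>

namespace sketch { // hypothetical namespace, not ModulePW::detail

// Copy `count` complex values from src to dst by walking the underlying
// scalar stream (re0, im0, re1, im1, ...). The buffers must not overlap.
template <typename T>
void copy_complex_buffer(std::complex<T>* dst,
                         const std::complex<T>* src,
                         std::size_t count)
{
    assert(count == 0 || (dst != nullptr && src != nullptr));
    // The C++ standard guarantees that an array of std::complex<T> may be
    // accessed as an array of T with twice as many elements.
    T* d = reinterpret_cast<T*>(dst);
    const T* s = reinterpret_cast<const T*>(src);
    std::copy_n(s, 2 * count, d);
}

} // namespace sketch
```

A call such as `sketch::copy_complex_buffer(out, in, nrxx)` would then stand in for a hand-written `for` loop over `nrxx` elements; whether the compiler emits a `memmove` or a vectorized loop depends on the toolchain and flags.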

Testing and Validation

Added new round-trip tests for both PW_Basis and PW_Basis_K classes to verify that a reciprocal-to-real followed by a real-to-reciprocal transform accurately recovers the original complex data, ensuring the correctness of the new buffer copy implementation.
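
The idea behind a round-trip check can be illustrated without the ABACUS classes. The sketch below is an assumption-laden stand-in: it uses a naive O(n^2) DFT in place of `recip2real` / `real2recip`, and the names `dft` and `round_trip_error` are hypothetical. It only shows the shape of the assertion (transform, transform back, compare against the original within a tolerance).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

namespace sketch { // hypothetical namespace

using cplx = std::complex<double>;

// Naive unnormalized DFT. sign = +1 plays the role of reciprocal-to-real,
// sign = -1 the role of real-to-reciprocal.
inline std::vector<cplx> dft(const std::vector<cplx>& in, int sign)
{
    assert(sign == 1 || sign == -1);
    const std::size_t n = in.size();
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cplx> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        for (std::size_t j = 0; j < n; ++j) {
            const double angle = sign * two_pi * static_cast<double>((j * k) % n)
                                 / static_cast<double>(n);
            out[k] += in[j] * std::polar(1.0, angle);
        }
    }
    return out;
}

// Largest absolute deviation after a reciprocal -> real -> reciprocal trip,
// including the 1/n normalization of the backward step.
inline double round_trip_error(const std::vector<cplx>& recip)
{
    const std::vector<cplx> real = dft(recip, +1);
    const std::vector<cplx> back = dft(real, -1);
    const double n = static_cast<double>(recip.size());
    double max_err = 0.0;
    for (std::size_t i = 0; i < recip.size(); ++i) {
        max_err = std::max(max_err, std::abs(back[i] / n - recip[i]));
    }
    return max_err;
}

} // namespace sketch
```

A test built on this pattern would fill the reciprocal-space buffer with deterministic values and assert that `round_trip_error` stays below a tolerance suited to the floating-point type.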

These changes collectively improve the performance, maintainability, and reliability of the plane-wave basis transformation routines.

Any changes of core modules? (ignore if not applicable)

Example: I have added a new virtual function in the esolver base class in order to ...

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Refactors repeated complex-buffer copy loops in the plane-wave transform code into shared helper templates (detail::copy_complex_buffer and detail::copy_complex_buffer_parallel) that operate on the underlying scalar stream to aid vectorization. Adds round-trip serial unit tests for PW_Basis and PW_Basis_K.

Changes:

Introduced detail::copy_complex_buffer[_parallel] helpers in pw_gatherscatter.h using __restrict__ and ivdep hints over the interleaved real/imag scalar stream.
Replaced inline copy loops in pw_transform.cpp, pw_transform_k.cpp, and gather/scatter routines with calls to the new helpers.
Added ComplexTransformRoundTrip serial tests for PW_Basis and PW_Basis_K.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.


| File | Description |
| --- | --- |
| source/source_basis/module_pw/pw_gatherscatter.h | Adds shared copy helpers and refactors gather/scatter loops to use them. |
| source/source_basis/module_pw/pw_transform.cpp | Replaces manual OpenMP copy loops with `copy_complex_buffer_parallel`. |
| source/source_basis/module_pw/pw_transform_k.cpp | Same refactor applied to k-point variant. |
| source/source_basis/module_pw/test_serial/pw_basis_test.cpp | New round-trip test for `recip2real`/`real2recip`. |
| source/source_basis/module_pw/test_serial/pw_basis_k_test.cpp | New round-trip test for the k-point variant. |

mohanchen · 2026-05-31T01:05:59Z

This refactoring idea is quite interesting. Feel free to move forward with your implementation.

mohanchen · 2026-06-09T05:41:46Z

Reply to the review comments once you have resolved them or know how to answer them.

Qianruipku

Please provide a comparison of the execution time before and after optimization.

Aunixt · 2026-06-15T13:49:03Z

My original work on feat/simd targeted the data reordering/copy path around the module_pw FFT transforms, mainly the real/reciprocal buffer copies and gather/scatter helpers used by PW_Basis and PW_Basis_K. This overlaps strongly with commit f4af81009368cba317a2948114edef8c1eedf940, which has already landed on develop:

f4af81009 perf(pw_basis): optimize FFT data reordering with memcpy SIMD vectori... (#7432)

That commit is already included in develop, and it is also included in the current feat/simd branch after merging develop. It changed:

source/source_basis/module_pw/pw_gatherscatter.h
source/source_basis/module_pw/pw_transform.cpp
source/source_hamilt/operator.h

The module_pw part of that commit covers one of the same hot paths this PR originally targeted. As a result, the current comparison against latest develop already contains a similar round of pw_basis FFT data reordering / memcpy / SIMD-vectorization work.

Compared with f4af81009, this PR is more complete in coverage and engineering structure. The main differences are:

This PR introduces ModulePW::detail::copy_complex_buffer() as a shared helper for std::complex<T> buffer copies, instead of repeating raw memcpy(..., count * sizeof(std::complex<T>)) calls in multiple places.
It also adds copy_complex_buffer_parallel() for top-level transform buffer copies. The helper only chunks and parallelizes larger buffers; small buffers still use the regular helper to avoid extra parallel overhead. The benchmark below uses USE_OPENMP=OFF, so this potential OpenMP benefit is not included in the reported numbers.
The PR covers not only PW_Basis gather/scatter helpers, but also the nrxx-sized complex buffer copies in PW_Basis_K::real2recip() and PW_Basis_K::recip2real(). This PW_Basis_K path was not covered by f4af81009.
It adds round-trip correctness tests, so the PW_Basis / PW_Basis_K transform behavior and performance can be checked directly.
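
The threshold-and-chunk behaviour described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the merged implementation: the namespace `sketch`, the constants `kParallelThreshold` and `kChunkSize`, and their values are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <cstddef>

namespace sketch { // hypothetical namespace

// Hypothetical tuning constants; the PR's actual values are not quoted here.
constexpr std::size_t kParallelThreshold = 1u << 15; // elements
constexpr std::size_t kChunkSize = 1u << 12;         // elements per chunk

template <typename T>
void copy_complex_buffer_parallel(std::complex<T>* dst,
                                  const std::complex<T>* src,
                                  std::size_t count)
{
    assert(count == 0 || (dst != nullptr && src != nullptr));
    // Small buffers: a plain copy avoids the cost of starting a parallel region.
    if (count < kParallelThreshold) {
        std::copy_n(src, count, dst);
        return;
    }
    const std::ptrdiff_t nchunks
        = static_cast<std::ptrdiff_t>((count + kChunkSize - 1) / kChunkSize);
    // When the code is built without OpenMP the pragma is ignored and the
    // chunks are copied serially.
#pragma omp parallel for schedule(static)
    for (std::ptrdiff_t c = 0; c < nchunks; ++c) {
        const std::size_t begin = static_cast<std::size_t>(c) * kChunkSize;
        const std::size_t len = std::min(kChunkSize, count - begin);
        std::copy_n(src + begin, len, dst + begin);
    }
}

} // namespace sketch
```

With `USE_OPENMP=OFF`, as in the benchmark, a helper of this shape degenerates to a serial chunked copy, which is consistent with the remark that the reported numbers do not include any OpenMP benefit.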

Benchmark setup:

| Role | Revision | Notes |
| --- | --- | --- |
| Baseline | `develop@420f1ad00` | Same benchmark harness was added so both branches build the same test executable. |
| Optimized | `feat/simd@dba326d99` | Current PR branch, after merging latest `develop`. |

Environment and build options:

CPU: 13th Gen Intel(R) Core(TM) i5-13500H
OS: WSL2, Linux 6.6.87.2-microsoft-standard-WSL2
Compiler: GCC 13.3.0
CMake: Release, ENABLE_MPI=OFF, USE_OPENMP=OFF
Runtime: OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 taskset -c 0
Runs: develop / feat-simd alternated, 9 runs each, reporting median and IQR
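
For reference, median and IQR over a set of per-run timings can be computed as in the sketch below. The quantile convention (linear interpolation between order statistics) is an assumption; the comment above does not say which convention the harness used, and the names `quantile`, `median` and `iqr` are illustrative only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch { // hypothetical namespace

// Quantile with linear interpolation between neighbouring order statistics.
// This convention is an assumption, not taken from the benchmark harness.
inline double quantile(std::vector<double> samples, double q)
{
    assert(!samples.empty() && q >= 0.0 && q <= 1.0);
    std::sort(samples.begin(), samples.end());
    const double pos = q * static_cast<double>(samples.size() - 1);
    const std::size_t lo = static_cast<std::size_t>(pos);
    const std::size_t hi = std::min(lo + 1, samples.size() - 1);
    const double frac = pos - static_cast<double>(lo);
    return samples[lo] + frac * (samples[hi] - samples[lo]);
}

inline double median(const std::vector<double>& samples)
{
    return quantile(samples, 0.5);
}

// Interquartile range: spread of the middle half of the runs.
inline double iqr(const std::vector<double>& samples)
{
    return quantile(samples, 0.75) - quantile(samples, 0.25);
}

} // namespace sketch
```

With 9 runs per branch, the median is simply the fifth-smallest timing, which makes it robust against one or two runs disturbed by scheduling noise.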

Results are in ms/op:

| Benchmark | develop median | feat/simd median | speedup | Change |
| --- | --- | --- | --- | --- |
| `PW_Basis.medium.roundtrip` | 0.006101 | 0.006746 | 0.904x | 10.6% slower |
| `PW_Basis.large.roundtrip` | 0.010797 | 0.012378 | 0.872x | 14.6% slower |
| `PW_Basis.xlarge.roundtrip` | 3.192069 | 3.086157 | 1.034x | 3.3% faster |
| `PW_Basis_K.medium.roundtrip` | 0.004276 | 0.003731 | 1.146x | 12.7% faster |
| `PW_Basis_K.xlarge.roundtrip` | 3.165223 | 2.764272 | 1.145x | 12.7% faster |
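
The speedup and change columns appear to follow from the two medians as baseline divided by optimized, and as the relative change in time with respect to the baseline. The sketch below spells out that arithmetic; the function names are hypothetical and the formulas are inferred from the numbers in the table rather than quoted from the harness.

```cpp
#include <cassert>
#include <cmath>

namespace sketch { // hypothetical namespace

// Speedup > 1 means the optimized branch needs less time per operation.
inline double speedup(double baseline_ms, double optimized_ms)
{
    assert(baseline_ms > 0.0 && optimized_ms > 0.0);
    return baseline_ms / optimized_ms;
}

// Relative change in time per operation, in percent of the baseline.
// Negative values mean faster, positive values mean slower.
inline double percent_time_change(double baseline_ms, double optimized_ms)
{
    assert(baseline_ms > 0.0);
    return 100.0 * (optimized_ms - baseline_ms) / baseline_ms;
}

} // namespace sketch
```

For the `PW_Basis_K.medium.roundtrip` row, `speedup(0.004276, 0.003731)` comes out at roughly 1.146 and `percent_time_change` at roughly -12.7, matching the reported "12.7% faster".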

Workload sizes:

| Benchmark | `nrxx` | `npw` | repeats |
| --- | --- | --- | --- |
| `PW_Basis.medium.roundtrip` | 320 | 49 | 4096 |
| `PW_Basis.large.roundtrip` | 729 | 147 | 2048 |
| `PW_Basis.xlarge.roundtrip` | 46656 | 11837 | 512 |
| `PW_Basis_K.medium.roundtrip` | 320 | 49 | 4096 |
| `PW_Basis_K.xlarge.roundtrip` | 46656 | 11897 | 512 |

The smaller cases run in a few microseconds per operation, so they are very sensitive to fixed overhead and scheduling noise. The more meaningful data points are the larger synthetic workloads. In particular, PW_Basis_K.xlarge shows about 1.15x median speedup. Compared with f4af81009, this PR also optimizes the nrxx-sized complex buffer copies in PW_Basis_K::real2recip() and PW_Basis_K::recip2real().

Aunixt added 4 commits May 31, 2026 00:24

have a try

7c58a45

refine complex buffer copies in module_pw

c268969

add module_pw complex transform round-trip tests

754fe85

document module_pw copy helpers and tests

25ebe2e

Copilot AI review requested due to automatic review settings May 30, 2026 16:54

Copilot AI reviewed May 30, 2026

mohanchen added the project_learning label May 31, 2026

Aunixt added 3 commits June 4, 2026 20:16

Merge branch 'deepmodeling:develop' into feat/simd

f3a0b6b

remove pragma GCC ivdep and use std::copy_n

3245d2d

add test for simd

d9d84e7

mohanchen requested a review from Qianruipku June 9, 2026 05:40

Qianruipku requested changes Jun 11, 2026

Comment thread work_docs/feat_simd_optimization_process_report_2026-06-06.md Outdated

Aunixt added 4 commits June 15, 2026 19:35

remove work_docs

be693c5

merge develop into feat/simd

dba326d

remove pw_simd_bench.cpp

7032d2c

build: remove SIMD benchmark target

75135df

Qianruipku approved these changes Jun 16, 2026

Qianruipku merged commit 128d8d8 into deepmodeling:develop Jun 16, 2026
15 checks passed

Qianruipku merged 11 commits into deepmodeling:develop from mystic-qaq:feat/simd
