Refine complex buffer copies and add round-trip tests for module_pw#7412

Merged
Qianruipku merged 11 commits into deepmodeling:develop from mystic-qaq:feat/simd on Jun 16, 2026

@Aunixt Aunixt commented May 30, 2026

Reminder

  • Have you linked an issue with this pull request?
  • Have you added adequate unit tests and/or case tests for your pull request?
  • Have you noticed possible changes of behavior below or in the linked issue?
  • Have you explained the changes of codes in core modules of ESolver, HSolver, ElecState, Hamilt, Operator or Psi? (ignore if not applicable)

Linked Issue

Fix #...

Unit Tests and/or Case Tests for my changes

  • A unit test is added

What's changed?

This pull request refactors the copying of complex buffers in the plane-wave basis module to use new helper functions that improve performance and code clarity, especially with respect to vectorization and parallelization. It also adds comprehensive round-trip tests for complex-to-complex plane-wave transforms to ensure correctness. The most important changes are summarized below.

Performance and Code Quality Improvements

  • Introduced detail::copy_complex_buffer and detail::copy_complex_buffer_parallel helper functions in pw_gatherscatter.h to efficiently copy complex buffers, enabling better compiler vectorization and OpenMP parallelization. All manual loops copying complex arrays have been replaced with calls to these helpers.
  • Refactored all relevant methods in pw_gatherscatter.h, pw_transform.cpp, and pw_transform_k.cpp to use the new helper functions for copying complex buffers, replacing explicit for-loops and OpenMP pragmas with the new, more maintainable approach.

Testing and Validation

  • Added new round-trip tests for both PW_Basis and PW_Basis_K classes to verify that a reciprocal-to-real followed by a real-to-reciprocal transform accurately recovers the original complex data, ensuring the correctness of the new buffer copy implementation.
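
The shape of such a round-trip check can be sketched without the ABACUS classes. The example below is only an illustration: it swaps in a naive O(n^2) DFT for the real `recip2real` / `real2recip` calls, and the names `dft`, `round_trip_error`, and the 1/n normalization on the forward step are assumptions, not the conventions used by PW_Basis.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

namespace sketch {

using cd = std::complex<double>;

// Naive unnormalized DFT (hypothetical stand-in for the FFT-backed transforms).
// sign = +1 plays the role of reciprocal -> real, sign = -1 of real -> reciprocal.
inline std::vector<cd> dft(const std::vector<cd>& in, int sign)
{
    const std::size_t n = in.size();
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k)
    {
        for (std::size_t j = 0; j < n; ++j)
        {
            // Reduce j*k modulo n before forming the phase to keep the argument small.
            const double phase = sign * two_pi * static_cast<double>((j * k) % n) / static_cast<double>(n);
            out[k] += in[j] * cd(std::cos(phase), std::sin(phase));
        }
    }
    return out;
}

// Runs reciprocal -> real -> reciprocal and returns the largest absolute
// deviation from the input after applying the assumed 1/n normalization.
inline double round_trip_error(const std::vector<cd>& recip)
{
    const std::vector<cd> real = dft(recip, +1);
    std::vector<cd> back = dft(real, -1);
    double max_err = 0.0;
    for (std::size_t i = 0; i < recip.size(); ++i)
    {
        back[i] /= static_cast<double>(recip.size());
        max_err = std::max(max_err, std::abs(back[i] - recip[i]));
    }
    return max_err;
}

} // namespace sketch
```

A test built this way asserts that `round_trip_error` stays below a small tolerance, which catches a copy helper that drops, reorders, or truncates elements between the two transforms.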

These changes collectively improve the performance, maintainability, and reliability of the plane-wave basis transformation routines.

Any changes of core modules? (ignore if not applicable)

  • Example: I have added a new virtual function in the esolver base class in order to ...

Copilot AI review requested due to automatic review settings May 30, 2026 16:54

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Refactors repeated complex-buffer copy loops in the plane-wave transform code into shared helper templates (detail::copy_complex_buffer and detail::copy_complex_buffer_parallel) that operate on the underlying scalar stream to aid vectorization. Adds round-trip serial unit tests for PW_Basis and PW_Basis_K.

Changes:

  • Introduced detail::copy_complex_buffer[_parallel] helpers in pw_gatherscatter.h using __restrict__ and ivdep hints over the interleaved real/imag scalar stream.
  • Replaced inline copy loops in pw_transform.cpp, pw_transform_k.cpp, and gather/scatter routines with calls to the new helpers.
  • Added ComplexTransformRoundTrip serial tests for PW_Basis and PW_Basis_K.
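
As a rough illustration of the "interleaved real/imag scalar stream" idea, the sketch below copies a `std::complex<T>` buffer as `2 * count` scalars. It is not the code from `pw_gatherscatter.h`: the function name `copy_complex_as_scalars` and the `SKETCH_RESTRICT` macro are hypothetical, and the `ivdep` pragma is only emitted for GCC here.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// __restrict__ is a GCC/Clang extension; fall back to nothing elsewhere.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_RESTRICT __restrict__
#else
#define SKETCH_RESTRICT
#endif

namespace sketch {

// Copies count complex values from src to dst. The buffers must not overlap,
// which is what the restrict qualifiers promise to the compiler.
template <typename T>
void copy_complex_as_scalars(std::complex<T>* SKETCH_RESTRICT dst,
                             const std::complex<T>* SKETCH_RESTRICT src,
                             std::size_t count)
{
    // An array of std::complex<T> may be accessed as an array of T laid out
    // as re0, im0, re1, im1, ... (guaranteed by the C++ standard).
    T* SKETCH_RESTRICT d = reinterpret_cast<T*>(dst);
    const T* SKETCH_RESTRICT s = reinterpret_cast<const T*>(src);
    const std::size_t n = 2 * count;
#if defined(__GNUC__) && !defined(__clang__)
#pragma GCC ivdep
#endif
    for (std::size_t i = 0; i < n; ++i)
    {
        d[i] = s[i];
    }
}

} // namespace sketch
```

Working on the scalar stream gives the vectorizer a plain contiguous loop over `T`, and the no-alias hints remove the need for runtime overlap checks.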

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| source/source_basis/module_pw/pw_gatherscatter.h | Adds shared copy helpers and refactors gather/scatter loops to use them. |
| source/source_basis/module_pw/pw_transform.cpp | Replaces manual OpenMP copy loops with copy_complex_buffer_parallel. |
| source/source_basis/module_pw/pw_transform_k.cpp | Same refactor applied to k-point variant. |
| source/source_basis/module_pw/test_serial/pw_basis_test.cpp | New round-trip test for recip2real/real2recip. |
| source/source_basis/module_pw/test_serial/pw_basis_k_test.cpp | New round-trip test for the k-point variant. |

Comment thread source/source_basis/module_pw/pw_gatherscatter.h Outdated
Comment thread source/source_basis/module_pw/pw_gatherscatter.h Outdated
Comment thread source/source_basis/module_pw/pw_gatherscatter.h
Comment thread source/source_basis/module_pw/test_serial/pw_basis_test.cpp
Comment thread source/source_basis/module_pw/test_serial/pw_basis_k_test.cpp
@mohanchen

Collaborator

This refactoring idea is quite interesting. Feel free to move forward with your implementation.

@mohanchen mohanchen requested a review from Qianruipku June 9, 2026 05:40
@mohanchen

Collaborator

Please reply to the review comments once you have resolved them or know how to answer them.

@Qianruipku Qianruipku left a comment

Collaborator

Please provide a comparison of the execution time before and after optimization.

Comment thread work_docs/feat_simd_optimization_process_report_2026-06-06.md Outdated
Aunixt commented Jun 15, 2026

Author

My original work on feat/simd targeted the data reordering/copy path around the module_pw FFT transforms, mainly the real/reciprocal buffer copies and gather/scatter helpers used by PW_Basis and PW_Basis_K. This overlaps strongly with commit f4af81009368cba317a2948114edef8c1eedf940, which has already landed on develop:

f4af81009 perf(pw_basis): optimize FFT data reordering with memcpy SIMD vectori... (#7432)

That commit is already included in develop, and it is also included in the current feat/simd branch after merging develop. It changed:

source/source_basis/module_pw/pw_gatherscatter.h
source/source_basis/module_pw/pw_transform.cpp
source/source_hamilt/operator.h

The module_pw part of that commit covers one of the same hot paths this PR originally targeted. As a result, the current comparison against latest develop already contains a similar round of pw_basis FFT data reordering / memcpy / SIMD-vectorization work.

Compared with f4af81009, this PR is more complete in coverage and engineering structure. The main differences are:

  • This PR introduces ModulePW::detail::copy_complex_buffer() as a shared helper for std::complex<T> buffer copies, instead of repeating raw memcpy(..., count * sizeof(std::complex<T>)) calls in multiple places.
  • It also adds copy_complex_buffer_parallel() for top-level transform buffer copies. The helper only chunks and parallelizes larger buffers; small buffers still use the regular helper to avoid extra parallel overhead. The benchmark below uses USE_OPENMP=OFF, so this potential OpenMP benefit is not included in the reported numbers.
  • The PR covers not only PW_Basis gather/scatter helpers, but also the nrxx-sized complex buffer copies in PW_Basis_K::real2recip() and PW_Basis_K::recip2real(). This PW_Basis_K path was not covered by f4af81009.
  • It adds round-trip correctness tests, so the PW_Basis / PW_Basis_K transform behavior and performance can be checked directly.
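
The "only chunk and parallelize larger buffers" behaviour described above can be sketched as follows. This is a minimal illustration under assumptions: `kParallelThreshold`, `kChunk`, and the function names are hypothetical, since the thread does not state the PR's actual cutoff or chunk size. The OpenMP pragma is guarded, so the code also builds with `USE_OPENMP=OFF`.

```cpp
#include <algorithm>
#include <cassert>
#include <complex>
#include <cstddef>

namespace sketch {

// Hypothetical tuning constants, chosen only for illustration.
constexpr std::size_t kParallelThreshold = 4096; // below this, copy serially
constexpr std::size_t kChunk = 1024;             // elements per parallel chunk

// Plain serial copy of count complex values.
template <typename T>
void copy_serial(std::complex<T>* dst, const std::complex<T>* src, std::size_t count)
{
    std::copy_n(src, count, dst);
}

// Small buffers take the serial path to avoid thread start-up cost;
// large buffers are split into fixed-size chunks that threads copy independently.
template <typename T>
void copy_parallel(std::complex<T>* dst, const std::complex<T>* src, std::size_t count)
{
    if (count < kParallelThreshold)
    {
        copy_serial(dst, src, count);
        return;
    }
    const long nchunks = static_cast<long>((count + kChunk - 1) / kChunk);
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (long c = 0; c < nchunks; ++c)
    {
        const std::size_t begin = static_cast<std::size_t>(c) * kChunk;
        const std::size_t len = std::min(kChunk, count - begin); // last chunk may be short
        copy_serial(dst + begin, src + begin, len);
    }
}

} // namespace sketch
```

Without OpenMP the chunk loop simply runs on one thread, which matches the note that the reported benchmark does not include any parallel benefit.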

Benchmark setup:

| Role | Revision | Notes |
| --- | --- | --- |
| Baseline | develop@420f1ad00 | Same benchmark harness was added so both branches build the same test executable. |
| Optimized | feat/simd@dba326d99 | Current PR branch, after merging latest develop. |

Environment and build options:

CPU: 13th Gen Intel(R) Core(TM) i5-13500H
OS: WSL2, Linux 6.6.87.2-microsoft-standard-WSL2
Compiler: GCC 13.3.0
CMake: Release, ENABLE_MPI=OFF, USE_OPENMP=OFF
Runtime: OMP_NUM_THREADS=1 OPENBLAS_NUM_THREADS=1 taskset -c 0
Runs: develop / feat-simd alternated, 9 runs each, reporting median and IQR
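
For readers who want to reproduce the summary statistics, the sketch below computes a median and interquartile range from a set of per-run timings. The quartile convention (linear interpolation between sorted samples) is an assumption; the thread does not say which one the benchmark harness uses, and `quantile` and `summarize` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {

// Quantile with linear interpolation between neighbouring sorted samples.
// Takes the vector by value so the caller's data is left unsorted.
inline double quantile(std::vector<double> v, double q)
{
    assert(!v.empty() && q >= 0.0 && q <= 1.0);
    std::sort(v.begin(), v.end());
    const double pos = q * static_cast<double>(v.size() - 1);
    const std::size_t lo = static_cast<std::size_t>(pos);
    const std::size_t hi = std::min(lo + 1, v.size() - 1);
    const double frac = pos - static_cast<double>(lo);
    return v[lo] + frac * (v[hi] - v[lo]);
}

struct Summary
{
    double median;
    double iqr; // 75th percentile minus 25th percentile
};

inline Summary summarize(const std::vector<double>& samples)
{
    return Summary{quantile(samples, 0.5), quantile(samples, 0.75) - quantile(samples, 0.25)};
}

} // namespace sketch
```

With 9 runs per branch, the median is the 5th smallest timing, so a single outlier run does not move the reported value.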

Results are in ms/op:

| Benchmark | develop median | feat/simd median | Speedup | Change |
| --- | --- | --- | --- | --- |
| PW_Basis.medium.roundtrip | 0.006101 | 0.006746 | 0.904x | 10.6% slower |
| PW_Basis.large.roundtrip | 0.010797 | 0.012378 | 0.872x | 14.6% slower |
| PW_Basis.xlarge.roundtrip | 3.192069 | 3.086157 | 1.034x | 3.3% faster |
| PW_Basis_K.medium.roundtrip | 0.004276 | 0.003731 | 1.146x | 12.7% faster |
| PW_Basis_K.xlarge.roundtrip | 3.165223 | 2.764272 | 1.145x | 12.7% faster |
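
The speedup and change figures are consistent with the following arithmetic, shown here as a small sketch. The formulas are inferred from the reported numbers rather than taken from the benchmark harness, and the function names are hypothetical.

```cpp
#include <cassert>

namespace sketch {

// Speedup > 1 means the candidate branch takes less time per operation.
inline double speedup(double baseline_ms, double candidate_ms)
{
    assert(baseline_ms > 0.0 && candidate_ms > 0.0);
    return baseline_ms / candidate_ms;
}

// Relative change in time per operation, in percent of the baseline.
// Negative values mean the candidate is faster, positive values slower.
inline double percent_time_change(double baseline_ms, double candidate_ms)
{
    assert(baseline_ms > 0.0);
    return (candidate_ms - baseline_ms) / baseline_ms * 100.0;
}

} // namespace sketch
```

For example, `speedup(3.165223, 2.764272)` gives roughly 1.145 for PW_Basis_K.xlarge, and `percent_time_change(0.006101, 0.006746)` gives roughly +10.6 for PW_Basis.medium, matching the reported "10.6% slower".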

Workload sizes:

| Benchmark | nrxx | npw | repeats |
| --- | --- | --- | --- |
| PW_Basis.medium.roundtrip | 320 | 49 | 4096 |
| PW_Basis.large.roundtrip | 729 | 147 | 2048 |
| PW_Basis.xlarge.roundtrip | 46656 | 11837 | 512 |
| PW_Basis_K.medium.roundtrip | 320 | 49 | 4096 |
| PW_Basis_K.xlarge.roundtrip | 46656 | 11897 | 512 |
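
A ms/op figure with a fixed repeat count can be obtained with a timing loop like the one below. This is a generic sketch, not the harness used for the numbers above; `time_per_op_ms` is a hypothetical name, and the callable stands in for one round-trip transform.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

namespace sketch {

// Runs op() repeats times and returns the mean wall-clock time per call in ms.
template <typename Fn>
double time_per_op_ms(Fn&& op, std::size_t repeats)
{
    assert(repeats > 0);
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < repeats; ++i)
    {
        op();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(repeats);
}

} // namespace sketch
```

Small workloads use more repeats so that the total measured interval stays well above the clock resolution, which is one reason the medium cases run 4096 times and the xlarge cases only 512.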

The smaller cases are very sensitive to fixed overhead and scheduling noise. The more meaningful data point here is the larger synthetic workload. In particular, PW_Basis_K.xlarge shows about 1.15x median speedup. Compared with f4af81009, this PR also optimizes the nrxx-sized complex buffer copies in PW_Basis_K::real2recip() and PW_Basis_K::recip2real().

@Qianruipku Qianruipku merged commit 128d8d8 into deepmodeling:develop Jun 16, 2026
15 checks passed