8382482: Remove and reorder instructions in x86 scalar floating point min/max reduction loops #30806

Open
missa-prime wants to merge 5 commits into openjdk:master from missa-prime:user/missa-prime/avx10_2
Conversation

@missa-prime
Contributor

@missa-prime missa-prime commented Apr 17, 2026

With AVX10, we can branch on the more common cases (e.g., "==") before checking parity, which is not possible with previous instruction sets. These changes split the code paths between AVX10 and non-AVX10 so that basic blocks are ordered differently in the scalar floating point min/max reduction rules. Additionally, some unnecessary instructions for zero and non-zero input values in the "==" case are removed. The JTREG tests listed below were used to verify correctness, with the recommended JVM options mentioned in the corresponding source files. All modifications and tests used OpenJDK v27-b18 as the baseline build.

  1. jtreg:test/hotspot/jtreg/compiler/igvn/TestMinMaxIdentity.java
  2. jtreg:test/hotspot/jtreg/compiler/intrinsics/math/TestFpMinMaxIntrinsics.java
  3. jtreg:test/hotspot/jtreg/compiler/intrinsics/math/TestFpMinMaxReductions.java
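
For readers less familiar with the semantics these rules must preserve, the following is a minimal source-level sketch of a float max reduction in which the "==" outcome is tested before the unordered (NaN) outcome. It is only an illustration of the check ordering: the class and method names (`MaxReductionSketch`, `maxEqualFirst`, `maxReduce`) are hypothetical, and it is not the HotSpot code or the assembly emitted by this PR.

```java
// Hypothetical illustration only; not HotSpot code and not the emitted assembly.
final class MaxReductionSketch {

    // Same result as Math.max(float, float), but written so that the
    // ordered outcomes (==, >, <) are handled before the unordered one.
    static float maxEqualFirst(float acc, float v) {
        if (acc == v) {
            // Equal values: only +0.0f versus -0.0f needs care, because
            // they compare equal but max must prefer +0.0f.
            // Adding two zeros yields -0.0f only when both are -0.0f.
            return (acc == 0.0f) ? acc + v : acc;
        }
        if (acc > v) {
            return acc;
        }
        if (acc < v) {
            return v;
        }
        // Unordered: at least one operand is NaN.
        return Float.NaN;
    }

    // Scalar max reduction over an array, the loop shape the benchmarks exercise.
    static float maxReduce(float[] values) {
        float acc = Float.NEGATIVE_INFINITY;
        for (float v : values) {
            acc = maxEqualFirst(acc, v);
        }
        return acc;
    }

    public static void main(String[] args) {
        float[] same = new float[1024];
        java.util.Arrays.fill(same, 1.0f); // identical values hit the "==" case every iteration
        System.out.println(maxReduce(same));
    }
}
```

With identical array values every iteration takes the "==" path, which is why that input shape is the one where a reordering of the checks would be expected to matter most.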

To observe the performance uplift, the benchmarks in the FpMinMaxIntrinsics.java JMH source should be run with identical values in the arrays. The table below contains data captured under these conditions on an Intel® Xeon 6767P. Only the benchmarks affected by the code changes are included. Overall, the changes give a geomean speedup of 1.20x, i.e. a 20% improvement.

| Benchmark | Baseline runtime (ns/op) | Target runtime (ns/op) | Speedup |
|---|---|---|---|
| dMaxReduceGlobalAccumulator | 1038.222 | 796.514 | 1.30x |
| dMaxReduceInOuterLoop | 3166.615 | 3130.784 | 1.01x |
| dMaxReduceNonCounted | 1039.919 | 796.675 | 1.31x |
| dMinReduceGlobalAccumulator | 1040.935 | 796.212 | 1.31x |
| dMinReduceInOuterLoop | 3094.092 | 3053.189 | 1.01x |
| dMinReduceNonCounted | 1039.488 | 797.242 | 1.30x |
| fMaxReduceGlobalAccumulator | 1038.396 | 795.881 | 1.30x |
| fMaxReduceInOuterLoop | 3183.612 | 3123.759 | 1.02x |
| fMaxReduceNonCounted | 1039.947 | 797.281 | 1.30x |
| fMinReduceGlobalAccumulator | 1040.677 | 795.979 | 1.31x |
| fMinReduceInOuterLoop | 3138.846 | 3113.044 | 1.01x |
| fMinReduceNonCounted | 1040.162 | 797.090 | 1.30x |
| Geomean | 1503.76 | 1253.678 | 1.20x |
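
As a sanity check on the Geomean row, the geometric mean can be recomputed from the per-benchmark runtimes above. The sketch below is not part of the PR; the class name `GeomeanCheck` and its fields are hypothetical, and the arrays simply copy the twelve baseline and target values from the table.

```java
// Hypothetical helper, not part of the PR: recomputes the Geomean row.
final class GeomeanCheck {

    // Baseline runtimes (ns/op), in table order.
    static final double[] BASELINE = {
        1038.222, 3166.615, 1039.919, 1040.935, 3094.092, 1039.488,
        1038.396, 3183.612, 1039.947, 1040.677, 3138.846, 1040.162
    };

    // Target runtimes (ns/op), in table order.
    static final double[] TARGET = {
        796.514, 3130.784, 796.675, 796.212, 3053.189, 797.242,
        795.881, 3123.759, 797.281, 795.979, 3113.044, 797.090
    };

    // Geometric mean computed via the mean of logarithms to avoid overflow.
    static double geomean(double[] xs) {
        double sumLog = 0.0;
        for (double x : xs) {
            sumLog += Math.log(x);
        }
        return Math.exp(sumLog / xs.length);
    }

    public static void main(String[] args) {
        double base = geomean(BASELINE);
        double target = geomean(TARGET);
        System.out.printf("baseline=%.3f target=%.3f speedup=%.2fx%n",
                          base, target, base / target);
    }
}
```

The speedup is taken as baseline divided by target, so the 1.20x figure is a ratio of runtimes rather than a percentage reduction in runtime.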

Performance does not regress with more varied data, however. This PR also updates the array values in FpMinMaxIntrinsics.java to include random and structured patterns: 50% random, 20% zeroes, 10% descending, 10% ascending, and 10% NaNs. The entries are interspersed uniformly throughout the arrays. The table below shows results collected with this new scheme. Overall, the geomean runtime remains flat when the changes are applied.

| Benchmark | Baseline runtime (ns/op) | Target runtime (ns/op) | Speedup |
|---|---|---|---|
| dMaxReduceGlobalAccumulator | 668.804 | 668.761 | 1.00x |
| dMaxReduceInOuterLoop | 3036.979 | 2987.393 | 1.02x |
| dMaxReduceNonCounted | 673.008 | 672.938 | 1.00x |
| dMinReduceGlobalAccumulator | 668.403 | 668.937 | 1.00x |
| dMinReduceInOuterLoop | 2986.121 | 2987.771 | 1.00x |
| dMinReduceNonCounted | 673.293 | 672.864 | 1.00x |
| fMaxReduceGlobalAccumulator | 668.324 | 668.465 | 1.00x |
| fMaxReduceInOuterLoop | 3141.699 | 3138.469 | 1.00x |
| fMaxReduceNonCounted | 669.225 | 669.407 | 1.00x |
| fMinReduceGlobalAccumulator | 668.584 | 668.199 | 1.00x |
| fMinReduceInOuterLoop | 3139.533 | 3114.516 | 1.01x |
| fMinReduceNonCounted | 672.990 | 672.786 | 1.00x |
| Geomean | 1113.838 | 1111.488 | 1.00x |
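
One way to produce an input mix with the proportions described above (50% random, 20% zeroes, 10% descending, 10% ascending, 10% NaNs, interleaved uniformly) is sketched below. This is an assumption about how such data could be generated, not the code added to FpMinMaxIntrinsics.java; the class name `MixedInputSketch`, the method `mixedFloats`, and the choice of a repeating 10-slot pattern are all hypothetical.

```java
import java.util.Random;

// Hypothetical sketch; the PR's actual generator may differ.
final class MixedInputSketch {

    // Fills an array using a repeating 10-slot pattern so that each
    // category is spread evenly through the array:
    //   slots 0-4: random, slots 5-6: zero, slot 7: descending,
    //   slot 8: ascending, slot 9: NaN.
    static float[] mixedFloats(int n, long seed) {
        Random rnd = new Random(seed);
        float[] a = new float[n];
        for (int i = 0; i < n; i++) {
            switch (i % 10) {
                case 0: case 1: case 2: case 3: case 4:
                    a[i] = rnd.nextFloat();   // random value in [0, 1)
                    break;
                case 5: case 6:
                    a[i] = 0.0f;              // zeroes
                    break;
                case 7:
                    a[i] = (float) (n - i);   // decreases as i grows
                    break;
                case 8:
                    a[i] = (float) i;         // increases as i grows
                    break;
                default:
                    a[i] = Float.NaN;         // NaNs
                    break;
            }
        }
        return a;
    }

    public static void main(String[] args) {
        float[] a = mixedFloats(1000, 42L);
        int nans = 0;
        for (float v : a) {
            if (Float.isNaN(v)) {
                nans++;
            }
        }
        System.out.println("NaN entries: " + nans);
    }
}
```

A fixed seed keeps the random portion reproducible between the baseline and target runs, which matters when comparing runtimes that differ by only a few percent.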


Progress

  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue
  • Change must be properly reviewed (2 reviews required, with at least 1 Reviewer, 1 Author)

Issue

  • JDK-8382482: Remove and reorder instructions in x86 scalar floating point min/max reduction loops (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/30806/head:pull/30806
$ git checkout pull/30806

Update a local copy of the PR:
$ git checkout pull/30806
$ git pull https://git.openjdk.org/jdk.git pull/30806/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 30806

View PR using the GUI difftool:
$ git pr show -t 30806

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/30806.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper Bot commented Apr 17, 2026

👋 Welcome back missa! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk Bot commented Apr 17, 2026

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk Bot added the hotspot-compiler hotspot-compiler-dev@openjdk.org label Apr 17, 2026
@openjdk

openjdk Bot commented Apr 17, 2026

@missa-prime The following label will be automatically applied to this pull request:

  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command.

@openjdk

openjdk Bot commented Apr 17, 2026

The total number of required reviews for this PR has been set to 2 based on the presence of this label: hotspot-compiler. This can be overridden with the /reviewers command.

@missa-prime missa-prime marked this pull request as ready for review April 18, 2026 00:14
@openjdk openjdk Bot added the rfr Pull request is ready for review label Apr 18, 2026
@mlbridge

mlbridge Bot commented Apr 18, 2026

Webrevs

Contributor

@galderz galderz left a comment

In #30844, I'm expanding the MinMaxVector benchmark to test fp values. Some of the benchmarks target reduction use cases. Maybe you could run the fp benchmarks there and see what impact your changes have on them when using avx10? IIRC you might have to disable superword to get the branching assembly to kick in.

@missa-prime
Contributor Author

> In #30844, I'm expanding the MinMaxVector benchmark to test fp values. Some of the benchmarks target reduction use cases. Maybe you could run the fp benchmarks there and see what impact your changes have on them when using avx10? IIRC you might have to disable superword to get the branching assembly to kick in.

I ran MinMaxVector with the default benchmark parameters and the -XX:-UseSuperWord VM argument on builds with and without the changes. Overall, I don't see much change in performance, apart from some outliers in both the non-AVX10 and AVX10 paths. I believe that most of the time -XX:+UseSuperWord will be in effect and loop auto-vectorization will work as expected, so these scalar floating point min/max reduction rules won't be used very often.
