
[optimize] PrecedingFilter: K-bounded sliding window for preceding::*[K] (#2129 follow-up to PR #6325) #6330

Open
joewiz wants to merge 3 commits into eXist-db:develop from joewiz:perf/2129-preceding-axis-residual

@joewiz joewiz commented May 10, 2026

Summary

Damps the position-dependence of $node/preceding::*[K] by switching LocationStep.PrecedingFilter to a K-bounded sliding window when a positional predicate is present. Companion to PR #6325 (which closed the following::* half of #2129); this PR closes the more constrained preceding::* half.

Root cause

The wildcard preceding-axis path in LocationStep.getPrecedingOrFollowing walks an IEmbeddedXMLStreamReader from the document root and applies PrecedingFilter to collect every match into the result NodeSet, then defers [K] to the predicate machinery. On a 50,000-element flat document at @xml:id=45000, that meant ~45,000 NodeProxy allocations, ~45,000 result.add() calls, and an O(N log N) sort downstream, even though [5] only needs the 5 matches nearest the reference to select its single result. craigberry reported this as a 12 s page-turn impact in an 1100-page book (#2129 comment).

Unlike following::* (PR #6325, which simply repositions the StAX reader to start at the reference node), the preceding::* walk cannot be skipped: matches must be emitted before the reference, and the reader is forward-only.

What changed

exist-core/src/main/java/org/exist/xquery/LocationStep.java

  • PrecedingFilter now accepts a limit parameter (mirroring FollowingFilter's constructor), passed from getPrecedingOrFollowing via the existing computeLimit() extraction (no separate optimizer pass needed — checkPositionalFilters already gates the optimization to integer-literal positional predicates with no CONTEXT_POSITION dependency).
  • When limit > 0, the filter maintains an ArrayDeque<NodeProxy> of size K. New matches enter at the tail; the oldest is evicted at the head when capacity is reached. The window flushes into the result NodeSet on filter termination (reference node reached or root END_ELEMENT).
  • Why a sliding window instead of FollowingFilter's "stop after K" pattern: preceding-axis positional [K] selects the K-th match in axis order = (K-th-from-end) in document order. We have to keep walking past every match to know which K are the most recent.
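
For reviewers who want the two filter shapes side by side, here is a minimal sketch. It is not the code in LocationStep.java: the class AxisWindowSketch and the methods firstK and lastK are hypothetical names, and plain integers stand in for NodeProxy matches. It only illustrates why the following axis can stop after K matches while the preceding axis has to keep a rolling window of the last K.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the filter logic; not eXist-db's actual PrecedingFilter. */
final class AxisWindowSketch {

    /** Following-axis shape: the first K matches are final, so the walk can stop early. */
    static List<Integer> firstK(List<Integer> matchesInDocOrder, int k) {
        List<Integer> result = new ArrayList<>();
        for (int m : matchesInDocOrder) {
            if (result.size() >= k) {
                break; // nothing later can displace these
            }
            result.add(m);
        }
        return result;
    }

    /** Preceding-axis shape: every match is visited, only the K most recent are retained. */
    static List<Integer> lastK(List<Integer> matchesInDocOrder, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        ArrayDeque<Integer> window = new ArrayDeque<>(k);
        for (int m : matchesInDocOrder) {
            if (window.size() == k) {
                window.pollFirst(); // evict the oldest match at the head
            }
            window.addLast(m); // the newest match enters at the tail
        }
        return new ArrayList<>(window); // flush in document order
    }
}
```

With matches 1 to 8 and K = 3, firstK keeps 1, 2, 3 and never looks at the rest, whereas lastK visits all eight and ends with 6, 7, 8.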

exist-core/src/test/xquery/optimizer/positional.xqm

  • ot:optimize-simple-preceding previously asserted the absence of the POSITIONAL_PREDICATE optimization on preceding::*[1], documenting the prior gap. The assertion is flipped to expect the optimization, mirroring the existing ot:optimize-simple-following-nested case at line 170.

exist-core/src/test/java/org/exist/xquery/PrecedingAxisPositionRegressionTest.java (new)

Mirrors the structure of FollowingAxisPositionRegressionTest (PR #6325):

  • correctness at 3 reference positions (early xml:id=10, mid 25, late 45)
  • ancestor exclusion on a nested document
  • K=1..4 axis-order semantics (k-th preceding * from w[5] is w[5-k])
  • position-independence threshold (3× / 500 ms floor) on a 50,000-element flat document
  • wildcard-vs-preceding-sibling::w ratio comparison
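
The axis-order case in the list above (the k-th preceding element from w[5] is w[5-k]) can be modelled in a few lines. This is an illustrative sketch only: PrecedingPositionSketch and kthPreceding are hypothetical names, the document is assumed to be flat (so there are no ancestors to exclude), and the sketch picks the answer directly from the window, whereas the PR flushes the window into the result NodeSet and leaves the selection to the predicate machinery.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Optional;

/** Hypothetical model of preceding::*[k] on a flat list of elements. */
final class PrecedingPositionSketch {

    /**
     * Returns the k-th preceding item in axis order (nearest first) of the item
     * at the 0-based index refIndex, or empty if fewer than k items precede it.
     */
    static Optional<String> kthPreceding(List<String> docOrder, int refIndex, int k) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        ArrayDeque<String> window = new ArrayDeque<>(k);
        for (int i = 0; i < refIndex; i++) { // forward-only walk up to the reference
            if (window.size() == k) {
                window.pollFirst();
            }
            window.addLast(docOrder.get(i));
        }
        // In a full window the head is the k-th from the end in document order,
        // which is position k in (reverse) axis order.
        return window.size() == k ? Optional.of(window.peekFirst()) : Optional.empty();
    }
}
```

For a list w1 to w8 with the reference at w5 (index 4), k = 1 gives w4 and k = 4 gives w1; k = 5 gives an empty result because only four elements precede the reference.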

Performance impact

50,000-element flat document, 5-trial median per data point, A/B comparison via runtime kill switch on the K-bounded path:

| metric | before fix | after fix |
| --- | --- | --- |
| lateMs / earlyMs ratio (xml:id=45000 vs 5000) | ~2.55× | ~2.02× |
| wildcard-vs-preceding-sibling:: ratio | ~1.75× | ~1.52× |

The wildcard ratio is closer to but not yet at craigberry's reported ~1.33× sibling baseline. The StAX walk itself remains O(refPosition) since matches must be emitted before the reference and the reader is forward-only, so absolute time still grows with position. Eliminating that would require a different approach (e.g., backward navigation through the persistent NodeId structure for wildcard tests). The K-bounded buffer is a clean, conservative win on the allocation/sort axis and a prerequisite for any later walk-avoidance work.
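
To make the remaining cost concrete, the sketch below counts what a forward walk does on a flat document. It is a toy model rather than a measurement of eXist-db: WalkCostSketch and cost are hypothetical names, element ids are assumed to run from 1 upward, and k <= 0 is used here to model the old unbounded collection.

```java
import java.util.ArrayDeque;

/** Hypothetical cost model: matches visited vs. matches retained during the walk. */
final class WalkCostSketch {

    /** Returns {matchesVisited, peakRetained} for a reference at refPosition. */
    static int[] cost(int refPosition, int k) {
        ArrayDeque<Integer> retained = new ArrayDeque<>();
        int visited = 0;
        int peak = 0;
        for (int id = 1; id < refPosition; id++) { // the walk cannot be skipped
            visited++;
            if (k > 0 && retained.size() == k) {
                retained.pollFirst(); // bounded path: drop the oldest
            }
            retained.addLast(id);
            peak = Math.max(peak, retained.size());
        }
        return new int[] {visited, peak};
    }
}
```

In this model, cost(45000, 0) visits 44,999 matches and retains all of them, while cost(45000, 5) visits the same 44,999 but never retains more than 5. That is the sense in which the change helps allocation and sorting without shortening the walk.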

Test plan

  • PrecedingAxisPositionRegressionTest — 7 tests, all green
  • xquery.optimizer.OptimizerTests — 60/60 pass after positional.xqm assertion flip
  • mvn test -pl exist-core: 6599 pass, 0 failures, 0 errors, 106 pre-existing skips
  • Codacy PMD on LocationStep.java — no new warnings (existing NPathComplexity on getPrecedingOrFollowing unchanged)
  • A/B perf measurement via runtime kill switch confirms K-bounded path fires for [K] literals and [K + 1 - $i] FLWOR-bound expressions

Closes

Partially addresses #2129. Full closure of the wildcard-vs-sibling gap (down to ~1.0× parity) requires walk-avoidance work left as a follow-up.

@joewiz joewiz requested a review from a team as a code owner May 10, 2026 04:23
joewiz added a commit to joewiz/exist that referenced this pull request May 12, 2026
…eview

Address duncdrum's review on PR eXist-db#6330 by splitting the mixed-purpose
PrecedingAxisPositionRegressionTest.java into two artifacts:

- exist-core-jmh/.../PrecedingAxisBenchmark.java: JMH benchmark for the
  performance comparison (wildcard preceding::* vs preceding-sibling::, at
  early/mid/late positions on a 50,000-element flat doc). JMH handles
  statistical aggregation natively; the bespoke nanoTime + median-of-N
  infrastructure is dropped.

- exist-core/src/test/xquery/preceding-axis.xql: XQSuite tests for the
  correctness assertions (early/mid/late reproducer output, ancestor
  exclusion on the preceding axis, axis-order positional predicate
  semantics).

The original JUnit class is removed, which also resolves the line-66
unused-variable Codacy complaint (SMALL_DOC) by deletion.

Full-module gate (per strengthened test-before-push SOP):
Tests run: 6597, Failures: 0, Errors: 0, Skipped: 106, BUILD SUCCESS.
JMH module builds clean.
joewiz commented May 12, 2026

[This response was co-authored with Claude Code. -Joe]

Done in dec9172. Split per your review:

  • Performance measurement → exist-core-jmh/src/main/java/org/exist/xquery/PrecedingAxisBenchmark.java (JMH, with @Param'd reference position for early/mid/late, and a preceding-sibling:: baseline for relative interpretation).
  • Correctness assertions → exist-core/src/test/xquery/preceding-axis.xql (XQSuite: reproducer at early/mid/late positions, ancestor exclusion on the preceding axis, axis-order positional predicate semantics).

The original mixed-purpose JUnit class is removed; that also resolves the line-66 SMALL_DOC unused-variable Codacy complaint by deletion.

Full-module gate: Tests run: 6597, Failures: 0, Errors: 0, Skipped: 106, BUILD SUCCESS. JMH module compiles clean.

@duncdrum duncdrum left a comment


Thank you, needs a rebase

joewiz and others added 2 commits May 12, 2026 12:56
Wildcard `preceding::*[K]` previously accumulated every preceding match
into the result NodeSet between document start and the reference node,
then let the predicate machinery pick the K-th. On a 50,000-element flat
document at @xml:id=45000, that meant ~45,000 NodeProxy allocations,
~45,000 result.add() calls, and an O(N log N) sort downstream, even
though only 5 elements could ever be selected by `[5]`.

The fix: when LocationStep.computeLimit() yields a positive K (the
existing positional-predicate detection used by FollowingFilter),
PrecedingFilter switches to a K-bounded sliding window. Matches are
buffered in an ArrayDeque sized to K, with the oldest evicted as new
ones arrive. The window flushes into the result NodeSet on filter
termination (reference node reached or root END_ELEMENT).

Why a sliding window instead of "stop after K" (the FollowingFilter
shape from PR eXist-db#6325): preceding-axis positional `[K]` selects the K-th
match in axis order = (K-th-from-end) in document order. We have to
keep walking past every match to know which K are the most recent.

Performance impact (50,000-element flat doc, 5 trials median):
- ratio lateMs/earlyMs: ~2.55x -> ~2.02x (position-dependence damped)
- wildcard-vs-sibling gap: ~1.75x -> ~1.52x (closer to craigberry's
  reported ~1.33x sibling baseline, not yet at parity)

The StAX walk itself remains O(refPosition) since matches must be
emitted before the reference and the reader is forward-only, so
absolute time still grows with position. Eliminating that would
require a different approach (e.g., backward navigation through the
persistent NodeId structure for wildcard tests). The K-bounded buffer
is a clean, conservative win on the allocation/sort axis and a
prerequisite for any later walk-avoidance work.

Tests:
 - PrecedingAxisPositionRegressionTest mirrors PR eXist-db#6325's
   FollowingAxisPositionRegressionTest: correctness at 3 reference
   positions (early, mid, late), ancestor exclusion, K=1..4 axis-order
   semantics, position-independence threshold, and a
   wildcard-vs-preceding-sibling comparison.
 - positional.xqm:180 `optimize-simple-preceding` documented the prior
   gap (no POSITIONAL_PREDICATE optimization on preceding axis); the
   assertion is flipped to expect the optimization, mirroring the
   existing `optimize-simple-following-nested` case at line 170.
 - exist-core suite: 6599 tests, 0 failures, 0 errors (106 pre-existing
   skips).

Partially addresses eXist-db#2129 (the
preceding-axis half; following-axis half is closed by PR eXist-db#6325). Full
closure of the sibling-vs-wildcard gap requires walk-avoidance, left as
follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@joewiz joewiz force-pushed the perf/2129-preceding-axis-residual branch from dec9172 to a1fe463 Compare May 12, 2026 16:58
joewiz commented May 12, 2026

[This response was co-authored with Claude Code. -Joe]

Rebased on current develop tip — clean except for one conflict in exist-core-jmh/pom.xml (your recent exist-index-lucene + lucene-core deps vs. our xmldb-api add for the new benchmark). Merged by keeping all three. New tip: a1fe4631d6. CI re-runs.

@line-o line-o added the xquery (issue is related to xquery implementation) and performance (bottlenecks, opportunities for rewriting, optimization) labels May 12, 2026
@duncdrum

1 Javadoc error, should be easy to spot locally. A bit hard to see in the mvnd reactor log on CI.

…n CI

Addresses duncdrum's CI Javadoc error on PR eXist-db#6330:
- Replace {@link LocationStep.PrecedingFilter} (unresolved reference)
  with {@code LocationStep.PrecedingFilter}.
- Add Javadoc for the default constructor.
- Add Javadoc with @throws to setUp and tearDown.
- Add @return and @throws to wildcardPrecedingWithPositionalPredicate
  and precedingSiblingBaseline.

The JMH-generated annotation classes (under target/generated-sources/)
also trip the doclint, but they're regenerated on every build — this
commit doesn't touch them; if their lint complaints block CI, the proper
fix is a javadoc-plugin excludePackageNames for org.exist.xquery.jmh_generated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
joewiz commented May 12, 2026

[This response was co-authored with Claude Code. -Joe]

Found it — 9f27c409a3 adds the missing Javadoc to PrecedingAxisBenchmark.java:

  • {@link LocationStep.PrecedingFilter} → {@code LocationStep.PrecedingFilter} (the link reference wasn't resolvable from the JMH module's classpath)
  • Default constructor Javadoc added
  • @throws added to @Setup / @TearDown methods
  • @return and @throws added to the two @Benchmark methods
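
As an illustration of the Javadoc shape these fixes aim for, here is a small sketch. It is not the content of PrecedingAxisBenchmark.java: JavadocSketch and parsePosition are hypothetical, and the class is deliberately free of JMH annotations so that it compiles on its own.

```java
/**
 * Hypothetical class showing the comment style described above. It mentions
 * {@code LocationStep.PrecedingFilter} as inline code, which needs no
 * resolvable reference, instead of linking to a type that may not be on the
 * module's classpath.
 */
final class JavadocSketch {

    /** Creates the sketch; written out so the constructor carries a comment. */
    JavadocSketch() {
    }

    /**
     * Parses a reference position such as the early, mid, or late xml:id.
     *
     * @param raw the position as text, optionally padded with whitespace
     * @return the parsed position
     * @throws NumberFormatException if {@code raw} is not an integer
     */
    int parsePosition(String raw) {
        return Integer.parseInt(raw.trim());
    }
}
```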

The CI log also flagged the JMH-generated annotation classes under target/generated-sources/annotations/org/exist/xquery/jmh_generated/ for "no comment" warnings. Those are regenerated every build — if doclint still treats them as errors after this push, the right fix is an excludePackageNames for org.exist.xquery.jmh_generated in the javadoc-plugin config (the sibling benchmarks under exist-core-jmh/ would presumably need the same exclusion, suggesting it's a missing module-level config). Happy to add that as a follow-up commit if CI still flags them.
