Api: Support variant extract and fix manifest bounds byte order by qlong · Pull Request #15384 · apache/iceberg

qlong · 2026-02-20T18:18:29Z

Changes:

Add support for bracket-style and mixed-style JSONPath, such as $.user.name or $['user']['name']. This change is needed to support dots or other special characters in field names
Handle UnboundExtract/BoundExtract in ExpressionUtil.describe and unbind
Variant bounds fix: Use little-endian ByteBuffer in InclusiveMetricsEvaluator.parseBounds for variant bounds

The variant bounds fix is required for manifest-based file skipping on variant columns. The ExpressionUtil and PathUtil changes allow variant extract terms to be described and unbound for EXPLAIN and sanitization.

Testing:

unit tests

Related Issues:

Variant Data Type Support #10392

deniskuzZ · 2026-02-23T14:41:42Z

      return ((BoundReference<?>) term).name();
+    } else if (term instanceof UnboundExtract) {
+      UnboundExtract<?> unboundExtract = (UnboundExtract<?>) term;
+      return "extract("


deniskuzZ · 2026-02-23T14:45:53Z

  private static VariantObject parseBounds(ByteBuffer buffer) {
-    return Variant.from(buffer).value().asObject();
+    // Explicitly use little-endian encoding for reading buffer
+    ByteBuffer littleEndian = buffer.duplicate().order(ByteOrder.LITTLE_ENDIAN);


don't we expect ByteBuffer for variant already with little-endian encoding?

You are right that variant bounds are already stored as little-endian. However, ByteBuffer.wrap() in Java defaults to big-endian byte order, hence a mismatch when reading the manifest bound bytes. ParquetVariantReaders.readBinary has the same issue and also sets .order(ByteOrder.LITTLE_ENDIAN).

but it doesn't allocate new buffer (buffer.duplicate()), should we?

I think we need to create a new buffer so that the order(ByteOrder.LITTLE_ENDIAN) call does not change the byte order of the buffer that is passed in. ParquetVariantReaders.readBinary creates a fresh buffer.
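The point about duplicate() can be illustrated with a minimal standalone sketch. It uses only java.nio and is not the Iceberg code; the class name ByteOrderSketch and the helper readLittleEndianInt are made up for illustration. It shows that a wrapped buffer reads as big-endian by default, and that setting the order on a duplicate leaves the caller's buffer untouched.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteOrderSketch {
  /** Hypothetical helper: reads 4 bytes as a little-endian int without mutating the caller's buffer. */
  static int readLittleEndianInt(ByteBuffer buffer) {
    // duplicate() shares the bytes but has its own position, limit, and byte order
    ByteBuffer littleEndian = buffer.duplicate().order(ByteOrder.LITTLE_ENDIAN);
    return littleEndian.getInt(littleEndian.position());
  }

  public static void main(String[] args) {
    // the int value 1 laid out in little-endian byte order
    ByteBuffer wrapped = ByteBuffer.wrap(new byte[] {0x01, 0x00, 0x00, 0x00});

    System.out.println(wrapped.order());              // BIG_ENDIAN, the Java default
    System.out.println(wrapped.getInt(0));            // 16777216, misread as big-endian
    System.out.println(readLittleEndianInt(wrapped)); // 1
    System.out.println(wrapped.order());              // still BIG_ENDIAN
  }
}
```

Calling order(ByteOrder.LITTLE_ENDIAN) directly on the argument would give the same read result, but it would also change the byte order seen by every other user of that buffer instance.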

In my opinion, we should store the variant buffer in lowerBounds/upperBounds using the proper encoding, rather than applying the encoding later in parseBounds.

At the same time, I can see a similar pattern used in a project:

ByteBuffer tmp = buffer.duplicate(); tmp.order(ByteOrder.LITTLE_ENDIAN);

deniskuzZ · 2026-02-23T14:55:17Z

+      new String[][] {
+        new String[] {"$", "$"},
+        new String[] {"$['a']", "$.a"},
+        new String[] {"$['a']['b']", "$.a.b"},


does it cover structs? shredded variant inside struct or shredded struct field in variant?

Yes, $['a']['b'] → $.a.b and $['a']['b']['c'] → $.a.b.c cover paths to nested variant object fields, which correspond to shredded struct fields within a variant. The variant extract path (e.g. $['address']['city']) is independent of where the variant column itself lives in the Iceberg schema — field navigation to the variant column is handled separately by BoundReference.

deniskuzZ

+1

huaxingao · 2026-02-25T05:19:24Z

cc @aihuaxu

nssalian

Thanks for the PR. Left a couple of comments.

nssalian · 2026-03-30T20:22:24Z

+        "Invalid normalized path: %s",
+        normalizedPath);
+    List<String> fields = Lists.newArrayList();
+    Matcher matcher = Pattern.compile("\\['([^']*)'\\]").matcher(normalizedPath);


Could you add this Pattern to a static field at the top like other Patterns?

nssalian · 2026-03-30T20:35:08Z

+    if (fields.isEmpty()) {
+      return ROOT;
+    }
+    return ROOT + "." + String.join(".", fields);


Can a field name ever contain a dot? If so, toDotNotation("$['user.name']") breaks the round-trip. PathUtil.parse() would split it into two fields. If that's possible, might be simpler to have unbind() pass the normalized path directly instead of converting back to dot notation.

Good callout. A JSON field name can contain a dot, and you are right that toDotNotation("$['user.name']") breaks the round trip, because a dot-style path cannot represent "user.name" as a single segment.
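The round-trip problem can be reproduced in a few lines. This is a sketch only: toDotNotation here is rebuilt from the diff fragments quoted in this thread (the bracket regex and the join with "."), and parseDotPath is a hypothetical stand-in for a dot-style parser, not the actual PathUtil.parse.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotRoundTrip {
  // compiled once as a static field rather than on every call
  private static final Pattern BRACKET_FIELD = Pattern.compile("\\['([^']*)'\\]");

  /** Converts $['a']['b'] to $.a.b by joining the bracketed names with dots. */
  static String toDotNotation(String normalizedPath) {
    List<String> fields = new ArrayList<>();
    Matcher matcher = BRACKET_FIELD.matcher(normalizedPath);
    while (matcher.find()) {
      fields.add(matcher.group(1));
    }
    return fields.isEmpty() ? "$" : "$." + String.join(".", fields);
  }

  /** Hypothetical dot-style parser: drops the "$." prefix and splits on '.'. */
  static List<String> parseDotPath(String dotPath) {
    if (dotPath.equals("$")) {
      return List.of();
    }
    return Arrays.asList(dotPath.substring(2).split("\\."));
  }

  public static void main(String[] args) {
    String dot = toDotNotation("$['user.name']");
    System.out.println(dot);               // $.user.name
    System.out.println(parseDotPath(dot)); // [user, name]: one field became two
  }
}
```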

Digging into the code more, Iceberg only supports dot-style paths for variant (https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/PathUtil.java#L55-L56); internally it converts them to bracket style (#12835). I think this is a pretty strong limitation, as it does not support a dot in a field name. Spark supports bracket, dot, and mixed-style paths for variant, so $.employee['user.name'] works. I think we should support those styles as well. I can add that to this PR, as the current change is small.

Added support for bracket-style paths, so we can support $['user.name']. Got rid of toDotNotation completely, as suggested.
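To make the three path styles concrete, here is a rough sketch of normalizing dot, bracket, and mixed paths to the bracket form. It is not the PathUtil implementation from this PR: the class PathNormalizer and its methods are hypothetical, and array indexes and escape sequences are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

public class PathNormalizer {
  /** Extracts field names from a dot, bracket, or mixed-style path (no indexes, no escapes). */
  static List<String> parseNames(String path) {
    if (path == null || !path.startsWith("$")) {
      throw new IllegalArgumentException("Path must start with $: " + path);
    }
    List<String> names = new ArrayList<>();
    int pos = 1;
    while (pos < path.length()) {
      if (path.startsWith("['", pos)) {
        // bracket segment: everything up to the closing '] is the name, dots included
        int end = path.indexOf("']", pos + 2);
        if (end < 0) {
          throw new IllegalArgumentException("Unclosed bracket: " + path);
        }
        names.add(path.substring(pos + 2, end));
        pos = end + 2;
      } else if (path.charAt(pos) == '.') {
        // dot segment: the name runs until the next '.' or '['
        int end = pos + 1;
        while (end < path.length() && path.charAt(end) != '.' && path.charAt(end) != '[') {
          end++;
        }
        if (end == pos + 1) {
          throw new IllegalArgumentException("Empty field name: " + path);
        }
        names.add(path.substring(pos + 1, end));
        pos = end;
      } else {
        throw new IllegalArgumentException("Unexpected character at " + pos + ": " + path);
      }
    }
    return names;
  }

  /** Rewrites any supported style to the bracket form, e.g. $.a.b becomes $['a']['b']. */
  static String normalize(String path) {
    StringBuilder sb = new StringBuilder("$");
    for (String name : parseNames(path)) {
      sb.append("['").append(name).append("']");
    }
    return sb.toString();
  }
}
```

With this sketch, normalize("$.employee['user.name']") yields $['employee']['user.name'], so the dotted field name survives as a single segment, while a path without the root, such as event_id, is rejected.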

nssalian

A couple of comments but overall this is directionally good. @huaxingao @aihuaxu @steveloughran PTAL

steveloughran

commented,
production code: comments
tests: code changes.

steveloughran · 2026-04-14T20:08:10Z

  private static VariantObject parseBounds(ByteBuffer buffer) {
-    return Variant.from(buffer).value().asObject();
+    // Explicitly use little-endian encoding for reading buffer
+    ByteBuffer littleEndian = buffer.duplicate().order(ByteOrder.LITTLE_ENDIAN);


I've just discovered today that Arm64 defaults to being little endian too, so even if the java default is from the SPARC era, little endian is what we need for performance on x86 and raspberry pis (oh, and macbooks, AWS Graviton parts...)

steveloughran · 2026-04-14T20:10:28Z

+   * One step in a variant JSONPath: an object member name or a zero-based array index (RFC 9535
+   * {@code [n]} selector).
+   */
+  sealed interface PathSegment permits PathSegment.Name, PathSegment.Index {


I'm going to have to play with these Java 17 features; reminds me a lot of Standard ML etc. We don't get the full switching on them until Java 21, do we?
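For context on the question: sealed types were finalized in Java 17, while pattern matching for switch (which gives exhaustive switching over a sealed hierarchy) was a preview feature until it was finalized in Java 21. A minimal sketch that compiles on Java 17 follows. The interface declaration mirrors the diff above, but modelling Name and Index as records, the wrapper class SegmentSketch, and the render method are assumptions made for illustration.

```java
public class SegmentSketch {
  sealed interface PathSegment permits PathSegment.Name, PathSegment.Index {
    // assumption: records are used here for brevity; the PR may define these differently
    record Name(String name) implements PathSegment {}

    record Index(int index) implements PathSegment {}
  }

  /** Renders one segment in bracket form using instanceof patterns, available since Java 16. */
  static String render(PathSegment segment) {
    if (segment instanceof PathSegment.Name n) {
      return "['" + n.name() + "']";
    } else if (segment instanceof PathSegment.Index i) {
      return "[" + i.index() + "]";
    }
    // On Java 17 the compiler does not check exhaustiveness of an if/else chain,
    // so a fallback is still needed; a Java 21 pattern switch would not need one.
    throw new IllegalStateException("Unexpected segment: " + segment);
  }
}
```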

steveloughran · 2026-04-14T20:27:25Z

@@ -378,15 +450,13 @@ public void testExtractExpressionBindingPaths(String path) {
        null,
        "",
        "event_id", // missing root


So you've cut some valid paths here. Where do they get tested, and why the move? Do they return a different type now?

steveloughran · 2026-04-14T20:29:23Z

+  /** Letters that follow {@code \} for control-character escapes in RFC 9535 quoted segments. */
+  private static final String RFC9535_SIMPLE_ESCAPE_LETTERS = "btnfr";
+
+  private static final String RFC9535_SIMPLE_ESCAPE_CHARS = "\b\t\n\f\r";


had to look \f up. never seen it or used it. But if it's in the spec, it needs coverage

IMHO the JSON spec is double-edged: it is both developer-friendly and developer-unfriendly.
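The two constants in the diff pair up by position: the letter at index i in the first string maps to the control character at index i in the second, so \f is the form feed character (U+000C). A small sketch of that lookup follows; the class SimpleEscapes and the method unescapeLetter are hypothetical names, and the constants are copies of the values shown in the diff.

```java
public class SimpleEscapes {
  // same values as the constants in the diff, paired by index
  private static final String SIMPLE_ESCAPE_LETTERS = "btnfr";
  private static final String SIMPLE_ESCAPE_CHARS = "\b\t\n\f\r";

  /** Maps the letter after a backslash to its control character, e.g. 'f' to form feed. */
  static char unescapeLetter(char letter) {
    int idx = SIMPLE_ESCAPE_LETTERS.indexOf(letter);
    if (idx < 0) {
      throw new IllegalArgumentException("Not a simple escape letter: " + letter);
    }
    return SIMPLE_ESCAPE_CHARS.charAt(idx);
  }

  public static void main(String[] args) {
    System.out.println((int) unescapeLetter('f')); // 12, i.e. U+000C form feed
    System.out.println((int) unescapeLetter('b')); // 8, i.e. U+0008 backspace
  }
}
```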

- Add PathUtil.toDotNotation for BoundExtract unbind
- Handle UnboundExtract/BoundExtract in ExpressionUtil.describe and unbind
- Use little-endian ByteBuffer in InclusiveMetricsEvaluator.parseBounds for variant bounds

Address PR feedback to support dot in path.
- Support dot, bracket, and mixed-style JSONPath. Use bracket style for field names with a dot or other special characters.
- Parse paths into Name/Index segments; normalize to $['a'][0]['b'] form
- Keep extract() path canonicalization in UnboundExtract; unbind preserves normalized path
- New unit tests

Address PR review comments:
- BoundExtract: Avoid double normalization of the path string.
- PathUtil: Map array index parse failures to IllegalArgumentException, clean up code, and improve documentation.
- Tests: Add new helper to avoid code duplication; add new test cases for invalid paths.

github-actions Bot added API core labels Feb 20, 2026

qlong mentioned this pull request Feb 20, 2026

Spark: Support variant_get predicate pushdown for file skipping #15385

Open

deniskuzZ reviewed Feb 23, 2026

deniskuzZ approved these changes Feb 24, 2026

nssalian reviewed Mar 30, 2026

qlong force-pushed the extract-term-expression-util-support branch 3 times, most recently from 5009a11 to fea1f66 Compare April 2, 2026 21:11

qlong requested a review from nssalian April 2, 2026 23:31

nssalian reviewed Apr 14, 2026

Comment thread api/src/main/java/org/apache/iceberg/expressions/PathUtil.java Outdated

Comment thread api/src/main/java/org/apache/iceberg/expressions/UnboundExtract.java

steveloughran suggested changes Apr 14, 2026

qlong force-pushed the extract-term-expression-util-support branch 2 times, most recently from a1473ca to 4196ad3 Compare April 23, 2026 19:44

steveloughran mentioned this pull request May 1, 2026

Core, Spark: Performant queries over (shredded) Variant data #16172

Open

3 tasks

qlong added 3 commits May 15, 2026 11:33

Api: Support variant extract and fix manifest bounds byte order

98dd398

qlong force-pushed the extract-term-expression-util-support branch from 4196ad3 to ce30f20 Compare May 15, 2026 16:38
