Skip to content

Api: Support variant extract and fix manifest bounds byte order#15384

Open
qlong wants to merge 3 commits into
apache:mainfrom
qlong:extract-term-expression-util-support
Open

Api: Support variant extract and fix manifest bounds byte order#15384
qlong wants to merge 3 commits into
apache:mainfrom
qlong:extract-term-expression-util-support

Conversation

@qlong
Copy link
Copy Markdown

@qlong qlong commented Feb 20, 2026

Changes:

  • Add support for bracket style and mixed style of jsonpath, like $.user.name, or $['user']['name]. This change is needed to support dot or special characters in field names
  • Handle UnboundExtract/BoundExtract in ExpressionUtil.describe and unbind
  • Variant bounds fix: Use little-endian ByteBuffer in InclusiveMetricsEvaluator.parseBounds for variant bounds

Variant bound fix is required manifest-based file skipping for variant columns. The
ExpressionUtil and PathUtil changes allow variant extract terms to be
described and unbound for EXPLAIN and sanitization.

Testing:

  • unit tests

Related Issues:

  1. Variant Data Type Support #10392

return ((BoundReference<?>) term).name();
} else if (term instanceof UnboundExtract) {
UnboundExtract<?> unboundExtract = (UnboundExtract<?>) term;
return "extract("
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

private static VariantObject parseBounds(ByteBuffer buffer) {
return Variant.from(buffer).value().asObject();
// Explicitly use little-endian encoding for reading buffer
ByteBuffer littleEndian = buffer.duplicate().order(ByteOrder.LITTLE_ENDIAN);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we expect ByteBuffer for variant already with little-endian encoding?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right that variant bounds are already stored as little-endian. However, ByteBuffer.wrap() in Java defaults to big-endian byte order, hence a mismatch when reading the manifest bound bytes. ParquetVariantReaders.readBinary has the same issue and also sets .order(ByteOrder.LITTLE_ENDIAN).

Copy link
Copy Markdown
Member

@deniskuzZ deniskuzZ Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it doesn't allocate new buffer (buffer.duplicate()), should we?

Copy link
Copy Markdown
Author

@qlong qlong Feb 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to create a new buffer so the the order(ByteOrder.LITTLE_ENDIAN) call does not change the encoding of the buffer that is passed in. ParquetVariantReaders.readBinary creates a fresh buffer

Copy link
Copy Markdown
Member

@deniskuzZ deniskuzZ Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, we should store the variant buffer in lowerBounds/upperBounds using the proper encoding, rather than applying the encoding later in parseBounds.

At the same time i can see similar pattern used in a project

    ByteBuffer tmp = buffer.duplicate();
    tmp.order(ByteOrder.LITTLE_ENDIAN);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just discovered today that Arm64 defaults to being little endian too, so even if the java default is from the SPARC era, little endian is what we need for performance on x86 and raspberry pis (oh, and macbooks, AWS Graviton parts...)

new String[][] {
new String[] {"$", "$"},
new String[] {"$['a']", "$.a"},
new String[] {"$['a']['b']", "$.a.b"},
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it cover structs? shredded variant inside struct or shredded struct field in variant?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, $['a']['b'] → $.a.b and $['a']['b']['c'] → $.a.b.c cover paths to nested variant object fields, which correspond to shredded struct fields within a variant. The variant extract path (e.g. $['address']['city']) is independent of where the variant column itself lives in the Iceberg schema — field navigation to the variant column is handled separately by BoundReference.

Copy link
Copy Markdown
Member

@deniskuzZ deniskuzZ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@huaxingao
Copy link
Copy Markdown
Contributor

cc @aihuaxu

Copy link
Copy Markdown
Contributor

@nssalian nssalian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. Left a couple of comments.

"Invalid normalized path: %s",
normalizedPath);
List<String> fields = Lists.newArrayList();
Matcher matcher = Pattern.compile("\\['([^']*)'\\]").matcher(normalizedPath);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add this Pattern to a static field at the top like other Patterns?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

if (fields.isEmpty()) {
return ROOT;
}
return ROOT + "." + String.join(".", fields);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can a field name ever contain a dot? If so, toDotNotation("$['user.name']") breaks the round-trip. PathUtil.parse() would split it into two fields. If that's possible, might be simpler to have unbind() pass the normalized path directly instead of converting back to dot notation.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call out. json field name can have dot, and you are right toDotNotation("$['user.name']") breaks the round trip, as dot style path itself cannot represent "user.name" as a single segment.

Digging code more, iceberg only supports dot style path for variant (https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/PathUtil.java#L55-L56), internally it converts to bracket style (#12835). I think this is pretty strong limitation as it would not support dot in field name. Spark supports bracket, dot, and mixed styple path for variant, so $.employee['user.name'] works. I think we should support those styles as well. I can add that to this PR as the current change is small.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added support of bracket style path, so we can support $['user.name']. Got rid of toDotNotation completely as suggested .

@qlong qlong force-pushed the extract-term-expression-util-support branch 3 times, most recently from 5009a11 to fea1f66 Compare April 2, 2026 21:11
@qlong qlong requested a review from nssalian April 2, 2026 23:31
Copy link
Copy Markdown
Contributor

@nssalian nssalian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of comments but overall this is directionally good. @huaxingao @aihuaxu @steveloughran PTAL

Comment thread api/src/main/java/org/apache/iceberg/expressions/PathUtil.java Outdated
Copy link
Copy Markdown
Contributor

@steveloughran steveloughran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commented,
production code: comments
tests: code changes.

private static VariantObject parseBounds(ByteBuffer buffer) {
return Variant.from(buffer).value().asObject();
// Explicitly use little-endian encoding for reading buffer
ByteBuffer littleEndian = buffer.duplicate().order(ByteOrder.LITTLE_ENDIAN);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just discovered today that Arm64 defaults to being little endian too, so even if the java default is from the SPARC era, little endian is what we need for performance on x86 and raspberry pis (oh, and macbooks, AWS Graviton parts...)

* One step in a variant JSONPath: an object member name or a zero-based array index (RFC 9535
* {@code [n]} selector).
*/
sealed interface PathSegment permits PathSegment.Name, PathSegment.Index {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to have to play with these java17 features; reminds me a lot of standard ML etc. We don't get the full switching on them until Java21, do we?

Comment thread api/src/main/java/org/apache/iceberg/expressions/PathUtil.java Outdated
Comment thread api/src/main/java/org/apache/iceberg/expressions/PathUtil.java Outdated
Comment thread api/src/main/java/org/apache/iceberg/expressions/PathUtil.java Outdated
Comment thread api/src/test/java/org/apache/iceberg/expressions/TestPathUtil.java
Comment thread api/src/test/java/org/apache/iceberg/expressions/TestPathUtil.java
@@ -378,15 +450,13 @@ public void testExtractExpressionBindingPaths(String path) {
null,
"",
"event_id", // missing root
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so you've cut some valid paths here. where do they get tested and why the move? is they return a different type now?

/** Letters that follow {@code \} for control-character escapes in RFC 9535 quoted segments. */
private static final String RFC9535_SIMPLE_ESCAPE_LETTERS = "btnfr";

private static final String RFC9535_SIMPLE_ESCAPE_CHARS = "\b\t\n\f\r";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

had to look \f up. never seen it or used it. But if it's in the spec, it needs coverage

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO Json spec is double edged, it is both developer friendly and unfriendly

@qlong qlong force-pushed the extract-term-expression-util-support branch 2 times, most recently from a1473ca to 4196ad3 Compare April 23, 2026 19:44
qlong added 3 commits May 15, 2026 11:33
- Add PathUtil.toDotNotation for BoundExtract unbind
- Handle UnboundExtract/BoundExtract in ExpressionUtil.describe
  and unbind
- Use little-endian ByteBuffer in InclusiveMetricsEvaluator.parseBounds
  for variant bounds
Adddress PR feeback to support dot in path.
- Support dot, bracket and mixed style jsonpath. Use bracket for field
names with dot, or other special characters.
- Parse paths into Name/Index segments; normalize to $['a'][0]['b'] form
- Keep extract() path canonicalization in UnboundExtract; unbind
preserves normalized path
- New unit tests
Address PR review comments:
- BoundExtract: Avoid double normalization of the path string.
- PathUtil: map array index parse failures to IllegalArgumentException,
  clean up code and improve documentation.
- Tests: Add new helper to avoid code duplication, new test cases
  for invalid paths.
@qlong qlong force-pushed the extract-term-expression-util-support branch from 4196ad3 to ce30f20 Compare May 15, 2026 16:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants