Api: Support variant extract and fix manifest bounds byte order#15384
Api: Support variant extract and fix manifest bounds byte order#15384qlong wants to merge 3 commits into
Conversation
| return ((BoundReference<?>) term).name(); | ||
| } else if (term instanceof UnboundExtract) { | ||
| UnboundExtract<?> unboundExtract = (UnboundExtract<?>) term; | ||
| return "extract(" |
| private static VariantObject parseBounds(ByteBuffer buffer) { | ||
| return Variant.from(buffer).value().asObject(); | ||
| // Explicitly use little-endian encoding for reading buffer | ||
| ByteBuffer littleEndian = buffer.duplicate().order(ByteOrder.LITTLE_ENDIAN); |
There was a problem hiding this comment.
don't we expect ByteBuffer for variant already with little-endian encoding?
There was a problem hiding this comment.
You are right that variant bounds are already stored as little-endian. However, ByteBuffer.wrap() in Java defaults to big-endian byte order, hence a mismatch when reading the manifest bound bytes. ParquetVariantReaders.readBinary has the same issue and also sets .order(ByteOrder.LITTLE_ENDIAN).
There was a problem hiding this comment.
but it doesn't allocate new buffer (buffer.duplicate()), should we?
There was a problem hiding this comment.
I think we need to create a new buffer so the the order(ByteOrder.LITTLE_ENDIAN) call does not change the encoding of the buffer that is passed in. ParquetVariantReaders.readBinary creates a fresh buffer
There was a problem hiding this comment.
In my opinion, we should store the variant buffer in lowerBounds/upperBounds using the proper encoding, rather than applying the encoding later in parseBounds.
At the same time i can see similar pattern used in a project
ByteBuffer tmp = buffer.duplicate();
tmp.order(ByteOrder.LITTLE_ENDIAN);
There was a problem hiding this comment.
I've just discovered today that Arm64 defaults to being little endian too, so even if the java default is from the SPARC era, little endian is what we need for performance on x86 and raspberry pis (oh, and macbooks, AWS Graviton parts...)
| new String[][] { | ||
| new String[] {"$", "$"}, | ||
| new String[] {"$['a']", "$.a"}, | ||
| new String[] {"$['a']['b']", "$.a.b"}, |
There was a problem hiding this comment.
does it cover structs? shredded variant inside struct or shredded struct field in variant?
There was a problem hiding this comment.
Yes,
|
cc @aihuaxu |
nssalian
left a comment
There was a problem hiding this comment.
Thanks for the PR. Left a couple of comments.
| "Invalid normalized path: %s", | ||
| normalizedPath); | ||
| List<String> fields = Lists.newArrayList(); | ||
| Matcher matcher = Pattern.compile("\\['([^']*)'\\]").matcher(normalizedPath); |
There was a problem hiding this comment.
Could you add this Pattern to a static field at the top like other Patterns?
| if (fields.isEmpty()) { | ||
| return ROOT; | ||
| } | ||
| return ROOT + "." + String.join(".", fields); |
There was a problem hiding this comment.
Can a field name ever contain a dot? If so, toDotNotation("$['user.name']") breaks the round-trip. PathUtil.parse() would split it into two fields. If that's possible, might be simpler to have unbind() pass the normalized path directly instead of converting back to dot notation.
There was a problem hiding this comment.
Good call out. json field name can have dot, and you are right toDotNotation("$['user.name']") breaks the round trip, as dot style path itself cannot represent "user.name" as a single segment.
Digging code more, iceberg only supports dot style path for variant (https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/expressions/PathUtil.java#L55-L56), internally it converts to bracket style (#12835). I think this is pretty strong limitation as it would not support dot in field name. Spark supports bracket, dot, and mixed styple path for variant, so $.employee['user.name'] works. I think we should support those styles as well. I can add that to this PR as the current change is small.
There was a problem hiding this comment.
added support of bracket style path, so we can support $['user.name']. Got rid of toDotNotation completely as suggested .
5009a11 to
fea1f66
Compare
nssalian
left a comment
There was a problem hiding this comment.
A couple of comments but overall this is directionally good. @huaxingao @aihuaxu @steveloughran PTAL
steveloughran
left a comment
There was a problem hiding this comment.
commented,
production code: comments
tests: code changes.
| private static VariantObject parseBounds(ByteBuffer buffer) { | ||
| return Variant.from(buffer).value().asObject(); | ||
| // Explicitly use little-endian encoding for reading buffer | ||
| ByteBuffer littleEndian = buffer.duplicate().order(ByteOrder.LITTLE_ENDIAN); |
There was a problem hiding this comment.
I've just discovered today that Arm64 defaults to being little endian too, so even if the java default is from the SPARC era, little endian is what we need for performance on x86 and raspberry pis (oh, and macbooks, AWS Graviton parts...)
| * One step in a variant JSONPath: an object member name or a zero-based array index (RFC 9535 | ||
| * {@code [n]} selector). | ||
| */ | ||
| sealed interface PathSegment permits PathSegment.Name, PathSegment.Index { |
There was a problem hiding this comment.
I'm going to have to play with these java17 features; reminds me a lot of standard ML etc. We don't get the full switching on them until Java21, do we?
| @@ -378,15 +450,13 @@ public void testExtractExpressionBindingPaths(String path) { | |||
| null, | |||
| "", | |||
| "event_id", // missing root | |||
There was a problem hiding this comment.
so you've cut some valid paths here. where do they get tested and why the move? is they return a different type now?
| /** Letters that follow {@code \} for control-character escapes in RFC 9535 quoted segments. */ | ||
| private static final String RFC9535_SIMPLE_ESCAPE_LETTERS = "btnfr"; | ||
|
|
||
| private static final String RFC9535_SIMPLE_ESCAPE_CHARS = "\b\t\n\f\r"; |
There was a problem hiding this comment.
had to look \f up. never seen it or used it. But if it's in the spec, it needs coverage
There was a problem hiding this comment.
IMHO Json spec is double edged, it is both developer friendly and unfriendly
a1473ca to
4196ad3
Compare
- Add PathUtil.toDotNotation for BoundExtract unbind - Handle UnboundExtract/BoundExtract in ExpressionUtil.describe and unbind - Use little-endian ByteBuffer in InclusiveMetricsEvaluator.parseBounds for variant bounds
Adddress PR feeback to support dot in path. - Support dot, bracket and mixed style jsonpath. Use bracket for field names with dot, or other special characters. - Parse paths into Name/Index segments; normalize to $['a'][0]['b'] form - Keep extract() path canonicalization in UnboundExtract; unbind preserves normalized path - New unit tests
Address PR review comments: - BoundExtract: Avoid double normalization of the path string. - PathUtil: map array index parse failures to IllegalArgumentException, clean up code and improve documentation. - Tests: Add new helper to avoid code duplication, new test cases for invalid paths.
4196ad3 to
ce30f20
Compare
Changes:
Variant bound fix is required manifest-based file skipping for variant columns. The
ExpressionUtil and PathUtil changes allow variant extract terms to be
described and unbound for EXPLAIN and sanitization.
Testing:
Related Issues: