[Iceberg v3] Geometry and Geography types #27893
Conversation
bf401e7 to b2ccfe0
This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack.
b2ccfe0 to 7dfe940
c5c66d5 to 29af654
```java
int type = ewkb.getInt(1);
if (bigEndian) {
    type = Integer.reverseBytes(type);
}
if ((type & EWKB_SRID_FLAG) == 0) {
    return 0;
}
int srid = ewkb.getInt(5);
```
can we do a single getLong instead?
It turned out more complex.
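For context, the single-read version would look roughly like the sketch below. The complication is that the type word and the SRID are two independent 4-byte ints, so a big-endian payload still needs per-int byte reversal; a single Long.reverseBytes would also swap the two fields. This is a hypothetical illustration (assuming airlift Slice's little-endian fixed-width reads), not code from the PR:

```java
// Hypothetical sketch, not the PR's code: read the type word and SRID in one getLong.
private static int extractSridWithSingleRead(Slice ewkb)
{
    if (ewkb.length() < 9) {
        return 0;
    }
    boolean bigEndian = ewkb.getByte(0) == 0;
    long words = ewkb.getLong(1); // bytes 1-4 = geometry type, bytes 5-8 = SRID
    int type = (int) words;       // low 4 bytes (Slice reads little-endian)
    int srid = (int) (words >>> 32);
    if (bigEndian) {
        // Each int must be reversed individually; Long.reverseBytes would also swap the fields
        type = Integer.reverseBytes(type);
        srid = Integer.reverseBytes(srid);
    }
    return (type & EWKB_SRID_FLAG) == 0 ? 0 : srid;
}
```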
@coderabbitai review

✅ Actions performed: review triggered.
📝 Walkthrough

Adds EWKB/SRID utilities to JtsGeometrySerde (OGC_CRS84_SRID, EWKB flag, extractSrid, ewkbToWkb, crsToSrid, wkbToEwkb) with endianness-aware SRID read/write and CRS parsing. Integrates geospatial handling into the Iceberg connector: treats GEOMETRY/GEOGRAPHY as binary for file I/O, rewrites types to VARBINARY for writers/readers, strips SRIDs on write, injects/validates SRIDs on read (recursing into arrays/maps/rows), disables predicate pushdown for geospatial domains, and excludes geospatial columns from NDV stats. Adds tests and POM dependencies for geospatial modules and Parquet metadata/annotation round-trips.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergFileWriterFactory.java (1)
172-178: ⚠️ Potential issue | 🟠 Major

Recursively rewrite geospatial types; the current conversion misses nested shapes.

toFileSchema(...) / toFileTrinoType(...) only handle top-level geospatial columns. Nested geospatial types (e.g., array(geometry), row(... geography ...)) are left unchanged, so writer schemas/types can still contain unsupported geospatial primitives.

Proposed fix

```diff
@@
-        List<String> fileColumnNames = icebergSchema.columns().stream()
+        Schema fileSchema = toFileSchema(icebergSchema);
+        List<String> fileColumnNames = fileSchema.columns().stream()
                 .map(Types.NestedField::name)
                 .collect(toImmutableList());
-        List<Type> fileColumnTypes = icebergSchema.columns().stream()
+        List<Type> fileColumnTypes = fileSchema.columns().stream()
                 .map(column -> toTrinoType(column.type(), typeManager))
-                .map(IcebergFileWriterFactory::toFileTrinoType)
                 .collect(toImmutableList());
@@
-                convert(toFileSchema(icebergSchema), "table"),
+                convert(fileSchema, "table"),
@@
-        List<Types.NestedField> columnFields = icebergSchema.columns();
+        Schema fileSchema = toFileSchema(icebergSchema);
+        List<Types.NestedField> columnFields = fileSchema.columns();
@@
                 .map(Types.NestedField::type)
                 .map(type -> toTrinoType(type, typeManager))
-                .map(IcebergFileWriterFactory::toFileTrinoType)
                 .collect(toImmutableList());
@@
-                toOrcType(toFileSchema(icebergSchema)),
+                toOrcType(fileSchema),
@@
-        List<Type> columnTypes = icebergSchema.columns().stream()
+        Schema fileSchema = toFileSchema(icebergSchema);
+        List<Type> columnTypes = fileSchema.columns().stream()
                 .map(column -> toTrinoType(column.type(), typeManager))
-                .map(IcebergFileWriterFactory::toFileTrinoType)
                 .collect(toImmutableList());
@@
-                toFileSchema(icebergSchema),
+                fileSchema,
                 columnTypes,
                 compressionCodec);
@@
-    private static Schema toFileSchema(Schema icebergSchema)
-    {
-        List<Types.NestedField> columns = icebergSchema.columns().stream()
-                .map(IcebergFileWriterFactory::toFileType)
-                .collect(toImmutableList());
-        return new Schema(columns);
-    }
+    private static Schema toFileSchema(Schema icebergSchema)
+    {
+        return new Schema(icebergSchema.columns().stream()
+                .map(IcebergFileWriterFactory::toFileType)
+                .collect(toImmutableList()));
+    }
@@
     private static Types.NestedField toFileType(Types.NestedField field)
     {
-        org.apache.iceberg.types.Type type = field.type();
-        if (type.typeId() == TypeID.GEOMETRY || type.typeId() == TypeID.GEOGRAPHY) {
-            // Replace geometry/geography with binary for file writing
-            return Types.NestedField.of(field.fieldId(), field.isOptional(), field.name(), Types.BinaryType.get(), field.doc());
-        }
-        return field;
-    }
-
-    /**
-     * Convert Trino type for file writing - geometry/geography become varbinary.
-     */
-    private static Type toFileTrinoType(Type type)
-    {
-        if (type instanceof GeometryType || type instanceof SphericalGeographyType) {
-            return VarbinaryType.VARBINARY;
-        }
-        return type;
+        org.apache.iceberg.types.Type rewritten = toFileType(field.type());
+        if (rewritten == field.type()) {
+            return field;
+        }
+        return Types.NestedField.of(field.fieldId(), field.isOptional(), field.name(), rewritten, field.doc());
+    }
+
+    private static org.apache.iceberg.types.Type toFileType(org.apache.iceberg.types.Type type)
+    {
+        return switch (type.typeId()) {
+            case GEOMETRY, GEOGRAPHY -> Types.BinaryType.get();
+            case STRUCT -> Types.StructType.of(type.asStructType().fields().stream()
+                    .map(IcebergFileWriterFactory::toFileType)
+                    .collect(toImmutableList()));
+            case LIST -> {
+                Types.ListType listType = type.asListType();
+                org.apache.iceberg.types.Type elementType = toFileType(listType.elementType());
+                yield listType.isElementRequired()
+                        ? Types.ListType.ofRequired(listType.elementId(), elementType)
+                        : Types.ListType.ofOptional(listType.elementId(), elementType);
+            }
+            case MAP -> {
+                Types.MapType mapType = type.asMapType();
+                org.apache.iceberg.types.Type keyType = toFileType(mapType.keyType());
+                org.apache.iceberg.types.Type valueType = toFileType(mapType.valueType());
+                yield mapType.isValueRequired()
+                        ? Types.MapType.ofRequired(mapType.keyId(), mapType.valueId(), keyType, valueType)
+                        : Types.MapType.ofOptional(mapType.keyId(), mapType.valueId(), keyType, valueType);
+            }
+            default -> type;
+        };
     }
```

Also applies to: 233-237, 312-315, 332-359
🧹 Nitpick comments (2)
plugin/trino-geospatial/src/test/java/io/trino/plugin/geospatial/TestSphericalGeoFunctions.java (1)
93-93: Consider reversing assertion order for clearer failure messages.

AssertJ convention is assertThat(actual).isEqualTo(expected). The current order places the expected value in assertThat, which would produce a confusing failure message like "expected: [actual] but was: [expected]".

💡 Suggested fix
```diff
-            assertThat(wktList.get(i)).isEqualTo(SPHERICAL_GEOGRAPHY.getObjectValue(block, i));
+            assertThat(SPHERICAL_GEOGRAPHY.getObjectValue(block, i)).isEqualTo(wktList.get(i));
```

plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergV3.java (1)
1780-1790: Assert the geography payload, not just row presence.

This test still passes if the new sphericalgeography read/write path corrupts the payload, because it never inspects geog. Please round-trip a deterministic value instead of only checking that one row exists.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSink.java`:
- Around line 358-374: transformGeometryColumnsForWrite currently only processes
top-level geometry channels and therefore skips SRID validation and EWKB→WKB
conversion for nested geospatial values; update it to recurse into complex
blocks (RowBlock, ArrayBlock, MapBlock) so nested geometry blocks are visited
and passed to transformGeometryBlockForWrite (or alternately add an early
validation that rejects nested geometry types). Specifically, modify
transformGeometryColumnsForWrite to detect complex Block instances and walk
their children recursively, applying transformGeometryBlockForWrite when a child
block corresponds to a geometry type (preserving position counts and offsets),
and ensure SRID lookup uses the same channel/type mapping used now so nested
geometries get validated and converted.
In
`@plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSourceProvider.java`:
- Around line 1694-1710: getSridInjectionTransform currently only handles when
the projected column itself is a GeometryType, so selecting structured types
(ROW/ARRAY/MAP) that contain geometry skips SRID injection; modify
getSridInjectionTransform to inspect nested schema/type structures (RowType,
ArrayType, MapType) for GeometryType leaves by recursing the tableSchema
field.type and/or IcebergColumnHandle.getType and build a Block->Block transform
that traverses structured Blocks and calls injectSridIntoBlock for each nested
geometry leaf (preserve positions and offsets for arrays/maps and fields for
rows), or if recursion is undesirable, explicitly detect any nested GeometryType
and throw/return an error disallowing nested geospatial columns; use the
existing symbols getSridInjectionTransform, injectSridIntoBlock,
IcebergColumnHandle, tableSchema.findField and Types.{RowType, ListType,
MapType}/Types.GeometryType to locate and implement the recursion or validation.
- Around line 932-942: getFileReadType currently only converts top-level
GeometryType/SphericalGeographyType to VARBINARY, but for Avro you must recurse
into complex types so nested geospatial leaves are also read as VARBINARY:
update getFileReadType(Type columnType) to detect complex types (RowType,
ArrayType, MapType) and return a new type with the same structure where each
child/field/element/key/value type is replaced by calling getFileReadType
recursively (preserve field names and nullability for RowType, preserve map
key/value semantics for MapType, and element type for ArrayType); keep the
existing simple-case behavior for GeometryType and SphericalGeographyType and
leave other scalar types unchanged so behavior matches getOrcReadType.
- Around line 1715-1754: injectSridIntoBlock currently returns the original
block whenever it isn't a VariableWidthBlock, which allows dictionary-, RLE- or
lazy-backed geometry blocks to bypass SRID injection; update injectSridIntoBlock
to unwrap and handle those wrappers: if block is a DictionaryBlock, call
injectSridIntoBlock on dictionaryBlock.getDictionary() and then return a new
DictionaryBlock that reuses the original ids and positionCount but wraps the
transformed dictionary; if block is a RunLengthEncodedBlock, call
injectSridIntoBlock on runLengthEncodedBlock.getValue() and return a new
RunLengthEncodedBlock with the transformed value and the original positionCount;
if block is a LazyBlock, force/load the underlying block (lazyBlock.getBlock())
and recurse similarly; for any other block types continue to return the original
block. Ensure you still perform the existing VariableWidthBlock SRID
transformation (keeping function name injectSridIntoBlock and reusing the
existing VariableWidthBlock handling) and preserve nulls, offsets and original
wrapper semantics when reconstructing wrapper blocks.
29af654 to 09bef94
✅ Actions performed: review triggered.
Actionable comments posted: 1
🧹 Nitpick comments (2)
plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergV3.java (1)
1684-1715: Exercise the custom-SRID data path, not just registration.

This test stops at metadata checks, so it never hits the EPSG:3857 read/write logic added in IcebergPageSourceProvider and IcebergPageSink. Adding one row and asserting ST_SRID(geom) = 3857 would cover the new behavior end to end.

Suggested test extension

```diff
 assertThat(query("DESCRIBE " + registered))
         .matches("VALUES (VARCHAR 'id', VARCHAR 'integer', VARCHAR '', VARCHAR ''), " +
                 "(VARCHAR 'geom', VARCHAR 'Geometry', VARCHAR '', VARCHAR '')");
+assertUpdate("INSERT INTO " + registered + " VALUES (1, ST_SetSRID(ST_Point(1, 2), 3857))", 1);
+assertThat(query("SELECT ST_SRID(geom) FROM " + registered))
+        .matches("VALUES 3857");
+assertUpdate("DROP TABLE " + registered);
```

lib/trino-geospatial-toolkit/src/main/java/io/trino/geospatial/serde/JtsGeometrySerde.java (1)
275-295: Consider defensive check for already-present SRID flag.

If the input slice is EWKB (already has SRID flag set), the method would produce corrupted output by copying the old SRID as part of the geometry data. While the Javadoc clearly states the input should be "WKB", a defensive check would prevent silent data corruption from accidental misuse.
🛡️ Optional defensive check
```diff
 public static Slice wkbToEwkb(Slice wkb, int srid)
 {
     checkArgument(wkb.length() >= 5, "WKB too short");
     boolean bigEndian = wkb.getByte(0) == 0;
     int type = wkb.getInt(1);
     if (bigEndian) {
         type = Integer.reverseBytes(type);
     }
+    checkArgument((type & EWKB_SRID_FLAG) == 0, "Input already has SRID flag set (expected WKB, got EWKB)");
     // Add SRID flag
     int newType = type | EWKB_SRID_FLAG;
```
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TypeConverter.java`:
- Around line 119-121: In the GEOMETRY branch of TypeConverter, validate the
geometry CRS before returning GEOMETRY: extract the CRS string from the Iceberg
geometry(crs) type, call crsToSrid(crs) to validate/parse it, and if crsToSrid
indicates an unsupported format or invalid EPSG, reject the type
deterministically (e.g., throw a TrinoException with NOT_SUPPORTED) instead of
silently returning GEOMETRY; only return GEOMETRY when crsToSrid succeeds.
a419dd1 to 2950790
Can we separate the commits for geometry and geography?
@raunaqmorarka Why? Seems like a bunch of work for not much win. This is not a big PR.

I can take a look.
I took a look, and this is not as simple as adding a hard-coded logical annotation. The geometry/geography parameters would need to be carried through the Parquet schema annotations, and as far as I can tell that would require Parquet/Iceberg library support updates on our side to propagate those parameters end-to-end. So I think this is better handled as follow-up work in the Parquet schema conversion stack rather than folded into this PR.
We usually make separate commits for independent pieces of functionality. It helps with review, nicer git history, etc. Unless these functionalities cannot exist separately, I think it makes sense for the commits to be separate. Hopefully claude/codex can do this without too much fuss.
I'm mainly worried about cross-engine compatibility. I see that Databricks/Spark doesn't support it yet, but Snowflake does: https://docs.snowflake.com/en/en/user-guide/tables-iceberg-data-types#label-tables-iceberg-v3-data-types
2950790 to bc1fc0c
Rebased to solve conflicts.
b59d0c6 to c5e5819
Implement reading and writing of geometry columns in Iceberg tables using the Iceberg v3 geometry type specification.
c5e5819 to 2964e92
I added a commit that does this.
2964e92 to 52e283b
@coderabbitai review

✅ Actions performed: review triggered.

@coderabbitai full review

✅ Actions performed: full review triggered.
Actionable comments posted: 3
🧹 Nitpick comments (2)
lib/trino-parquet/src/test/java/io/trino/parquet/TestParquetMetadataConverter.java (1)
182-189: ⚡ Quick win: Strengthen geography round-trip coverage with an explicit CRS case.
Current assertions validate defaults; adding an explicit CRS scenario would directly guard the new metadata-preservation path.
Proposed test addition
```diff
 @Test
 void testGeographyLogicalTypeRoundTrip()
 {
     LogicalType logicalType = convertToLogicalType(geographyType());

     assertThat(logicalType.getGEOGRAPHY().getCrs()).isEqualTo("OGC:CRS84");
     assertThat(logicalType.getGEOGRAPHY().getAlgorithm().name()).isEqualTo("SPHERICAL");
     assertThat(getLogicalTypeAnnotation(logicalType)).isEqualTo(geographyType("OGC:CRS84", DEFAULT_ALGO));
 }
+
+@Test
+void testGeographyLogicalTypeRoundTripWithExplicitCrs()
+{
+    LogicalType logicalType = convertToLogicalType(geographyType("EPSG:4326", DEFAULT_ALGO));
+
+    assertThat(logicalType.getGEOGRAPHY().getCrs()).isEqualTo("EPSG:4326");
+    assertThat(logicalType.getGEOGRAPHY().getAlgorithm().name()).isEqualTo("SPHERICAL");
+    assertThat(getLogicalTypeAnnotation(logicalType)).isEqualTo(geographyType("EPSG:4326", DEFAULT_ALGO));
+}
```

plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestTypeConverter.java (1)
35-48: ⚡ Quick win: Add geography and write-side geospatial mapping coverage here.

This file currently pins only the geometry read-path. The new geography validation and the toIcebergType* geometry/geography mappings are still untested, so a regression in those connector-visible branches would slip through unnoticed.
```java
boolean bigEndian = ewkb.getByte(0) == 0;
int type = ewkb.getInt(1);
if (bigEndian) {
```
Validate EWKB/WKB byte-order marker before parsing.
Right now any marker other than 0 is treated as little-endian. Malformed inputs should fail fast instead of being parsed with a guessed endianness.
🛠️ Proposed fix
```diff
 public static int extractSrid(Slice ewkb)
 {
     if (ewkb.length() < 9) {
         return 0;
     }
-    boolean bigEndian = ewkb.getByte(0) == 0;
+    byte byteOrder = ewkb.getByte(0);
+    checkArgument(byteOrder == 0 || byteOrder == 1, "invalid WKB endianness: %s", byteOrder);
+    boolean bigEndian = byteOrder == 0;
     int type = ewkb.getInt(1);
     if (bigEndian) {
         type = Integer.reverseBytes(type);
     }
@@
 public static Slice ewkbToWkb(Slice ewkb)
 {
     if (ewkb.length() < 9) {
         return ewkb;
     }
-    boolean bigEndian = ewkb.getByte(0) == 0;
+    byte byteOrder = ewkb.getByte(0);
+    checkArgument(byteOrder == 0 || byteOrder == 1, "invalid WKB endianness: %s", byteOrder);
+    boolean bigEndian = byteOrder == 0;
     int type = ewkb.getInt(1);
     if (bigEndian) {
         type = Integer.reverseBytes(type);
     }
@@
 public static Slice wkbToEwkb(Slice wkb, int srid)
 {
     checkArgument(wkb.length() >= 5, "WKB too short");
-    boolean bigEndian = wkb.getByte(0) == 0;
+    byte byteOrder = wkb.getByte(0);
+    checkArgument(byteOrder == 0 || byteOrder == 1, "invalid WKB endianness: %s", byteOrder);
+    boolean bigEndian = byteOrder == 0;
     int type = wkb.getInt(1);
     if (bigEndian) {
         type = Integer.reverseBytes(type);
     }
```

Also applies to: 227-229, 281-282
| assertThat(query("SELECT id, ST_AsText(geom) FROM " + table.getName() + " ORDER BY id")) | ||
| .matches("VALUES (1, VARCHAR 'POINT (1 2)'), (2, VARCHAR 'POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))')"); | ||
|
|
||
| assertThat(query("SELECT ST_SRID(geom) FROM " + table.getName() + " WHERE id = 1")) | ||
| .matches("VALUES 4326"); |
Assert SRID for the polygon row too.
On Line 1745, the test only validates ST_SRID for the point row. A regression in the polygon path would still pass.
♻️ Proposed fix
assertThat(query("SELECT id, ST_AsText(geom) FROM " + table.getName() + " ORDER BY id"))
.matches("VALUES (1, VARCHAR 'POINT (1 2)'), (2, VARCHAR 'POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))')");
assertThat(query("SELECT ST_SRID(geom) FROM " + table.getName() + " WHERE id = 1"))
.matches("VALUES 4326");
+ assertThat(query("SELECT ST_SRID(geom) FROM " + table.getName() + " WHERE id = 2"))
+ .matches("VALUES 4326");🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergV3.java`
around lines 1745 - 1749, The test in TestIcebergV3.java currently asserts
ST_SRID only for id = 1 (the point) but not for id = 2 (the polygon); add a
parallel assertion that runs query("SELECT ST_SRID(geom) FROM " +
table.getName() + " WHERE id = 2") and assert it matches the expected SRID
(4326) so that both geometry rows' SRIDs are validated.
| assertUpdate("INSERT INTO " + table.getName() + " VALUES (1, to_spherical_geography(ST_Point(-122.4194, 37.7749)))", 1); | ||
|
|
||
| assertThat(query("SELECT id, ST_AsText(to_geometry(geog)) FROM " + table.getName())) | ||
| .matches("VALUES (1, VARCHAR 'POINT (-122.4194 37.7749)')"); |
Verify the geography SRID, not just the text representation.
On Line 1812, to_geometry(geog) is only checked with ST_AsText, so a bad SRID would still be missed. Add an explicit ST_SRID assertion to cover the full round-trip.
♻️ Proposed fix
assertUpdate("INSERT INTO " + table.getName() + " VALUES (1, to_spherical_geography(ST_Point(-122.4194, 37.7749)))", 1);
assertThat(query("SELECT id, ST_AsText(to_geometry(geog)) FROM " + table.getName()))
.matches("VALUES (1, VARCHAR 'POINT (-122.4194 37.7749)')");
+ assertThat(query("SELECT ST_SRID(to_geometry(geog)) FROM " + table.getName()))
+ .matches("VALUES 4326");📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| assertUpdate("INSERT INTO " + table.getName() + " VALUES (1, to_spherical_geography(ST_Point(-122.4194, 37.7749)))", 1); | |
| assertThat(query("SELECT id, ST_AsText(to_geometry(geog)) FROM " + table.getName())) | |
| .matches("VALUES (1, VARCHAR 'POINT (-122.4194 37.7749)')"); | |
| assertUpdate("INSERT INTO " + table.getName() + " VALUES (1, to_spherical_geography(ST_Point(-122.4194, 37.7749)))", 1); | |
| assertThat(query("SELECT id, ST_AsText(to_geometry(geog)) FROM " + table.getName())) | |
| .matches("VALUES (1, VARCHAR 'POINT (-122.4194 37.7749)')"); | |
| assertThat(query("SELECT ST_SRID(to_geometry(geog)) FROM " + table.getName())) | |
| .matches("VALUES 4326"); |
```diff
     }
     case STRING -> ImmutableList.of(new OrcType(OrcTypeKind.STRING, ImmutableList.of(), ImmutableList.of(), Optional.empty(), Optional.empty(), Optional.empty(), attributes));
-    case FIXED, BINARY -> ImmutableList.of(new OrcType(OrcTypeKind.BINARY, ImmutableList.of(), ImmutableList.of(), Optional.empty(), Optional.empty(), Optional.empty(), attributes));
+    case FIXED, BINARY, GEOMETRY, GEOGRAPHY -> ImmutableList.of(new OrcType(OrcTypeKind.BINARY, ImmutableList.of(), ImmutableList.of(), Optional.empty(), Optional.empty(), Optional.empty(), attributes));
```
Should this be doing .put(ICEBERG_BINARY_TYPE, "GEOMETRY/GEOGRAPHY")?
```java
    }
}

private static class SortOrderVisitor
```
The Parquet spec requires these new types to have no min/max stats; I think we have to return SortOrder.UNKNOWN for Geometry/GeographyLogicalTypeAnnotation.
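A minimal sketch of that inside the sort-order visitor, assuming parquet-java exposes visit overloads for the new annotations (the annotation class names and overloads below are an assumption, not verified against the library):

```java
// Sketch only: the annotation class names and visitor overloads are assumed
@Override
public Optional<SortOrder> visit(LogicalTypeAnnotation.GeometryLogicalTypeAnnotation geometryType)
{
    // Parquet spec: geometry/geography columns must not carry min/max statistics
    return Optional.of(SortOrder.UNKNOWN);
}

@Override
public Optional<SortOrder> visit(LogicalTypeAnnotation.GeographyLogicalTypeAnnotation geographyType)
{
    return Optional.of(SortOrder.UNKNOWN);
}
```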
```java
return switch (type.typeId()) {
    case GEOMETRY -> true;
    case LIST -> containsGeometry(type.asListType().elementType());
    case MAP -> containsGeometry(type.asMapType().keyType()) || containsGeometry(type.asMapType().valueType());
    case STRUCT -> type.asStructType().fields().stream().anyMatch(field -> containsGeometry(field.type()));
    default -> false;
```
Note
This is Claude-generated analysis — please verify the reproducer and reasoning before acting on it.
EWKB bytes can leak into GEOGRAPHY columns on write, producing files that aren't valid ISO WKB.
This switch only matches case GEOMETRY, and appendTransformedGeometryValueForWrite only handles Types.GeometryType, so transformGeometryColumnsForWrite skips GEOGRAPHY columns entirely and the block bytes flow to the writer unchanged.
That matters because SPHERICAL_GEOGRAPHY block bytes are produced by JtsGeometrySerde.serialize (WKBWriter(2, true)), which sets the EWKB SRID flag (0x20000000) when the JTS Geometry's SRID is non-zero. The read path here always seeds a non-zero SRID into geometry blocks (appendTransformedGeometryValueForRead → wkbToEwkb(slice, getGeometrySrid(...)), defaulting to OGC_CRS84_SRID = 4326), so it's easy to construct a SPHERICAL_GEOGRAPHY value carrying that SRID and have it written verbatim.
Reproducer using only standard Trino SQL:
```sql
CREATE TABLE iceberg.s.geom_t (g geometry) WITH (format_version = 3);
INSERT INTO iceberg.s.geom_t VALUES (ST_GeometryFromText('POINT(1 2)'));

CREATE TABLE iceberg.s.geog_t (g geography) WITH (format_version = 3);
INSERT INTO iceberg.s.geog_t
SELECT to_spherical_geography(g) FROM iceberg.s.geom_t;
```
The Parquet column backing geog_t then contains bytes like 01 01 00 00 20 ... — a Point with the EWKB SRID flag set and an SRID int prefix. A spec-conforming reader hits an unknown WKB type code on the very first record.
Suggested fix: extend both containsGeometry(org.apache.iceberg.types.Type) switches (here and in IcebergPageSourceProvider) to also match GEOGRAPHY, and add a Types.GeographyType branch to appendTransformedGeometryValueForWrite that strips the SRID via ewkbToWkb, symmetric to the GEOMETRY branch. The read-side counterpart should mirror the GEOMETRY path so a registered geography(EPSG:NNN) table round-trips its SRID into the block. A test along the lines of the reproducer (assert the on-disk bytes don't have the EWKB flag bit, or read with parquet-mr / pyiceberg and confirm WKB parses) would catch this and prevent regressions.
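A sketch of the symmetric write-side branch, mirroring the GEOMETRY branch described above (the surrounding method and exact signatures are assumptions based on the names in this review, not the PR's code):

```java
// Sketch only: appendTransformedGeometryValueForWrite gaining a geography branch
if (icebergType instanceof Types.GeographyType) {
    Slice value = trinoType.getSlice(block, position);
    // Strip any EWKB SRID prefix so the file carries plain ISO WKB, as the spec requires
    VARBINARY.writeSlice(blockBuilder, JtsGeometrySerde.ewkbToWkb(value));
    return;
}
```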
```java
// Validate SRID: fail only if both source and target are non-zero and different
if (sourceSrid != 0 && targetSrid != 0 && sourceSrid != targetSrid) {
    throw new TrinoException(INVALID_FUNCTION_ARGUMENT,
            "SRID mismatch: cannot write geometry with SRID %d into column %d with SRID %d".formatted(sourceSrid, columnIndex, targetSrid));
```
Note
This is Claude-generated analysis — verified by running a test against this PR branch (results below).
The validation at line 417 treats sourceSrid == 0 as a wildcard, so a value with no embedded SRID can be inserted into a column declared with any non-default CRS without error. On read, getGeometrySrid injects the column's CRS-derived SRID, so the value the user gets back is silently re-tagged with a CRS the original data didn't have. Easy to do accidentally with ST_Point(...) / ST_GeometryFromText(...) (both produce SRID=0) against a column registered with e.g. EPSG:3857.
Verified against this PR's branch (52e283b) with this test added to TestIcebergV3 — it passes, demonstrating the bug:
```java
@Test
void testWriteSridZeroIntoCustomCrsColumn_demonstratesSilentReassignment()
{
    String hadoopTableName = "hadoop_srid_zero_into_custom_" + randomNameSuffix();
    Path hadoopTableLocation = dataDirectory.resolve(hadoopTableName);
    Schema schema = new Schema(
            Types.NestedField.optional(1, "id", Types.IntegerType.get()),
            Types.NestedField.optional(2, "geom", Types.GeometryType.of("EPSG:3857")));
    new HadoopTables(new Configuration(false)).create(
            schema,
            PartitionSpec.unpartitioned(),
            SortOrder.unsorted(),
            ImmutableMap.of("format-version", "3", "write.format.default", "PARQUET"),
            hadoopTableLocation.toString());
    String registered = "registered_srid_zero_into_custom_" + randomNameSuffix();
    assertUpdate("CALL system.register_table(CURRENT_SCHEMA, '%s', '%s')"
            .formatted(registered, hadoopTableLocation));
    // ST_Point produces SRID=0; column is EPSG:3857
    assertUpdate("INSERT INTO " + registered + " VALUES (1, ST_Point(-122.4, 37.7))", 1);
    // The point comes back tagged with SRID=3857 even though we wrote a SRID=0 value.
    assertThat(query("SELECT ST_SRID(geom) FROM " + registered))
            .matches("VALUES 3857");
    assertUpdate("DROP TABLE " + registered);
}
```
The user wrote WGS84-looking lon/lat values; downstream tooling now reads them as web-mercator metres. There's no error, no warning, and no way for a user to spot it in the SQL output.
Suggested fix: tighten the predicate to fail unless sourceSrid == targetSrid, treating SRID=0 as a real (mismatching) value rather than a wildcard. If users still want to opt into "trust me, my SRID-0 data is in the column's CRS", make that an explicit cast / function call (e.g. require the user to call ST_SetSRID(g, 3857) themselves), not the implicit default.
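Concretely, the tightened predicate could read as follows (a sketch against the snippet quoted above):

```java
// Sketch: treat SRID 0 as a real, mismatching value rather than a wildcard
if (sourceSrid != targetSrid) {
    throw new TrinoException(INVALID_FUNCTION_ARGUMENT,
            "SRID mismatch: cannot write geometry with SRID %d into column %d with SRID %d".formatted(sourceSrid, columnIndex, targetSrid));
}
```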
```java
if (icebergType instanceof Types.GeometryType geometryType) {
```
Note
This is Claude-generated analysis — verified by running a test against this PR branch (results below).
The bytes flowing through this geometry-write branch have already lost Z/M dimensions before they reach the sink. JtsGeometrySerde.serialize is hard-coded to new WKBWriter(2, true) (line 80 of lib/trino-geospatial-toolkit/.../JtsGeometrySerde.java), so any Trino-side serialization path — ST_GeometryFromText('POINT Z (...)'), ST_Point plus a future Z constructor, casts, anything that round-trips through this type's writeObject — silently truncates to 2D. By the time trinoType.getSlice(block, position) runs above, Z is gone.
Verified against this PR's branch (52e283b) with this test added to TestIcebergV3 — it fails, demonstrating the loss:
```java
@Test
void testWriteThreeDimensionalGeometry_demonstratesZDrop()
{
    try (TestTable table = newTrinoTable("test_z_drop_",
            "(geom geometry) WITH (format = 'PARQUET', format_version = 3)")) {
        assertUpdate("INSERT INTO " + table.getName() + " VALUES (ST_GeometryFromText('POINT Z (1 2 3)'))", 1);
        // Expected POINT Z (1 2 3); actual POINT (1 2) — Z silently dropped at insert time.
        assertThat(query("SELECT ST_AsText(geom) FROM " + table.getName()))
                .matches("VALUES VARCHAR 'POINT Z (1 2 3)'");
    }
}
```
Test output:
```
[Rows for query [SELECT ST_AsText(geom) FROM test_z_drop_...]]
Expecting actual:
  (POINT (1 2))
to contain exactly in any order:
  [(POINT Z (1 2 3))]
```
Read fidelity is fine — bytes from XYZ files written by other engines pass through wkbToEwkb unchanged (it only flips the type-code bit), so external XYZ data round-trips faithfully through SELECT g. The loss is strictly on writes Trino itself produces.
Suggested fix: derive the WKBWriter's outputDimension from the source Geometry's coordinate dimension (geometry.getCoordinate().getDimension() or check Z/M presence) instead of hard-coding 2. Iceberg v3 advertises Z/M support, so silently demoting users' 3D data to 2D is a real surprise. At minimum, raising an error when the sink receives a WKB type code outside 1..7 would convert silent loss into a clear failure.
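A minimal sketch of that derivation, assuming JTS 1.16+ where Coordinate exposes getZ():

```java
// Sketch: derive the WKB output dimension from the geometry instead of hard-coding 2
Coordinate coordinate = geometry.getCoordinate();
int outputDimension = (coordinate != null && !Double.isNaN(coordinate.getZ())) ? 3 : 2;
WKBWriter writer = new WKBWriter(outputDimension, true); // true = include SRID (EWKB)
```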
Description
This PR is based on #27881, and only the last commit is relevant to Iceberg.
Geometry

- geometry(crs) → Trino Geometry
- The CRS is mapped to an SRID (EPSG:3857 → SRID 3857, OGC:CRS84 → SRID 4326); a mapping sketch follows below

Geography

- geography(crs, algorithm) → Trino SphericalGeography
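For illustration, the CRS-to-SRID mapping listed above could look like the sketch below (crsToSrid is named in the review walkthrough; this body is an assumption, not the PR's code):

```java
// Illustrative sketch of the CRS-to-SRID mapping; default and error handling assumed
public static int crsToSrid(String crs)
{
    if (crs == null || crs.isEmpty() || "OGC:CRS84".equals(crs)) {
        return OGC_CRS84_SRID; // 4326, the default CRS
    }
    if (crs.startsWith("EPSG:")) {
        return Integer.parseInt(crs.substring("EPSG:".length()));
    }
    throw new IllegalArgumentException("Unsupported CRS: " + crs);
}
```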
Release notes

(X) Release notes are required, with the following suggested text: