
[Iceberg v3] Geometry and Geography types #27893

Open
dain wants to merge 2 commits into master from user/dain/iceberg-geo

Conversation

@dain
Member

@dain dain commented Jan 11, 2026

Description

This PR is based on #27881, and only the last commit is relevant to Iceberg.

Geometry

geometry(crs) → Trino Geometry

  • CRS is converted to SRID (e.g., EPSG:3857 → SRID 3857, OGC:CRS84 → SRID 4326)
  • Default CRS (null) is treated as OGC:CRS84 (SRID 4326)
  • On read: SRID is injected into the WKB to produce EWKB
  • On write: SRID is validated against column CRS, then stripped from EWKB to produce WKB (SRID 0 is accepted for any CRS)
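The read and write conversions above can be pictured with a minimal sketch. This is illustrative only and assumes little-endian input; the PR's actual `JtsGeometrySerde` utilities also handle big-endian data and validate lengths:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EwkbSketch
{
    // EWKB marks the presence of an SRID with this bit in the geometry type word
    private static final int EWKB_SRID_FLAG = 0x20000000;

    // Inject an SRID into plain WKB, producing EWKB (little-endian input assumed)
    static byte[] wkbToEwkb(byte[] wkb, int srid)
    {
        ByteBuffer out = ByteBuffer.allocate(wkb.length + 4).order(ByteOrder.LITTLE_ENDIAN);
        out.put(wkb[0]); // byte-order marker is preserved
        int type = ByteBuffer.wrap(wkb, 1, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
        out.putInt(type | EWKB_SRID_FLAG); // set the SRID flag on the type word
        out.putInt(srid);                  // SRID sits right after the type word
        out.put(wkb, 5, wkb.length - 5);   // remaining geometry payload is unchanged
        return out.array();
    }

    // Read the SRID back out of EWKB; plain WKB yields 0 (little-endian assumed)
    static int extractSrid(byte[] ewkb)
    {
        int type = ByteBuffer.wrap(ewkb, 1, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
        if ((type & EWKB_SRID_FLAG) == 0) {
            return 0;
        }
        return ByteBuffer.wrap(ewkb, 5, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    public static void main(String[] args)
    {
        // Little-endian WKB for POINT(0 0): marker, type word, x, y
        ByteBuffer wkb = ByteBuffer.allocate(21).order(ByteOrder.LITTLE_ENDIAN);
        wkb.put((byte) 1).putInt(1).putDouble(0).putDouble(0);
        byte[] ewkb = wkbToEwkb(wkb.array(), 4326);
        System.out.println(extractSrid(wkb.array())); // 0: plain WKB carries no SRID
        System.out.println(extractSrid(ewkb));        // 4326
    }
}
```

Stripping on write is the reverse: clear the flag and drop the four SRID bytes, after checking the SRID against the column's CRS.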

Geography

geography(crs, algorithm) → Trino SphericalGeography

  • CRS must be WGS84 (OGC:CRS84, EPSG:4326, or null for default)
  • Algorithm must be SPHERICAL (or null for default)
  • Non-WGS84 CRS is not supported
  • Non-spherical algorithms (e.g., VINCENTY) are not supported
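The restrictions above amount to a simple predicate. This is a sketch for illustration; the PR's actual checks live in the Iceberg type conversion code:

```java
public class GeographySketch
{
    // Sketch of the geography(crs, algorithm) restrictions listed above;
    // null means "use the default" for both parameters
    static boolean isSupportedGeography(String crs, String algorithm)
    {
        boolean wgs84 = crs == null || crs.equals("OGC:CRS84") || crs.equals("EPSG:4326");
        boolean spherical = algorithm == null || algorithm.equalsIgnoreCase("SPHERICAL");
        return wgs84 && spherical;
    }

    public static void main(String[] args)
    {
        System.out.println(isSupportedGeography(null, null));               // true: defaults
        System.out.println(isSupportedGeography("EPSG:4326", "SPHERICAL")); // true
        System.out.println(isSupportedGeography("EPSG:3857", null));        // false: non-WGS84
        System.out.println(isSupportedGeography(null, "VINCENTY"));         // false: non-spherical
    }
}
```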

Release notes

(X) Release notes are required, with the following suggested text:

## Iceberg
* Add support for reading and writing `geometry` and `geography` column types. ({issue}`issuenumber`)

@cla-bot cla-bot Bot added the cla-signed label Jan 11, 2026
@github-actions github-actions Bot added the iceberg (Iceberg connector), hive (Hive connector), and postgresql (PostgreSQL connector) labels Jan 11, 2026
@dain dain force-pushed the user/dain/iceberg-geo branch 2 times, most recently from bf401e7 to b2ccfe0 Compare January 11, 2026 05:07
@dain dain changed the title from "Add geometry and geography type support for Iceberg v3" to "[Iceberg v3] Geometry and Geography types" Jan 11, 2026
@dain dain marked this pull request as ready for review January 11, 2026 07:43
@github-actions

github-actions Bot commented Feb 2, 2026

This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack.

@github-actions github-actions Bot added the stale label Feb 2, 2026
@dain dain added the stale-ignore label (used on PRs that should be ignored by the stale bot so they are not flagged or closed) Feb 17, 2026
@dain dain force-pushed the user/dain/iceberg-geo branch from b2ccfe0 to 7dfe940 Compare March 25, 2026 18:52
@dain dain removed the stale label Mar 25, 2026
@dain dain force-pushed the user/dain/iceberg-geo branch 2 times, most recently from c5c66d5 to 29af654 Compare March 25, 2026 20:23
Comment on lines +203 to +210
int type = ewkb.getInt(1);
if (bigEndian) {
type = Integer.reverseBytes(type);
}
if ((type & EWKB_SRID_FLAG) == 0) {
return 0;
}
int srid = ewkb.getInt(5);
Contributor

can we do a single getLong instead?

Member Author

It turned out to be more complex.

Comment thread plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSink.java Outdated
Comment thread plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSink.java Outdated
Comment thread plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSink.java Outdated
@dain
Member Author

dain commented Mar 25, 2026

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Mar 25, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai Bot commented Mar 25, 2026

📝 Walkthrough

Walkthrough

Adds EWKB/SRID utilities to JtsGeometrySerde (OGC_CRS84_SRID, EWKB flag, extractSrid, ewkbToWkb, crsToSrid, wkbToEwkb) with endianness-aware SRID read/write and CRS parsing. Integrates geospatial handling into the Iceberg connector: treats GEOMETRY/GEOGRAPHY as binary for file I/O, rewrites types to VARBINARY for writers/readers, strips SRIDs on write, injects/validates SRIDs on read (recursing into arrays/maps/rows), disables predicate pushdown for geospatial domains, and excludes geospatial columns from NDV stats. Adds tests and POM dependencies for geospatial modules and Parquet metadata/annotation round-trips.
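The crsToSrid conversion mentioned above can be sketched as follows. This is a simplified illustration, not the actual JtsGeometrySerde implementation, which presumably validates CRS strings more strictly:

```java
public class CrsToSridSketch
{
    // Simplified CRS-to-SRID mapping per the walkthrough: EPSG:<n> maps to
    // SRID n, and the default (null) or OGC:CRS84 CRS maps to SRID 4326
    static int crsToSrid(String crs)
    {
        if (crs == null || crs.equals("OGC:CRS84")) {
            return 4326;
        }
        if (crs.startsWith("EPSG:")) {
            return Integer.parseInt(crs.substring("EPSG:".length()));
        }
        throw new IllegalArgumentException("Unsupported CRS: " + crs);
    }

    public static void main(String[] args)
    {
        System.out.println(crsToSrid("EPSG:3857")); // 3857
        System.out.println(crsToSrid("OGC:CRS84")); // 4326
        System.out.println(crsToSrid(null));        // 4326 (default CRS)
    }
}
```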

✨ Finishing Touches
⚔️ Resolve merge conflicts
  • Resolve merge conflict in branch user/dain/iceberg-geo




@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergFileWriterFactory.java (1)

172-178: ⚠️ Potential issue | 🟠 Major

Recursively rewrite geospatial types; current conversion misses nested shapes.

toFileSchema(...) / toFileTrinoType(...) only handle top-level geospatial columns. Nested geospatial types (e.g., array(geometry), row(... geography ...)) are left unchanged, so writer schemas/types can still contain unsupported geospatial primitives.

Proposed fix
@@
-        List<String> fileColumnNames = icebergSchema.columns().stream()
+        Schema fileSchema = toFileSchema(icebergSchema);
+        List<String> fileColumnNames = fileSchema.columns().stream()
                 .map(Types.NestedField::name)
                 .collect(toImmutableList());
-        List<Type> fileColumnTypes = icebergSchema.columns().stream()
+        List<Type> fileColumnTypes = fileSchema.columns().stream()
                 .map(column -> toTrinoType(column.type(), typeManager))
-                .map(IcebergFileWriterFactory::toFileTrinoType)
                 .collect(toImmutableList());
@@
-                    convert(toFileSchema(icebergSchema), "table"),
+                    convert(fileSchema, "table"),
@@
-            List<Types.NestedField> columnFields = icebergSchema.columns();
+            Schema fileSchema = toFileSchema(icebergSchema);
+            List<Types.NestedField> columnFields = fileSchema.columns();
@@
                     .map(Types.NestedField::type)
                     .map(type -> toTrinoType(type, typeManager))
-                    .map(IcebergFileWriterFactory::toFileTrinoType)
                     .collect(toImmutableList());
@@
-                    toOrcType(toFileSchema(icebergSchema)),
+                    toOrcType(fileSchema),
@@
-        List<Type> columnTypes = icebergSchema.columns().stream()
+        Schema fileSchema = toFileSchema(icebergSchema);
+        List<Type> columnTypes = fileSchema.columns().stream()
                 .map(column -> toTrinoType(column.type(), typeManager))
-                .map(IcebergFileWriterFactory::toFileTrinoType)
                 .collect(toImmutableList());
@@
-                toFileSchema(icebergSchema),
+                fileSchema,
                 columnTypes,
                 compressionCodec);
@@
-    private static Schema toFileSchema(Schema icebergSchema)
-    {
-        List<Types.NestedField> columns = icebergSchema.columns().stream()
-                .map(IcebergFileWriterFactory::toFileType)
-                .collect(toImmutableList());
-        return new Schema(columns);
-    }
+    private static Schema toFileSchema(Schema icebergSchema)
+    {
+        return new Schema(icebergSchema.columns().stream()
+                .map(IcebergFileWriterFactory::toFileType)
+                .collect(toImmutableList()));
+    }
@@
     private static Types.NestedField toFileType(Types.NestedField field)
     {
-        org.apache.iceberg.types.Type type = field.type();
-        if (type.typeId() == TypeID.GEOMETRY || type.typeId() == TypeID.GEOGRAPHY) {
-            // Replace geometry/geography with binary for file writing
-            return Types.NestedField.of(field.fieldId(), field.isOptional(), field.name(), Types.BinaryType.get(), field.doc());
-        }
-        return field;
-    }
-
-    /**
-     * Convert Trino type for file writing - geometry/geography become varbinary.
-     */
-    private static Type toFileTrinoType(Type type)
-    {
-        if (type instanceof GeometryType || type instanceof SphericalGeographyType) {
-            return VarbinaryType.VARBINARY;
-        }
-        return type;
+        org.apache.iceberg.types.Type rewritten = toFileType(field.type());
+        if (rewritten == field.type()) {
+            return field;
+        }
+        return Types.NestedField.of(field.fieldId(), field.isOptional(), field.name(), rewritten, field.doc());
+    }
+
+    private static org.apache.iceberg.types.Type toFileType(org.apache.iceberg.types.Type type)
+    {
+        return switch (type.typeId()) {
+            case GEOMETRY, GEOGRAPHY -> Types.BinaryType.get();
+            case STRUCT -> Types.StructType.of(type.asStructType().fields().stream()
+                    .map(IcebergFileWriterFactory::toFileType)
+                    .collect(toImmutableList()));
+            case LIST -> {
+                Types.ListType listType = type.asListType();
+                org.apache.iceberg.types.Type elementType = toFileType(listType.elementType());
+                yield listType.isElementRequired()
+                        ? Types.ListType.ofRequired(listType.elementId(), elementType)
+                        : Types.ListType.ofOptional(listType.elementId(), elementType);
+            }
+            case MAP -> {
+                Types.MapType mapType = type.asMapType();
+                org.apache.iceberg.types.Type keyType = toFileType(mapType.keyType());
+                org.apache.iceberg.types.Type valueType = toFileType(mapType.valueType());
+                yield mapType.isValueRequired()
+                        ? Types.MapType.ofRequired(mapType.keyId(), mapType.valueId(), keyType, valueType)
+                        : Types.MapType.ofOptional(mapType.keyId(), mapType.valueId(), keyType, valueType);
+            }
+            default -> type;
+        };
     }

Also applies to: 233-237, 312-315, 332-359

🧹 Nitpick comments (2)
plugin/trino-geospatial/src/test/java/io/trino/plugin/geospatial/TestSphericalGeoFunctions.java (1)

93-93: Consider reversing assertion order for clearer failure messages.

AssertJ convention is assertThat(actual).isEqualTo(expected). The current order places the expected value in assertThat, which would produce a confusing failure message like "expected: [actual] but was: [expected]".

💡 Suggested fix
-            assertThat(wktList.get(i)).isEqualTo(SPHERICAL_GEOGRAPHY.getObjectValue(block, i));
+            assertThat(SPHERICAL_GEOGRAPHY.getObjectValue(block, i)).isEqualTo(wktList.get(i));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@plugin/trino-geospatial/src/test/java/io/trino/plugin/geospatial/TestSphericalGeoFunctions.java`
at line 93, The assertion currently uses
assertThat(wktList.get(i)).isEqualTo(SPHERICAL_GEOGRAPHY.getObjectValue(block,
i)), which puts the expected value in the assertThat() slot and yields confusing
failure messages; swap the operands so the actual value from
SPHERICAL_GEOGRAPHY.getObjectValue(block, i) is passed to assertThat(...) and
isEqualTo(...) compares to wktList.get(i) (i.e., use
assertThat(SPHERICAL_GEOGRAPHY.getObjectValue(block,
i)).isEqualTo(wktList.get(i))) to follow AssertJ convention.
plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergV3.java (1)

1780-1790: Assert the geography payload, not just row presence.

This test still passes if the new sphericalgeography read/write path corrupts the payload, because it never inspects geog. Please round-trip a deterministic value instead of only checking that one row exists.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergV3.java`
around lines 1780 - 1790, In testGeographyRoundTrip, change the assertion to
validate the actual geography payload instead of only row presence: insert a
deterministic geometry using to_spherical_geography(ST_Point(...)) (as already
done) and then query the stored column (e.g., SELECT ST_AsText(geog) or SELECT
geog = to_spherical_geography(ST_Point(...))) to assert the round-tripped value
equals the original; update the assertThat(...) call in testGeographyRoundTrip
to compare the retrieved geog text/binary to the expected deterministic value so
corruption would fail the test.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSink.java`:
- Around line 358-374: transformGeometryColumnsForWrite currently only processes
top-level geometry channels and therefore skips SRID validation and EWKB→WKB
conversion for nested geospatial values; update it to recurse into complex
blocks (RowBlock, ArrayBlock, MapBlock) so nested geometry blocks are visited
and passed to transformGeometryBlockForWrite (or alternately add an early
validation that rejects nested geometry types). Specifically, modify
transformGeometryColumnsForWrite to detect complex Block instances and walk
their children recursively, applying transformGeometryBlockForWrite when a child
block corresponds to a geometry type (preserving position counts and offsets),
and ensure SRID lookup uses the same channel/type mapping used now so nested
geometries get validated and converted.

In
`@plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSourceProvider.java`:
- Around line 1694-1710: getSridInjectionTransform currently only handles when
the projected column itself is a GeometryType, so selecting structured types
(ROW/ARRAY/MAP) that contain geometry skips SRID injection; modify
getSridInjectionTransform to inspect nested schema/type structures (RowType,
ArrayType, MapType) for GeometryType leaves by recursing the tableSchema
field.type and/or IcebergColumnHandle.getType and build a Block->Block transform
that traverses structured Blocks and calls injectSridIntoBlock for each nested
geometry leaf (preserve positions and offsets for arrays/maps and fields for
rows), or if recursion is undesirable, explicitly detect any nested GeometryType
and throw/return an error disallowing nested geospatial columns; use the
existing symbols getSridInjectionTransform, injectSridIntoBlock,
IcebergColumnHandle, tableSchema.findField and Types.{RowType, ListType,
MapType}/Types.GeometryType to locate and implement the recursion or validation.
- Around line 932-942: getFileReadType currently only converts top-level
GeometryType/SphericalGeographyType to VARBINARY, but for Avro you must recurse
into complex types so nested geospatial leaves are also read as VARBINARY:
update getFileReadType(Type columnType) to detect complex types (RowType,
ArrayType, MapType) and return a new type with the same structure where each
child/field/element/key/value type is replaced by calling getFileReadType
recursively (preserve field names and nullability for RowType, preserve map
key/value semantics for MapType, and element type for ArrayType); keep the
existing simple-case behavior for GeometryType and SphericalGeographyType and
leave other scalar types unchanged so behavior matches getOrcReadType.
- Around line 1715-1754: injectSridIntoBlock currently returns the original
block whenever it isn't a VariableWidthBlock, which allows dictionary-, RLE- or
lazy-backed geometry blocks to bypass SRID injection; update injectSridIntoBlock
to unwrap and handle those wrappers: if block is a DictionaryBlock, call
injectSridIntoBlock on dictionaryBlock.getDictionary() and then return a new
DictionaryBlock that reuses the original ids and positionCount but wraps the
transformed dictionary; if block is a RunLengthEncodedBlock, call
injectSridIntoBlock on runLengthEncodedBlock.getValue() and return a new
RunLengthEncodedBlock with the transformed value and the original positionCount;
if block is a LazyBlock, force/load the underlying block (lazyBlock.getBlock())
and recurse similarly; for any other block types continue to return the original
block. Ensure you still perform the existing VariableWidthBlock SRID
transformation (keeping function name injectSridIntoBlock and reusing the
existing VariableWidthBlock handling) and preserve nulls, offsets and original
wrapper semantics when reconstructing wrapper blocks.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: cfe86063-c19c-407a-a2f2-c1f96b5aa41b

📥 Commits

Reviewing files that changed from the base of the PR and between 39fca46 and 29af654.

📒 Files selected for processing (14)
  • lib/trino-geospatial-toolkit/src/main/java/io/trino/geospatial/serde/JtsGeometrySerde.java
  • plugin/trino-geospatial/src/test/java/io/trino/plugin/geospatial/TestSphericalGeoFunctions.java
  • plugin/trino-iceberg/pom.xml
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/ExpressionConverter.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergFileWriterFactory.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSink.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSourceProvider.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergTypes.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TypeConverter.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/aggregation/IcebergThetaSketchForStats.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/util/HiveSchemaUtil.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/util/OrcTypeConverter.java
  • plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergV3.java

@dain dain force-pushed the user/dain/iceberg-geo branch from 29af654 to 09bef94 Compare March 25, 2026 23:02
@coderabbitai

coderabbitai Bot commented Mar 26, 2026

✅ Actions performed

Review triggered.



@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergV3.java (1)

1684-1715: Exercise the custom-SRID data path, not just registration.

This test stops at metadata checks, so it never hits the EPSG:3857 read/write logic added in IcebergPageSourceProvider and IcebergPageSink. Adding one row and asserting ST_SRID(geom) = 3857 would cover the new behavior end to end.

Suggested test extension
         assertThat(query("DESCRIBE " + registered))
                 .matches("VALUES (VARCHAR 'id', VARCHAR 'integer', VARCHAR '', VARCHAR ''), " +
                         "(VARCHAR 'geom', VARCHAR 'Geometry', VARCHAR '', VARCHAR '')");
 
+        assertUpdate("INSERT INTO " + registered + " VALUES (1, ST_SetSRID(ST_Point(1, 2), 3857))", 1);
+        assertThat(query("SELECT ST_SRID(geom) FROM " + registered))
+                .matches("VALUES 3857");
+
         assertUpdate("DROP TABLE " + registered);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergV3.java`
around lines 1684 - 1715, Extend TestIcebergV3.testGeometryWithCustomSrid to
exercise actual read/write of the EPSG:3857 geometry: after registering the
table (registered), INSERT one row with a geometry value into the registered
table (so IcebergPageSink is used) and then SELECT the geometry back asserting
ST_SRID(geom) = 3857 (and optionally validating the geometry value) to exercise
the IcebergPageSink/IcebergPageSourceProvider code paths for custom SRID
handling.
lib/trino-geospatial-toolkit/src/main/java/io/trino/geospatial/serde/JtsGeometrySerde.java (1)

275-295: Consider defensive check for already-present SRID flag.

If the input slice is EWKB (already has SRID flag set), the method would produce corrupted output by copying the old SRID as part of the geometry data. While the Javadoc clearly states the input should be "WKB", a defensive check would prevent silent data corruption from accidental misuse.

🛡️ Optional defensive check
     public static Slice wkbToEwkb(Slice wkb, int srid)
     {
         checkArgument(wkb.length() >= 5, "WKB too short");
         boolean bigEndian = wkb.getByte(0) == 0;
         int type = wkb.getInt(1);
         if (bigEndian) {
             type = Integer.reverseBytes(type);
         }
+        checkArgument((type & EWKB_SRID_FLAG) == 0, "Input already has SRID flag set (expected WKB, got EWKB)");
 
         // Add SRID flag
         int newType = type | EWKB_SRID_FLAG;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@lib/trino-geospatial-toolkit/src/main/java/io/trino/geospatial/serde/JtsGeometrySerde.java`
around lines 275 - 295, wkbToEwkb currently blindly injects an SRID flag and
copies bytes, which corrupts input if the incoming Slice already has EWKB SRID
set; add a defensive check at the start of wkbToEwkb by inspecting the parsed
type (the local variable type obtained from wkb.getInt(1) and possibly reversed)
and if (type & EWKB_SRID_FLAG) != 0 either throw an IllegalArgumentException or
simply return the original Slice unchanged (choose consistent behavior with the
codebase), with a clear message mentioning EWKB_SRID_FLAG and the method
JtsGeometrySerde.wkbToEwkb so callers know the input already contains an SRID
and no injection is needed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TypeConverter.java`:
- Around line 119-121: In the GEOMETRY branch of TypeConverter, validate the
geometry CRS before returning GEOMETRY: extract the CRS string from the Iceberg
geometry(crs) type, call crsToSrid(crs) to validate/parse it, and if crsToSrid
indicates an unsupported format or invalid EPSG, reject the type
deterministically (e.g., throw a TrinoException with NOT_SUPPORTED) instead of
silently returning GEOMETRY; only return GEOMETRY when crsToSrid succeeds.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f363ea84-a267-4c8f-b184-cd96420316b7

📥 Commits

Reviewing files that changed from the base of the PR and between 09bef94 and a419dd1.

📒 Files selected for processing (14)
  • lib/trino-geospatial-toolkit/src/main/java/io/trino/geospatial/serde/JtsGeometrySerde.java
  • lib/trino-geospatial-toolkit/src/test/java/io/trino/geospatial/serde/TestGeometrySerialization.java
  • plugin/trino-geospatial/src/test/java/io/trino/plugin/geospatial/TestSphericalGeoFunctions.java
  • plugin/trino-iceberg/pom.xml
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/ExpressionConverter.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergFileWriterFactory.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSink.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSourceProvider.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergTypes.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TypeConverter.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/util/HiveSchemaUtil.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/util/OrcTypeConverter.java
  • plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergV3.java
✅ Files skipped from review due to trivial changes (5)
  • plugin/trino-geospatial/src/test/java/io/trino/plugin/geospatial/TestSphericalGeoFunctions.java
  • plugin/trino-iceberg/pom.xml
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/util/OrcTypeConverter.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergTypes.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java
🚧 Files skipped from review as they are similar to previous changes (2)
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/util/HiveSchemaUtil.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergFileWriterFactory.java

Comment thread plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TypeConverter.java Outdated
@dain dain force-pushed the user/dain/iceberg-geo branch from a419dd1 to 2950790 Compare March 26, 2026 18:43
@raunaqmorarka
Member

Can we separate the commits for geometry and geography?
Now that the parquet spec has added these types https://github.com/apache/parquet-format/blob/master/Geospatial.md#logical-types https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#geometry, can we use those logical annotations?

@dain
Member Author

dain commented Mar 27, 2026

Can we separate the commits for geometry and geography ?

@raunaqmorarka Why? Seems like a bunch of work for not much win. This is not a big PR.

Now that parquet spec has added these types https://github.com/apache/parquet-format/blob/master/Geospatial.md#logical-types https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#geometry, can we use those logical annotations ?

I can take a look.

@dain
Member Author

dain commented Mar 27, 2026

Now that parquet spec has added these types https://github.com/apache/parquet-format/blob/master/Geospatial.md#logical-types https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#geometry, can we use those logical annotations ?

I can take a look.

I took a look, and this is not as simple as adding a hard-coded logical annotation. The geometry/geography parameters would need to be carried through the Parquet schema annotations, and as far as I can tell that would require Parquet/Iceberg library support updates on our side to propagate those parameters end-to-end. So I think this is better handled as follow-up work in the Parquet schema conversion stack rather than folded into this PR.

@raunaqmorarka
Member

Can we separate the commits for geometry and geography ?

@raunaqmorarka Why? Seems like a bunch of work for not much win. This is not a big PR.

We usually make separate commits for independent pieces of functionality. It helps with review, nicer git history etc. Unless these functionalities cannot exist separately, I think it makes sense for the commits to be separate. Hopefully claude/codex can do this without too much fuss.

@raunaqmorarka
Member

Now that parquet spec has added these types https://github.com/apache/parquet-format/blob/master/Geospatial.md#logical-types https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#geometry, can we use those logical annotations ?

I can take a look.

I took a look, and this is not as simple as adding a hard-coded logical annotation. The geometry/geography parameters would need to be carried through the Parquet schema annotations, and as far as I can tell that would require Parquet/Iceberg library support updates on our side to propagate those parameters end-to-end. So I think this is better handled as follow-up work in the Parquet schema conversion stack rather than folded into this PR.

I'm mainly worried about cross-engine compatibility. I see that Databricks/Spark doesn't support it yet, but Snowflake does https://docs.snowflake.com/en/en/user-guide/tables-iceberg-data-types#label-tables-iceberg-v3-data-types
It would be nice if we can do this in a way that aligns with other iceberg/parquet reader/writers.

@wendigo wendigo force-pushed the user/dain/iceberg-geo branch from 2950790 to bc1fc0c Compare March 31, 2026 09:23
@wendigo
Contributor

wendigo commented Mar 31, 2026

Rebased to solve conflicts

Comment thread plugin/trino-iceberg/pom.xml Outdated
Comment thread plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergV3.java Outdated
@wendigo wendigo force-pushed the user/dain/iceberg-geo branch from b59d0c6 to c5e5819 Compare March 31, 2026 18:09
Implement reading and writing of geometry columns in Iceberg tables
using the Iceberg v3 geometry type specification.
@dain dain force-pushed the user/dain/iceberg-geo branch from c5e5819 to 2964e92 Compare April 8, 2026 23:33
@dain
Member Author

dain commented Apr 8, 2026

Now that the Parquet spec has added these types https://github.com/apache/parquet-format/blob/master/Geospatial.md#logical-types https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#geometry, can we use those logical annotations?

I can take a look.

I took a look, and this is not as simple as adding a hard-coded logical annotation. The geometry/geography parameters would need to be carried through the Parquet schema annotations, and as far as I can tell that would require Parquet/Iceberg library support updates on our side to propagate those parameters end-to-end. So I think this is better handled as follow-up work in the Parquet schema conversion stack rather than folded into this PR.

I'm mainly worried about cross-engine compatibility. I see that Databricks/Spark doesn't support it yet, but Snowflake does https://docs.snowflake.com/en/en/user-guide/tables-iceberg-data-types#label-tables-iceberg-v3-data-types It would be nice if we can do this in a way that aligns with other iceberg/parquet reader/writers.

I added a commit that does this.

@dain dain force-pushed the user/dain/iceberg-geo branch from 2964e92 to 52e283b Compare April 9, 2026 01:48
@dain dain requested a review from wendigo April 24, 2026 16:55
@raunaqmorarka
Member

@coderabbitai review

@coderabbitai

coderabbitai Bot commented May 3, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@raunaqmorarka
Member

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented May 3, 2026

✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (2)
lib/trino-parquet/src/test/java/io/trino/parquet/TestParquetMetadataConverter.java (1)

182-189: ⚡ Quick win

Strengthen geography round-trip coverage with an explicit CRS case.

Current assertions validate defaults; adding an explicit CRS scenario would directly guard the new metadata-preservation path.

Proposed test addition
 @Test
 void testGeographyLogicalTypeRoundTrip()
 {
     LogicalType logicalType = convertToLogicalType(geographyType());

     assertThat(logicalType.getGEOGRAPHY().getCrs()).isEqualTo("OGC:CRS84");
     assertThat(logicalType.getGEOGRAPHY().getAlgorithm().name()).isEqualTo("SPHERICAL");
     assertThat(getLogicalTypeAnnotation(logicalType)).isEqualTo(geographyType("OGC:CRS84", DEFAULT_ALGO));
 }
+
+@Test
+void testGeographyLogicalTypeRoundTripWithExplicitCrs()
+{
+    LogicalType logicalType = convertToLogicalType(geographyType("EPSG:4326", DEFAULT_ALGO));
+
+    assertThat(logicalType.getGEOGRAPHY().getCrs()).isEqualTo("EPSG:4326");
+    assertThat(logicalType.getGEOGRAPHY().getAlgorithm().name()).isEqualTo("SPHERICAL");
+    assertThat(getLogicalTypeAnnotation(logicalType)).isEqualTo(geographyType("EPSG:4326", DEFAULT_ALGO));
+}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`lib/trino-parquet/src/test/java/io/trino/parquet/TestParquetMetadataConverter.java`
around lines 182 - 189, Add a second explicit-CRS assertion to
testGeographyLogicalTypeRoundTrip: call convertToLogicalType with
geographyType("EPSG:4326", DEFAULT_ALGO) (or another explicit CRS), then assert
that logicalType.getGEOGRAPHY().getCrs() equals the explicit CRS string, that
getGEOGRAPHY().getAlgorithm().name() matches the algorithm, and that
getLogicalTypeAnnotation(logicalType) equals the original
geographyType("EPSG:4326", DEFAULT_ALGO) to ensure metadata is preserved both
ways.
plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestTypeConverter.java (1)

35-48: ⚡ Quick win

Add geography and write-side geospatial mapping coverage here.

This file currently pins only the geometry read-path. The new geography validation and the toIcebergType* geometry/geography mappings are still untested, so a regression in those connector-visible branches would slip through unnoticed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestTypeConverter.java`
around lines 35 - 48, The tests only cover geometry read-paths; add unit tests
that also validate geography read-paths and the write-side mappings by
exercising toTrinoType for geography Types.GeographyType.of(...) and the inverse
mapping helpers (e.g., toIcebergType and any toIcebergTypeForWrite or similarly
named methods) to ensure both geometry and geography convert correctly on read
and write; update TestTypeConverter to include assertions for geography -> TRINO
type equality (similar to existing GEOMETRY assertions) and assert the
toIcebergType* mappings round-trip or produce expected Iceberg Types and throw
appropriate errors for invalid CRSs, referencing toTrinoType,
Types.GeometryType.of/Types.GeographyType.of, TESTING_TYPE_MANAGER, and the
toIcebergType* functions used in the codebase.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 78ecc971-5e5b-4bb9-af39-8adbcbb9e987

📥 Commits

Reviewing files that changed from the base of the PR and between 5a1c279 and 52e283b.

📒 Files selected for processing (20)
  • lib/trino-geospatial-toolkit/src/main/java/io/trino/geospatial/serde/JtsGeometrySerde.java
  • lib/trino-geospatial-toolkit/src/test/java/io/trino/geospatial/serde/TestGeometrySerialization.java
  • lib/trino-parquet/src/main/java/io/trino/parquet/ParquetMetadataConverter.java
  • lib/trino-parquet/src/test/java/io/trino/parquet/TestParquetMetadataConverter.java
  • plugin/trino-geospatial/src/test/java/io/trino/plugin/geospatial/TestSphericalGeoFunctions.java
  • plugin/trino-iceberg/pom.xml
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/ExpressionConverter.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/GeoSpatialUtils.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergFileWriterFactory.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSink.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergPageSourceProvider.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergTypes.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/TypeConverter.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/util/HiveSchemaUtil.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/util/IcebergParquetSchemaConverter.java
  • plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/util/OrcTypeConverter.java
  • plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergV3.java
  • plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestTypeConverter.java
  • plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/util/TestIcebergParquetSchemaConverter.java

Comment on lines +203 to +205
boolean bigEndian = ewkb.getByte(0) == 0;
int type = ewkb.getInt(1);
if (bigEndian) {

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate EWKB/WKB byte-order marker before parsing.

Right now any marker other than 0 is treated as little-endian. Malformed inputs should fail fast instead of being parsed with a guessed endianness.

🛠️ Proposed fix
     public static int extractSrid(Slice ewkb)
     {
         if (ewkb.length() < 9) {
             return 0;
         }
-        boolean bigEndian = ewkb.getByte(0) == 0;
+        byte byteOrder = ewkb.getByte(0);
+        checkArgument(byteOrder == 0 || byteOrder == 1, "invalid WKB endianness: %s", byteOrder);
+        boolean bigEndian = byteOrder == 0;
         int type = ewkb.getInt(1);
         if (bigEndian) {
             type = Integer.reverseBytes(type);
         }
@@
     public static Slice ewkbToWkb(Slice ewkb)
     {
         if (ewkb.length() < 9) {
             return ewkb;
         }
-        boolean bigEndian = ewkb.getByte(0) == 0;
+        byte byteOrder = ewkb.getByte(0);
+        checkArgument(byteOrder == 0 || byteOrder == 1, "invalid WKB endianness: %s", byteOrder);
+        boolean bigEndian = byteOrder == 0;
         int type = ewkb.getInt(1);
         if (bigEndian) {
             type = Integer.reverseBytes(type);
         }
@@
     public static Slice wkbToEwkb(Slice wkb, int srid)
     {
         checkArgument(wkb.length() >= 5, "WKB too short");
-        boolean bigEndian = wkb.getByte(0) == 0;
+        byte byteOrder = wkb.getByte(0);
+        checkArgument(byteOrder == 0 || byteOrder == 1, "invalid WKB endianness: %s", byteOrder);
+        boolean bigEndian = byteOrder == 0;
         int type = wkb.getInt(1);
         if (bigEndian) {
             type = Integer.reverseBytes(type);
         }

Also applies to: 227-229, 281-282

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`lib/trino-geospatial-toolkit/src/main/java/io/trino/geospatial/serde/JtsGeometrySerde.java`
around lines 203 - 205, The byte-order marker (ewkb.getByte(0)) is being treated
as little-endian for any non-zero value; update JtsGeometrySerde to validate the
marker is either 0 or 1 before parsing and fail fast otherwise. For each
location where you compute boolean bigEndian = ewkb.getByte(0) == 0 (and the
similar occurrences around the referenced lines), add an explicit check that the
marker byte is 0 or 1 and throw an IllegalArgumentException (or appropriate
parsing exception) with a clear message containing the invalid marker value when
it isn’t; then proceed to set bigEndian = (marker == 0). Ensure you apply this
same validation in all three spots (the occurrences around the existing boolean
bigEndian assignments).
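The validation the bot is asking for is easy to demonstrate outside the Trino Slice API. A minimal standalone sketch of the guard (class and method names are mine, not the PR's):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public final class WkbHeader
{
    private WkbHeader() {}

    // Reads the WKB geometry type code, failing fast on an invalid byte-order marker
    // (0 = big-endian/XDR, 1 = little-endian/NDR; anything else is malformed input).
    public static int readTypeCode(byte[] wkb)
    {
        if (wkb.length < 5) {
            throw new IllegalArgumentException("WKB too short: " + wkb.length + " bytes");
        }
        byte byteOrder = wkb[0];
        if (byteOrder != 0 && byteOrder != 1) {
            throw new IllegalArgumentException("invalid WKB endianness: " + byteOrder);
        }
        return ByteBuffer.wrap(wkb, 1, 4)
                .order(byteOrder == 0 ? ByteOrder.BIG_ENDIAN : ByteOrder.LITTLE_ENDIAN)
                .getInt();
    }

    public static void main(String[] args)
    {
        System.out.println(readTypeCode(new byte[] {1, 1, 0, 0, 0})); // little-endian Point -> 1
        System.out.println(readTypeCode(new byte[] {0, 0, 0, 0, 1})); // big-endian Point -> 1
        try {
            readTypeCode(new byte[] {2, 1, 0, 0, 0}); // marker 2 is neither XDR nor NDR
        }
        catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The same check belongs in all three spots the bot flagged (extractSrid, ewkbToWkb, wkbToEwkb), so malformed bytes surface as a clear error instead of being parsed with a guessed endianness.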

Comment on lines +1745 to +1749
assertThat(query("SELECT id, ST_AsText(geom) FROM " + table.getName() + " ORDER BY id"))
.matches("VALUES (1, VARCHAR 'POINT (1 2)'), (2, VARCHAR 'POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))')");

assertThat(query("SELECT ST_SRID(geom) FROM " + table.getName() + " WHERE id = 1"))
.matches("VALUES 4326");

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert SRID for the polygon row too.

On Line 1745, the test only validates ST_SRID for the point row. A regression in the polygon path would still pass.

♻️ Proposed fix
                 assertThat(query("SELECT id, ST_AsText(geom) FROM " + table.getName() + " ORDER BY id"))
                         .matches("VALUES (1, VARCHAR 'POINT (1 2)'), (2, VARCHAR 'POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))')");
 
                 assertThat(query("SELECT ST_SRID(geom) FROM " + table.getName() + " WHERE id = 1"))
                         .matches("VALUES 4326");
+                assertThat(query("SELECT ST_SRID(geom) FROM " + table.getName() + " WHERE id = 2"))
+                        .matches("VALUES 4326");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergV3.java`
around lines 1745 - 1749, The test in TestIcebergV3.java currently asserts
ST_SRID only for id = 1 (the point) but not for id = 2 (the polygon); add a
parallel assertion that runs query("SELECT ST_SRID(geom) FROM " +
table.getName() + " WHERE id = 2") and assert it matches the expected SRID
(4326) so that both geometry rows' SRIDs are validated.

Comment on lines +1810 to +1813
assertUpdate("INSERT INTO " + table.getName() + " VALUES (1, to_spherical_geography(ST_Point(-122.4194, 37.7749)))", 1);

assertThat(query("SELECT id, ST_AsText(to_geometry(geog)) FROM " + table.getName()))
.matches("VALUES (1, VARCHAR 'POINT (-122.4194 37.7749)')");

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Verify the geography SRID, not just the text representation.

On Line 1812, to_geometry(geog) is only checked with ST_AsText, so a bad SRID would still be missed. Add an explicit ST_SRID assertion to cover the full round-trip.

♻️ Proposed fix
                 assertUpdate("INSERT INTO " + table.getName() + " VALUES (1, to_spherical_geography(ST_Point(-122.4194, 37.7749)))", 1);
 
                 assertThat(query("SELECT id, ST_AsText(to_geometry(geog)) FROM " + table.getName()))
                         .matches("VALUES (1, VARCHAR 'POINT (-122.4194 37.7749)')");
+                assertThat(query("SELECT ST_SRID(to_geometry(geog)) FROM " + table.getName()))
+                        .matches("VALUES 4326");
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergV3.java`
around lines 1810 - 1813, The test currently verifies only the WKT of the
geometry after round-tripping (via to_spherical_geography(ST_Point(...)) -> geog
-> to_geometry(geog) -> ST_AsText) which misses incorrect SRID; update the
assertion to also check the SRID by querying ST_SRID(to_geometry(geog)) for the
same row (use the same table reference from table.getName()) and assert it
equals the expected SRID (e.g., 4326) alongside the existing ST_AsText check
(adjust the assertThat/query invocation that wraps to_geometry(geog), ST_AsText,
and add ST_SRID to the returned columns or a separate assertThat on query(...)).

}
case STRING -> ImmutableList.of(new OrcType(OrcTypeKind.STRING, ImmutableList.of(), ImmutableList.of(), Optional.empty(), Optional.empty(), Optional.empty(), attributes));
case FIXED, BINARY -> ImmutableList.of(new OrcType(OrcTypeKind.BINARY, ImmutableList.of(), ImmutableList.of(), Optional.empty(), Optional.empty(), Optional.empty(), attributes));
case FIXED, BINARY, GEOMETRY, GEOGRAPHY -> ImmutableList.of(new OrcType(OrcTypeKind.BINARY, ImmutableList.of(), ImmutableList.of(), Optional.empty(), Optional.empty(), Optional.empty(), attributes));
Member

Should this be doing

.put(ICEBERG_BINARY_TYPE, "GEOMETRY/GEOGRAPHY")

?

}
}

private static class SortOrderVisitor
Member

The Parquet spec requires these new types to have no min/max stats; I think we have to return SortOrder.UNKNOWN for Geometry/GeographyLogicalTypeAnnotation.

Comment on lines +386 to +391
return switch (type.typeId()) {
case GEOMETRY -> true;
case LIST -> containsGeometry(type.asListType().elementType());
case MAP -> containsGeometry(type.asMapType().keyType()) || containsGeometry(type.asMapType().valueType());
case STRUCT -> type.asStructType().fields().stream().anyMatch(field -> containsGeometry(field.type()));
default -> false;
Member

Note

This is Claude-generated analysis — please verify the reproducer and reasoning before acting on it.

EWKB bytes can leak into GEOGRAPHY columns on write, producing files that aren't valid ISO WKB.

This switch only matches case GEOMETRY, and appendTransformedGeometryValueForWrite only handles Types.GeometryType, so transformGeometryColumnsForWrite skips GEOGRAPHY columns entirely and the block bytes flow to the writer unchanged.

That matters because SPHERICAL_GEOGRAPHY block bytes are produced by JtsGeometrySerde.serialize (WKBWriter(2, true)), which sets the EWKB SRID flag (0x20000000) when the JTS Geometry's SRID is non-zero. The read path here always seeds a non-zero SRID into geometry blocks (appendTransformedGeometryValueForRead calls wkbToEwkb(slice, getGeometrySrid(...)), defaulting to OGC_CRS84_SRID = 4326), so it's easy to construct a SPHERICAL_GEOGRAPHY value carrying that SRID and have it written verbatim.

Reproducer using only standard Trino SQL:

CREATE TABLE iceberg.s.geom_t (g geometry) WITH (format_version = 3);
INSERT INTO iceberg.s.geom_t VALUES (ST_GeometryFromText('POINT(1 2)'));

CREATE TABLE iceberg.s.geog_t (g geography) WITH (format_version = 3);
INSERT INTO iceberg.s.geog_t
  SELECT to_spherical_geography(g) FROM iceberg.s.geom_t;

The Parquet column backing geog_t then contains bytes like 01 01 00 00 20 ... — a Point with the EWKB SRID flag set and an SRID int prefix. A spec-conforming reader hits an unknown WKB type code on the very first record.

Suggested fix: extend both containsGeometry(org.apache.iceberg.types.Type) switches (here and in IcebergPageSourceProvider) to also match GEOGRAPHY, and add a Types.GeographyType branch to appendTransformedGeometryValueForWrite that strips the SRID via ewkbToWkb, symmetric to the GEOMETRY branch. The read-side counterpart should mirror the GEOMETRY path so a registered geography(EPSG:NNN) table round-trips its SRID into the block. A test along the lines of the reproducer (assert the on-disk bytes don't have the EWKB flag bit, or read with parquet-mr / pyiceberg and confirm WKB parses) would catch this and prevent regressions.
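The write-side half of that fix — strip the PostGIS-style SRID extension before the bytes reach the file — boils down to clearing one flag bit and dropping four bytes. A self-contained approximation of what an ewkbToWkb for the geography branch would do (hypothetical names, not the PR's code):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public final class EwkbToWkb
{
    // PostGIS EWKB flag: a 4-byte SRID follows the type code
    private static final int EWKB_SRID_FLAG = 0x20000000;

    private EwkbToWkb() {}

    // Converts EWKB to ISO WKB by clearing the SRID flag and removing the 4-byte SRID.
    public static byte[] stripSrid(byte[] ewkb)
    {
        if (ewkb.length < 9) {
            return ewkb;
        }
        ByteOrder order = ewkb[0] == 0 ? ByteOrder.BIG_ENDIAN : ByteOrder.LITTLE_ENDIAN;
        int type = ByteBuffer.wrap(ewkb, 1, 4).order(order).getInt();
        if ((type & EWKB_SRID_FLAG) == 0) {
            return ewkb; // already plain WKB
        }
        byte[] wkb = new byte[ewkb.length - 4];
        wkb[0] = ewkb[0];
        ByteBuffer.wrap(wkb, 1, 4).order(order).putInt(type & ~EWKB_SRID_FLAG);
        System.arraycopy(ewkb, 9, wkb, 5, ewkb.length - 9); // skip the SRID, keep the coordinates
        return wkb;
    }
}
```

Hooking the equivalent of this into the GEOGRAPHY write branch (and extending both containsGeometry switches) would turn the 01 01 00 00 20 ... bytes from the reproducer back into spec-conforming WKB.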

Comment on lines +415 to +419

// Validate SRID: fail only if both source and target are non-zero and different
if (sourceSrid != 0 && targetSrid != 0 && sourceSrid != targetSrid) {
throw new TrinoException(INVALID_FUNCTION_ARGUMENT,
"SRID mismatch: cannot write geometry with SRID %d into column %d with SRID %d".formatted(sourceSrid, columnIndex, targetSrid));
Member

Note

This is Claude-generated analysis — verified by running a test against this PR branch (results below).

The validation at line 417 treats sourceSrid == 0 as a wildcard, so a value with no embedded SRID can be inserted into a column declared with any non-default CRS without error. On read, getGeometrySrid injects the column's CRS-derived SRID, so the value the user gets back is silently re-tagged with a CRS the original data didn't have. Easy to do accidentally with ST_Point(...) / ST_GeometryFromText(...) (both produce SRID=0) against a column registered with e.g. EPSG:3857.

Verified against this PR's branch (52e283b) with this test added to TestIcebergV3 — it passes, demonstrating the bug:

@Test
void testWriteSridZeroIntoCustomCrsColumn_demonstratesSilentReassignment()
{
    String hadoopTableName = "hadoop_srid_zero_into_custom_" + randomNameSuffix();
    Path hadoopTableLocation = dataDirectory.resolve(hadoopTableName);

    Schema schema = new Schema(
            Types.NestedField.optional(1, "id", Types.IntegerType.get()),
            Types.NestedField.optional(2, "geom", Types.GeometryType.of("EPSG:3857")));

    new HadoopTables(new Configuration(false)).create(
            schema,
            PartitionSpec.unpartitioned(),
            SortOrder.unsorted(),
            ImmutableMap.of("format-version", "3", "write.format.default", "PARQUET"),
            hadoopTableLocation.toString());

    String registered = "registered_srid_zero_into_custom_" + randomNameSuffix();
    assertUpdate("CALL system.register_table(CURRENT_SCHEMA, '%s', '%s')"
            .formatted(registered, hadoopTableLocation));

    // ST_Point produces SRID=0; column is EPSG:3857
    assertUpdate("INSERT INTO " + registered + " VALUES (1, ST_Point(-122.4, 37.7))", 1);

    // The point comes back tagged with SRID=3857 even though we wrote a SRID=0 value.
    assertThat(query("SELECT ST_SRID(geom) FROM " + registered))
            .matches("VALUES 3857");

    assertUpdate("DROP TABLE " + registered);
}

The user wrote WGS84-looking lon/lat values; downstream tooling now reads them as web-mercator metres. There's no error, no warning, and no way for a user to spot it in the SQL output.

Suggested fix: tighten the predicate to fail unless sourceSrid == targetSrid, treating SRID=0 as a real (mismatching) value rather than a wildcard. If users still want to opt into "trust me, my SRID-0 data is in the column's CRS", make that an explicit cast / function call (e.g. require the user to call ST_SetSRID(g, 3857) themselves), not the implicit default.
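The tightened predicate is a one-line change; a sketch of the strict version (names hypothetical, error message mirroring the PR's):

```java
public final class StrictSridCheck
{
    private StrictSridCheck() {}

    // Strict variant: SRID 0 is an ordinary value, not a wildcard, so an
    // untagged geometry cannot silently land in an EPSG:3857 column.
    public static void validateSrid(int sourceSrid, int targetSrid, int columnIndex)
    {
        if (sourceSrid != targetSrid) {
            throw new IllegalArgumentException(
                    "SRID mismatch: cannot write geometry with SRID %d into column %d with SRID %d"
                            .formatted(sourceSrid, columnIndex, targetSrid));
        }
    }

    public static void main(String[] args)
    {
        validateSrid(4326, 4326, 0); // matching SRIDs pass
        try {
            validateSrid(0, 3857, 1); // ST_Point output (SRID 0) into a web-mercator column
        }
        catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Users who really do have SRID-0 data in the column's CRS would then opt in explicitly, e.g. by calling ST_SetSRID themselves, as suggested above.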

Comment on lines +410 to +411

if (icebergType instanceof Types.GeometryType geometryType) {
Member

Note

This is Claude-generated analysis — verified by running a test against this PR branch (results below).

The bytes flowing through this geometry-write branch have already lost Z/M dimensions before they reach the sink. JtsGeometrySerde.serialize is hard-coded to new WKBWriter(2, true) (line 80 of lib/trino-geospatial-toolkit/.../JtsGeometrySerde.java), so any Trino-side serialization path — ST_GeometryFromText('POINT Z (...)'), ST_Point plus a future Z constructor, casts, anything that round-trips through this type's writeObject — silently truncates to 2D. By the time trinoType.getSlice(block, position) runs above, Z is gone.

Verified against this PR's branch (52e283b) with this test added to TestIcebergV3 — it fails, demonstrating the loss:

@Test
void testWriteThreeDimensionalGeometry_demonstratesZDrop()
{
    try (TestTable table = newTrinoTable("test_z_drop_",
            "(geom geometry) WITH (format = 'PARQUET', format_version = 3)")) {
        assertUpdate("INSERT INTO " + table.getName() + " VALUES (ST_GeometryFromText('POINT Z (1 2 3)'))", 1);

        // Expected POINT Z (1 2 3); actual POINT (1 2)  — Z silently dropped at insert time.
        assertThat(query("SELECT ST_AsText(geom) FROM " + table.getName()))
                .matches("VALUES VARCHAR 'POINT Z (1 2 3)'");
    }
}

Test output:

[Rows for query [SELECT ST_AsText(geom) FROM test_z_drop_...]]
Expecting actual:
  (POINT (1 2))
to contain exactly in any order:
  [(POINT Z (1 2 3))]

Read fidelity is fine — bytes from XYZ files written by other engines pass through wkbToEwkb unchanged (it only flips the type-code bit), so external XYZ data round-trips faithfully through SELECT g. The loss is strictly on writes Trino itself produces.

Suggested fix: derive the WKBWriter's outputDimension from the source Geometry's coordinate dimension (geometry.getCoordinate().getDimension() or check Z/M presence) instead of hard-coding 2. Iceberg v3 advertises Z/M support, so silently demoting users' 3D data to 2D is a real surprise. At minimum, raising an error when the sink receives a WKB type code outside 1..7 would convert silent loss into a clear failure.
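Short of changing JTS itself, the needed output dimension can also be derived from the type code already present in the incoming WKB/EWKB — ISO encodes Z/M as +1000/+2000 offsets on the base type, while EWKB uses high flag bits. A sketch of that derivation (constant and method names are mine, not the PR's):

```java
public final class WkbDimension
{
    private static final int EWKB_Z_FLAG = 0x80000000;
    private static final int EWKB_M_FLAG = 0x40000000;

    private WkbDimension() {}

    // Output dimension (2, 3, or 4) implied by a WKB or EWKB geometry type code.
    public static int outputDimension(int typeCode)
    {
        int isoGroup = (typeCode & 0xFFFF) / 1000; // 0 = XY, 1 = XYZ, 2 = XYM, 3 = XYZM
        boolean hasZ = (typeCode & EWKB_Z_FLAG) != 0 || isoGroup == 1 || isoGroup == 3;
        boolean hasM = (typeCode & EWKB_M_FLAG) != 0 || isoGroup == 2 || isoGroup == 3;
        return 2 + (hasZ ? 1 : 0) + (hasM ? 1 : 0);
    }

    public static void main(String[] args)
    {
        System.out.println(outputDimension(1));          // Point -> 2
        System.out.println(outputDimension(1001));       // Point Z (ISO) -> 3
        System.out.println(outputDimension(3007));       // GeometryCollection ZM -> 4
        System.out.println(outputDimension(0x80000001)); // Point Z (EWKB flag) -> 3
    }
}
```

Passing something like this instead of a hard-coded 2 when constructing the WKBWriter would preserve the Z coordinate from the POINT Z (1 2 3) reproducer; failing loudly on type codes outside the expected range, as suggested above, covers at least the "no silent loss" half.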


Labels

cla-signed hive Hive connector iceberg Iceberg connector postgresql PostgreSQL connector stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed.

4 participants