From 7d9db3f9b7fbd8b199b0d88d120637a6a559618a Mon Sep 17 00:00:00 2001
From: Yuya Ebihara
Date: Mon, 6 Apr 2026 21:32:50 +0900
Subject: [PATCH] Remove spark-variant-compatibility.md

---
 spark-variant-compatibility.md | 229 ---------------------------------
 1 file changed, 229 deletions(-)
 delete mode 100644 spark-variant-compatibility.md

diff --git a/spark-variant-compatibility.md b/spark-variant-compatibility.md
deleted file mode 100644
index 3c9e24a3efb3..000000000000
--- a/spark-variant-compatibility.md
+++ /dev/null
@@ -1,229 +0,0 @@
-# Spark Variant Compatibility
-
-This note records the current Spark/Iceberg compatibility story for `VARIANT` in this branch.
-
-## What Trino Implements
-
-Trino implements Iceberg `VARIANT` support in this branch.
-
-On the Trino side, we currently verify Spark interop through product tests in:
-
-- `testing/trino-product-tests/src/main/java/io/trino/tests/product/iceberg/TestIcebergSparkCompatibility.java`
-- `testing/trino-product-tests-launcher/src/main/java/io/trino/tests/product/launcher/env/environment/EnvSinglenodeSparkIcebergRest.java`
-
-Those tests currently cover:
-
-- `Trino writes -> Spark reads` for `AVRO`
-- `Trino writes -> Spark reads` for `PARQUET`
-- `Spark writes -> Trino reads` for `AVRO`
-- `Spark writes -> Trino reads` for `PARQUET`
-
-Trino does not expose the Iceberg Hadoop catalog as a first-class catalog type. The supported catalog types are
-defined in:
-
-- `plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/CatalogType.java`
-
-There is no `HADOOP` entry there.
-
-However, Trino can still interoperate with Hadoop-catalog Iceberg tables once they are registered by location through
-the `register_table` procedure. That path is exercised in:
-
-- `plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergRegisterTableProcedure.java`
-
-Specifically, `testRegisterHadoopTableAndRead()` shows Trino registering and reading a Hadoop-created Iceberg table.
-
-## What Spark Implements
-
-Spark 4 has native `variant` support:
-
-- [Spark `VariantType`](https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/types/VariantType.scala)
-
-The main compatibility limitations here are not about Spark SQL syntax. They are about:
-
-- which Iceberg catalog path is being used
-- which Iceberg REST server version is being used
-- which file formats Spark's Iceberg integration can actually read and write for `VARIANT`
-- which Iceberg variant primitive encodings Spark's runtime understands
-
-## Catalog Support
-
-### Hive Metastore catalog
-
-Spark `VARIANT` table creation through the Iceberg Hive Metastore path is not supported.
-
-Reason:
-
-- upstream Iceberg Hive schema conversion still has no `VARIANT` mapping and throws unsupported-type errors
-
-Source:
-
-- [Iceberg `HiveSchemaUtil.java`](https://github.com/apache/iceberg/blob/main/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java)
-
-This is an upstream Iceberg Hive-catalog limitation, not a Trino syntax issue.
-
-### REST catalog
-
-Spark `VARIANT` works with the Iceberg REST catalog only if the REST server is new enough.
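-
-The difference is which type-parsing entry point the server's `SchemaParser` calls, detailed in the old/new comparison
-below. A minimal sketch of that difference, assuming the Iceberg `Types` API named in the sources that follow (an
-illustration, not the actual REST server code path):
-
-```java
-import org.apache.iceberg.types.Type;
-import org.apache.iceberg.types.Types;
-
-public class VariantTypeParsing
-{
-    public static void main(String[] args)
-    {
-        // Iceberg 1.6.0-era entry point: no "variant" mapping, so it throws
-        // (IllegalArgumentException in the versions we looked at)
-        try {
-            Types.fromPrimitiveString("variant");
-        }
-        catch (IllegalArgumentException expected) {
-            System.out.println("old entry point rejects variant: " + expected.getMessage());
-        }
-
-        // Iceberg 1.10.1-era entry point: accepts "variant"
-        Type parsed = Types.fromTypeName("variant");
-        System.out.println("new entry point parses: " + parsed);
-    }
-}
-```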
-
-Old REST server behavior:
-
-- older Iceberg schema parsing used `Types.fromPrimitiveString(...)`
-- that rejects `variant`
-
-Source:
-
-- [Iceberg 1.6.0 `SchemaParser.java`](https://github.com/apache/iceberg/blob/apache-iceberg-1.6.0/core/src/main/java/org/apache/iceberg/SchemaParser.java)
-
-Newer REST server behavior:
-
-- newer Iceberg schema parsing uses `Types.fromTypeName(...)`
-- that accepts `variant`
-
-Source:
-
-- [Iceberg 1.10.1 `SchemaParser.java`](https://github.com/apache/iceberg/blob/apache-iceberg-1.10.1/core/src/main/java/org/apache/iceberg/SchemaParser.java)
-
-In our test environment, the original REST setup was on the older `tabulario/iceberg-rest:1.5.0` line, which is why
-Spark table creation failed there. We now use a newer local REST image in the product test environment.
-
-### Hadoop/storage-based catalog
-
-Spark supports `VARIANT` with Iceberg's Hadoop catalog.
-
-Here "storage-based catalog" means Iceberg's Hadoop catalog, where table metadata is stored directly in the warehouse
-instead of going through Hive Metastore or REST.
-
-Upstream evidence:
-
-- Iceberg's own Spark variant test explicitly uses a Hadoop catalog to avoid Hive schema conversion
-
-Source:
-
-- [Iceberg `TestSparkVariantRead.java`](https://github.com/apache/iceberg/blob/main/spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/sql/TestSparkVariantRead.java)
-
-## File Format Support
-
-### AVRO
-
-Implemented in Trino and currently passing in our Spark compatibility tests.
-
-### PARQUET
-
-Implemented in Trino and currently passing in our Spark compatibility tests.
-
-### ORC
-
-Spark does not currently support the `VARIANT` ORC path well enough for us to include it in the compatibility matrix.
-
-The important statement here is about Spark/Iceberg support, not Trino support:
-
-- Trino implements Iceberg `VARIANT`
-- Spark Iceberg ORC read/write support for `VARIANT` is the part that is not working
-
-Why we consider it unsupported:
-
-- the Spark Iceberg ORC reader/writer code paths contain no variant-aware handling comparable to what the
-  variant-specific Spark tests exercise elsewhere
-- in local product testing, Spark ORC variant interop failed in those code paths
-
-Sources:
-
-- [Iceberg `SparkOrcReader.java`](https://github.com/apache/iceberg/blob/main/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcReader.java)
-- [Iceberg `SparkOrcWriter.java`](https://github.com/apache/iceberg/blob/main/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcWriter.java)
-
-So the current state is:
-
-- `AVRO`: yes
-- `PARQUET`: yes
-- `ORC`: no, because Spark/Iceberg ORC support for `VARIANT` is not there yet in a usable form for interop
-
-## Variant Primitive Encodings
-
-Trino follows the Iceberg variant encoding set in this branch. The compatibility gap is that Spark 4's variant runtime
-only understands a subset of the Iceberg variant primitive encodings. The product tests pin this down per value, in
-round trips shaped like the sketch below.
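-
-This is a hedged sketch of the `Spark writes -> Trino reads` direction in the style of
-`TestIcebergSparkCompatibility`. `onSpark()`/`onTrino()` are the existing product test query helpers from
-`io.trino.tests.product.utils.QueryExecutors`; the catalog, table name, and SQL are illustrative, not copied from the
-actual tests.
-
-```java
-import static io.trino.tests.product.utils.QueryExecutors.onSpark;
-import static io.trino.tests.product.utils.QueryExecutors.onTrino;
-
-public class VariantRoundTripSketch
-{
-    void sparkWritesTrinoReads()
-    {
-        // Spark creates an Iceberg table with a variant column and writes one value
-        onSpark().executeQuery("CREATE TABLE iceberg.default.variant_interop (v variant) USING iceberg");
-        onSpark().executeQuery("INSERT INTO iceberg.default.variant_interop SELECT parse_json('{\"a\": 1}')");
-
-        // Trino reads the same table through its Iceberg catalog
-        onTrino().executeQuery("SELECT v FROM iceberg.default.variant_interop");
-
-        onSpark().executeQuery("DROP TABLE iceberg.default.variant_interop");
-    }
-}
-```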
-
-### Currently verified in both directions
-
-These values are currently covered by product tests for both `Trino writes -> Spark reads` and
-`Spark writes -> Trino reads`:
-
-- `null`
-- variant `null`
-- `boolean`
-- `tinyint`
-- `smallint`
-- `integer`
-- `bigint`
-- `real`
-- `double`
-- `decimal`
-- `varchar`
-- `date`
-- array values
-
-### Currently verified only for `Trino writes -> Spark reads`
-
-These values are already covered when Trino writes Iceberg `VARIANT` data and Spark reads it, but are not yet
-verified in the `Spark writes -> Trino reads` direction:
-
-- `varbinary`
-- map/object values
-- row/object values
-- `uuid`
-
-### Iceberg variant primitive encodings that Spark does not currently understand
-
-These Iceberg variant primitive encodings are not currently understood by the Spark 4.0 variant runtime shipped in our
-test image:
-
-- `TIME_NTZ_MICROS`
-- `TIMESTAMP_UTC_NANOS`
-- `TIMESTAMP_NTZ_NANOS`
-
-Why:
-
-1. Iceberg 1.10.1 itself defines these physical types in its variant model:
-   - [Iceberg `PhysicalType.java`](https://github.com/apache/iceberg/blob/apache-iceberg-1.10.1/api/src/main/java/org/apache/iceberg/variants/PhysicalType.java)
-   - [Iceberg `Primitives.java`](https://github.com/apache/iceberg/blob/apache-iceberg-1.10.1/api/src/main/java/org/apache/iceberg/variants/Primitives.java)
-2. Trino's variant header numbering matches those same physical encodings:
-   - `core/trino-spi/src/main/java/io/trino/spi/variant/Header.java`
-3. Spark's runtime `VariantUtil` recognizes a fixed set of primitive IDs and throws
-   `UNKNOWN_PRIMITIVE_TYPE_IN_VARIANT` for any other ID:
-   - [Spark `VariantUtil.java`](https://github.com/apache/spark/blob/master/common/variant/src/main/java/org/apache/spark/types/variant/VariantUtil.java)
-
-In the Spark 4.0.0 runtime shipped in our test image, the unsupported IDs line up with Trino's numbering exactly:
-
-- `17` -> `TIME_NTZ_MICROS`
-- `18` -> `TIMESTAMP_UTC_NANOS`
-- `19` -> `TIMESTAMP_NTZ_NANOS`
-
-This is why some valid Iceberg/Trino-written `VARIANT` values still cannot be read by Spark even when the catalog and
-file format are otherwise supported. The dispatch involved is sketched below.
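-
-To make the failure mode concrete, here is a schematic of the dispatch a variant reader performs on a primitive value
-header (low 2 bits = basic type, upper 6 bits = primitive type ID). The method name is illustrative; only the bit
-layout and the ID numbering follow the sources above, and this simplifies what Spark's `VariantUtil` actually does.
-
-```java
-static String primitiveTypeName(byte header)
-{
-    int basicType = header & 0x3;       // 0 marks a primitive value in the variant encoding
-    int typeId = (header >> 2) & 0x3F;  // primitive type ID from the upper 6 bits
-    if (basicType != 0) {
-        throw new IllegalArgumentException("not a primitive variant value");
-    }
-    return switch (typeId) {
-        case 0 -> "null";
-        case 1 -> "boolean true";
-        case 2 -> "boolean false";
-        // ... the other IDs Spark 4.0 understands ...
-        // Spark's VariantUtil has no branches for 17 (TIME_NTZ_MICROS),
-        // 18 (TIMESTAMP_UTC_NANOS), or 19 (TIMESTAMP_NTZ_NANOS), so
-        // Iceberg/Trino-written values with those IDs fall through to an
-        // error, surfaced as UNKNOWN_PRIMITIVE_TYPE_IN_VARIANT
-        default -> throw new UnsupportedOperationException("unknown primitive type ID: " + typeId);
-    };
-}
-```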
-
-## Summary
-
-The short version is:
-
-- Trino implements Iceberg `VARIANT`
-- Spark 4 also implements `variant`
-- Hive Metastore catalog is still blocked upstream in Iceberg
-- REST catalog works only with a new enough Iceberg REST server
-- Hadoop/storage-based catalog works on the Spark side
-- ORC is currently blocked by Spark/Iceberg support, not by Trino
-- Spark still does not understand every Iceberg variant primitive encoding
-
-## References
-
-- [Iceberg `HiveSchemaUtil.java`](https://github.com/apache/iceberg/blob/main/hive-metastore/src/main/java/org/apache/iceberg/hive/HiveSchemaUtil.java)
-- [Iceberg 1.6.0 `SchemaParser.java`](https://github.com/apache/iceberg/blob/apache-iceberg-1.6.0/core/src/main/java/org/apache/iceberg/SchemaParser.java)
-- [Iceberg 1.10.1 `SchemaParser.java`](https://github.com/apache/iceberg/blob/apache-iceberg-1.10.1/core/src/main/java/org/apache/iceberg/SchemaParser.java)
-- [Iceberg `TestSparkVariantRead.java`](https://github.com/apache/iceberg/blob/main/spark/v4.0/spark/src/test/java/org/apache/iceberg/spark/sql/TestSparkVariantRead.java)
-- [Iceberg `SparkOrcReader.java`](https://github.com/apache/iceberg/blob/main/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcReader.java)
-- [Iceberg `SparkOrcWriter.java`](https://github.com/apache/iceberg/blob/main/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/data/SparkOrcWriter.java)
-- [Iceberg `PhysicalType.java`](https://github.com/apache/iceberg/blob/apache-iceberg-1.10.1/api/src/main/java/org/apache/iceberg/variants/PhysicalType.java)
-- [Iceberg `Primitives.java`](https://github.com/apache/iceberg/blob/apache-iceberg-1.10.1/api/src/main/java/org/apache/iceberg/variants/Primitives.java)
-- [Spark `VariantUtil.java`](https://github.com/apache/spark/blob/master/common/variant/src/main/java/org/apache/spark/types/variant/VariantUtil.java)
-- [Spark `VariantType.scala`](https://github.com/apache/spark/blob/master/sql/api/src/main/scala/org/apache/spark/sql/types/VariantType.scala)
-- `core/trino-spi/src/main/java/io/trino/spi/variant/Header.java`
-- `plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/CatalogType.java`
-- `plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergRegisterTableProcedure.java`
-- `testing/trino-product-tests/src/main/java/io/trino/tests/product/iceberg/TestIcebergSparkCompatibility.java`
-- `testing/trino-product-tests-launcher/src/main/java/io/trino/tests/product/launcher/env/environment/EnvSinglenodeSparkIcebergRest.java`