
Spark: Add vectorized Parquet reads for variant columns #16292

Open

nssalian wants to merge 6 commits into apache:main from nssalian:variant-vectorized-reader

Conversation


@nssalian nssalian commented May 11, 2026

Follow-up to #16087: completes vectorized read support for variant columns and removes the temporary patches.

Rationale for this Change

Variant columns currently force the entire table into row-at-a-time reads because the vectorized reader does not handle them. This PR fixes that by reading a variant's metadata and value children as Arrow VarBinary batches.
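As a rough illustration of that idea, here is a minimal, self-contained Arrow sketch (not the PR's code; the class and variable names are invented) showing the two binary children of a variant column materialized as VarBinaryVector batches:

```java
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarBinaryVector;

public class VariantLeafBatches {
  public static void main(String[] args) {
    try (RootAllocator allocator = new RootAllocator();
         VarBinaryVector metadata = new VarBinaryVector("metadata", allocator);
         VarBinaryVector value = new VarBinaryVector("value", allocator)) {
      metadata.allocateNew();
      value.allocateNew();
      // Placeholder bytes for row 0 of each child; real readers fill these
      // from the Parquet pages of the variant's two binary leaves. Both
      // children always carry the same row count.
      metadata.setSafe(0, new byte[] {0x01});
      value.setSafe(0, new byte[] {0x0C, 0x2A});
      metadata.setValueCount(1);
      value.setValueCount(1);
      System.out.println("batch rows: " + value.getValueCount());
    }
  }
}
```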

What changes are included in this PR?

  • VectorizedReaderBuilder - adds variantVisitor() that creates a VectorizedVariantVisitor scoped to each variant column's Parquet path
  • VectorizedVariantVisitor - walks variant's internal structure, creates Arrow readers for metadata + value leaves
  • VectorizedArrowReader.VectorizedVariantReader - composes two child readers and delegates read/setRowGroupInfo/setBatchSize/close (see the sketch after this list)
  • VectorHolder.VariantVectorHolder - carries both child holders through the batch pipeline
  • VariantColumnVector (new) - Spark ColumnVector implementing getChild(0) = value and getChild(1) = metadata, per Spark's getVariant() contract
  • ColumnVectorBuilder - dispatches VariantVectorHolder before the isDummy() check
  • SparkBatch - allows unshredded variant through the batch-reads gate; tables with write.parquet.shred-variants=true fall back to row-at-a-time automatically
  • Tests - removed assumeThat(vectorized).isFalse() guards; all variant read tests now run with vectorization enabled
  • Applied to both Spark 4.0 and 4.1
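The composition described in the VectorizedVariantReader bullet above can be sketched as follows. This uses a simplified, hypothetical interface rather than Iceberg's actual VectorizedReader API; it only illustrates how the two child readers advance in lockstep and how lifecycle calls are delegated:

```java
// Simplified stand-in for the real reader interface (illustrative only).
interface BatchReader<T> extends AutoCloseable {
  T read(int numRows);

  void setBatchSize(int batchSize);

  @Override
  void close();
}

// One batch of a variant column: parallel arrays of metadata and value bytes.
record VariantBatch(byte[][] metadata, byte[][] value) {}

final class VariantBatchReader implements BatchReader<VariantBatch> {
  private final BatchReader<byte[][]> metadataReader;
  private final BatchReader<byte[][]> valueReader;

  VariantBatchReader(BatchReader<byte[][]> metadataReader, BatchReader<byte[][]> valueReader) {
    this.metadataReader = metadataReader;
    this.valueReader = valueReader;
  }

  @Override
  public VariantBatch read(int numRows) {
    // Both children advance together so that row i of metadata pairs with
    // row i of value.
    return new VariantBatch(metadataReader.read(numRows), valueReader.read(numRows));
  }

  @Override
  public void setBatchSize(int batchSize) {
    metadataReader.setBatchSize(batchSize);
    valueReader.setBatchSize(batchSize);
  }

  @Override
  public void close() {
    metadataReader.close();
    valueReader.close();
  }
}
```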

Limitations

  • Shredded variant columns are not vectorized - the SparkBatch gate detects write.parquet.shred-variants=true and falls back to row-at-a-time reads (see the sketch after this list)
  • Variant inside structs/lists/maps still falls back to row-at-a-time (pre-existing limitation for all complex types).
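The shredded-variant gate can be pictured with a small hedged sketch. The property key comes from the PR text; the method and class names are illustrative, not Iceberg's actual code:

```java
import java.util.Map;

class VariantBatchGate {
  // Property key taken from the PR description; method name is illustrative.
  static boolean canVectorizeVariants(Map<String, String> tableProperties) {
    boolean shredded = Boolean.parseBoolean(
        tableProperties.getOrDefault("write.parquet.shred-variants", "false"));
    // Shredded variants fall back to row-at-a-time reads.
    return !shredded;
  }

  public static void main(String[] args) {
    System.out.println(canVectorizeVariants(Map.of())); // true
    System.out.println(
        canVectorizeVariants(Map.of("write.parquet.shred-variants", "true"))); // false
  }
}
```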

Are these changes tested?

  • TestSparkVariantRead (v4.0 + v4.1) - all tests now run with both vectorized=false and vectorized=true; previously the vectorized=true runs were skipped (see the sketch after this list)
  • TestVariantShredding (v4.0 + v4.1) - table created with PARQUET_SHRED_VARIANTS=true so that SparkBatch exercises the row-at-a-time fallback
  • TestSnapshotTableProcedure (v4.0 + v4.1) - disables vectorization for variant columns imported from non-Iceberg files (missing VARIANT annotation)
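For context, the test change amounts to exercising both values of the vectorized flag instead of skipping one. An illustrative JUnit 5 sketch (not the PR's actual test code):

```java
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;

class VariantReadTestSketch {
  @ParameterizedTest
  @ValueSource(booleans = {false, true})
  void readsVariantColumn(boolean vectorized) {
    // Before this PR, the vectorized case was skipped with
    // assumeThat(vectorized).isFalse(); both paths now execute.
    // ... configure the scan with the vectorized flag and assert the
    // returned variant values match the row-at-a-time results ...
  }
}
```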

Are there any user-facing changes?

  • After this change, enabling vectorization applies to non-shredded variant columns as well (see the example below).
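For example, assuming the standard Iceberg Spark read option vectorization-enabled (with read.parquet.vectorization.enabled as the table-level equivalent), a vectorized read of a table containing a variant column looks like this; the table name is hypothetical:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadVariantVectorized {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().getOrCreate();

    Dataset<Row> df = spark.read()
        .format("iceberg")
        .option("vectorization-enabled", "true") // Iceberg Spark read option
        .load("db.events_with_variant");         // hypothetical table name
    df.show();
  }
}
```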

@nssalian nssalian changed the title Spark, Arrow: Add vectorized Parquet reads for variant columns Spark,Arrow: Add vectorized Parquet reads for variant columns May 11, 2026
@nssalian nssalian changed the title Spark,Arrow: Add vectorized Parquet reads for variant columns Spark: Add vectorized Parquet reads for variant columns May 11, 2026
@nssalian nssalian marked this pull request as ready for review May 13, 2026 15:44
