Spark: Add vectorized Parquet reads for variant columns (#16292)
Follow-up to #16087: completes the vectorized read support for variant columns and removes the temporary patches.
Rationale for this Change
Variant columns currently force the entire table scan into row-at-a-time reads because the vectorized reader doesn't handle them. This PR fixes that by reading the variant's metadata and value children as Arrow VarBinary batches.
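To make the data shape concrete: each variant row decomposes into two binary buffers, the encoded value and its metadata dictionary. The sketch below is a standalone illustration of that layout using Arrow's Java API; it is not reader code from this PR, and the class name and placeholder bytes are made up.

```java
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VarBinaryVector;

public class VariantBatchSketch {
  public static void main(String[] args) {
    try (BufferAllocator allocator = new RootAllocator();
         // One Arrow vector per variant child: "value" holds the encoded variant
         // bytes, "metadata" holds the field-name dictionary for each row.
         VarBinaryVector value = new VarBinaryVector("value", allocator);
         VarBinaryVector metadata = new VarBinaryVector("metadata", allocator)) {
      int batchSize = 3;
      value.allocateNew();
      metadata.allocateNew();
      for (int row = 0; row < batchSize; row++) {
        // Placeholder bytes; a real reader would copy the Parquet BINARY values here.
        value.setSafe(row, new byte[] {0x01, (byte) row});
        metadata.setSafe(row, new byte[] {0x02, (byte) row});
      }
      value.setValueCount(batchSize);
      metadata.setValueCount(batchSize);
      System.out.println(value.getValueCount() + " variant rows in the batch");
    }
  }
}
```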
What changes are included in this PR?
- `VectorizedReaderBuilder` - adds `variantVisitor()`, which creates a `VectorizedVariantVisitor` scoped to each variant column's Parquet path
- `VectorizedVariantVisitor` - walks the variant's internal structure and creates Arrow readers for the metadata and value leaves
- `VectorizedArrowReader.VectorizedVariantReader` - composes the two child readers and delegates `read`/`setRowGroupInfo`/`setBatchSize`/`close`
- `VectorHolder.VariantVectorHolder` - carries both child holders through the batch pipeline
- `VariantColumnVector` (new) - Spark `ColumnVector` implementing `getChild(0)` = value and `getChild(1)` = metadata per Spark's `getVariant()` contract (a consumer-side sketch follows the change list below)
- `ColumnVectorBuilder` - dispatches `VariantVectorHolder` before the `isDummy()` check
- `SparkBatch` - allows unshredded variant through the batch-reads gate; tables with `write.parquet.shred-variants=true` fall back to row-at-a-time reads automatically
- Dropped the `assumeThat(vectorized).isFalse()` guards; all variant read tests now run with vectorization enabled

Limitations

- Shredded variant tables (`write.parquet.shred-variants=true`) still fall back to row-at-a-time reads.
- Variant columns imported from non-Iceberg Parquet files (missing the VARIANT annotation) are not read vectorized.
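To illustrate the `getChild` contract described above, here is a minimal consumer-side sketch (not code from this PR) showing how a caller could pull the raw variant bytes out of a batch produced by such a reader. `VariantBatchConsumer`, `variantBytes`, and `dump` are hypothetical names; only the child ordering (0 = value, 1 = metadata) comes from the description above.

```java
import org.apache.spark.sql.vectorized.ColumnVector;
import org.apache.spark.sql.vectorized.ColumnarBatch;

public final class VariantBatchConsumer {
  private VariantBatchConsumer() {}

  // Reads the raw variant bytes for one row from a variant ColumnVector that
  // follows the child layout above: child 0 = value, child 1 = metadata.
  static byte[][] variantBytes(ColumnVector variantColumn, int rowId) {
    if (variantColumn.isNullAt(rowId)) {
      return null;
    }
    byte[] value = variantColumn.getChild(0).getBinary(rowId);
    byte[] metadata = variantColumn.getChild(1).getBinary(rowId);
    return new byte[][] {value, metadata};
  }

  // Example walk over a batch; 'batch' and 'variantOrdinal' are assumed inputs.
  static void dump(ColumnarBatch batch, int variantOrdinal) {
    ColumnVector variantColumn = batch.column(variantOrdinal);
    for (int row = 0; row < batch.numRows(); row++) {
      byte[][] parts = variantBytes(variantColumn, row);
      System.out.println(
          parts == null
              ? "null"
              : parts[0].length + " value bytes, " + parts[1].length + " metadata bytes");
    }
  }
}
```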
Are these changes tested?
- `TestSparkVariantRead` (v4.0 + v4.1) - all tests now run with both `vectorized=false` and `vectorized=true`; previously, the `vectorized=true` cases were skipped
- `TestVariantShredding` (v4.0 + v4.1) - table created with `PARQUET_SHRED_VARIANTS=true` so that `SparkBatch` handles the fallback
- `TestSnapshotTableProcedure` (v4.0 + v4.1) - disables vectorization for variant columns imported from non-Iceberg files (missing the VARIANT annotation)

Are there any user-facing changes?