
#16133 Parquet: Support row group skipping for shredded variant columns (#15510)

Draft
steveloughran wants to merge 4 commits into apache:main from steveloughran:pr/variant-rowgroups

Conversation

@steveloughran (Contributor) commented Apr 27, 2026

ParquetMetricsRowGroupFilter.compareVariant() implements comparisons for variants, including NaN and null handling.

  • The new ParquetVariantUtil splitter regexp can't cope with columns named ]. That's acceptable, as normalisation forbids that name as well as empty paths.
  • Copilot wrote the tests, so they are over-verbose but thorough, covering NaN behaviour and string truncation of max values.
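The NaN handling matters because NaN poisons min/max statistics. A minimal sketch of the kind of bound check involved; the class and method names here are illustrative only, not the actual `compareVariant()` implementation:

```java
// Hypothetical sketch of NaN-aware bound comparison for row group skipping.
// VariantBoundComparator and canSkipEquals are illustrative names, not Iceberg API.
public class VariantBoundComparator {

  /**
   * Returns true if a row group whose column stats span [min, max]
   * can be skipped for the predicate {@code value == target}.
   */
  public static boolean canSkipEquals(double min, double max, double target) {
    // NaN bounds mean the stats are unusable: never skip on them.
    if (Double.isNaN(min) || Double.isNaN(max)) {
      return false;
    }
    // NaN never equals anything, so the predicate matches no rows at all.
    if (Double.isNaN(target)) {
      return true;
    }
    // Otherwise skip only when the target lies outside the bounds.
    return target < min || target > max;
  }
}
```

The key design point is the asymmetry: NaN in the *statistics* forces a conservative "must read", while NaN in the *predicate* allows skipping everything.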

Fixes #15510

Testing notes

The full branch with this change, Qlong's API changes, and the benchmark is https://github.com/steveloughran/iceberg/tree/pr/variant-rowgroups-benchmark
It needs a matching Spark version.

@steveloughran marked this pull request as draft April 27, 2026 16:55

@steveloughran force-pushed the pr/variant-rowgroups branch 2 times, most recently from a10c7e5 to 8474d0d on April 29, 2026 13:10
@steveloughran force-pushed the pr/variant-rowgroups branch from f39d3ee to 1e0f654 on May 11, 2026 15:52
…kipping

Copilot's solution to why pushdown wasn't working, independent of
Qlong's apache#15385.

I plan to take Qlong's change and pull in what is extra from this one.
ParquetMetricsRowGroupFilter.compareVariant() implements comparisons
for variants, including NaN, null, uuid, and set membership.

* The new ParquetVariantUtil splitter regexp can't cope with columns named ].
  That's acceptable, as normalisation forbids that name as well as empty paths.
* Copilot wrote the tests, so they are over-verbose but thorough,
  covering NaN behaviour and string truncation of max values.
* Lazy build of variant mapping info, cached for the entire file.
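To illustrate the `]` limitation mentioned above: a bracket-based segment splitter cannot represent a field literally named `]`, because the closing bracket terminates the segment. The pattern and class below are illustrative only, not the actual ParquetVariantUtil regexp:

```java
// Hypothetical bracket-path splitter, illustrating why a field named "]"
// cannot be expressed: the character class [^\]]* stops at the first "]",
// so "[]]" parses as an empty segment plus a stray bracket.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PathSplitter {
  // Matches one bracketed segment, e.g. [a] within [a][b][c].
  private static final Pattern SEGMENT = Pattern.compile("\\[([^\\]]*)\\]");

  public static List<String> split(String path) {
    List<String> parts = new ArrayList<>();
    Matcher m = SEGMENT.matcher(path);
    while (m.find()) {
      parts.add(m.group(1));
    }
    return parts;
  }
}
```

Since path normalisation already forbids `]` in field names and rejects empty paths, the splitter never sees inputs it cannot handle.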

There's no concurrency handling in the build-up of that lazy structure;
once the design is settled, it will need to be locked down properly.
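One conventional way to lock down a lazily built per-file cache is `ConcurrentHashMap.computeIfAbsent`, which guarantees the value is built at most once per key. A minimal sketch; the class name, key type, and `buildMapping` helper are stand-ins, not the actual structure in this PR:

```java
// Hypothetical sketch: making a per-file lazy variant-mapping cache
// thread-safe with computeIfAbsent. All names here are illustrative.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class VariantMappingCache {
  private final Map<String, String> mappingByFile = new ConcurrentHashMap<>();

  public String mappingFor(String filePath) {
    // computeIfAbsent builds the mapping at most once per key,
    // even when multiple threads request the same file concurrently.
    return mappingByFile.computeIfAbsent(filePath, VariantMappingCache::buildMapping);
  }

  private static String buildMapping(String filePath) {
    // Stand-in for the expensive scan that derives the shredded-column
    // mapping from the Parquet file metadata.
    return "mapping:" + filePath;
  }
}
```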

TestSparkVariantFilterPushDown tests variant pushdown through Spark.
It currently needs a modified Spark build.

A counter is exported from ParquetMetricsRowGroupFilter to assist here;
it should be culled eventually.
Change-Id: I0be25227711dc9c605a5070cfdfea40ffda53674
@steveloughran steveloughran force-pushed the pr/variant-rowgroups branch from 1e0f654 to f4c129d Compare May 14, 2026 14:54
@steveloughran (Contributor, Author)

Big milestone reached today, with Spark and Parquet all lined up: I got a stack trace saying the variant can't be deserialized because the ByteBuffer is in the wrong byte order. This could mean that predicate pushdown is actually reaching the evaluator, but some ByteBuffer has the wrong byte order.

Or it is something unrelated. It's a new stack trace, though.

java.lang.IllegalArgumentException: Unsupported byte order: big endian
	at org.apache.iceberg.relocated.com.google.common.base.Preconditions.checkArgument(Preconditions.java:141)
	at org.apache.iceberg.variants.SerializedMetadata.from(SerializedMetadata.java:43)
	at org.apache.iceberg.variants.VariantMetadata.from(VariantMetadata.java:54)
	at org.apache.iceberg.variants.Variant.from(Variant.java:40)
	at org.apache.iceberg.expressions.InclusiveMetricsEvaluator.parseBounds(InclusiveMetricsEvaluator.java:635)
	at org.apache.iceberg.expressions.InclusiveMetricsEvaluator$MetricsEvalVisitor.extractLowerBound(InclusiveMetricsEvaluator.java:602)
	at org.apache.iceberg.expressions.InclusiveMetricsEvaluator$MetricsEvalVisitor.lowerBound(InclusiveMetricsEvaluator.java:543)
	at org.apache.iceberg.expressions.InclusiveMetricsEvaluator$MetricsEvalVisitor.eq(InclusiveMetricsEvaluator.java:305)
	at org.apache.iceberg.expressions.InclusiveMetricsEvaluator$MetricsEvalVisitor.eq(InclusiveMetricsEvaluator.java:83)
	at org.apache.iceberg.expressions.ExpressionVisitors$BoundVisitor.predicate(ExpressionVisitors.java:283)
	at org.apache.iceberg.expressions.ExpressionVisitors.visitEvaluator(ExpressionVisitors.java:390)
	at org.apache.iceberg.expressions.ExpressionVisitors.visitEvaluator(ExpressionVisitors.java:409)
	at org.apache.iceberg.expressions.InclusiveMetricsEvaluator$MetricsEvalVisitor.eval(InclusiveMetricsEvaluator.java:108)
	at org.apache.iceberg.expressions.InclusiveMetricsEvaluator.eval(InclusiveMetricsEvaluator.java:77)
	at org.apache.iceberg.ManifestReader.lambda$entries$0(ManifestReader.java:274)
	at org.apache.iceberg.io.CloseableIterable$5.shouldKeep(CloseableIterable.java:156)
	at org.apache.iceberg.io.FilterIterator.advance(FilterIterator.java:66)
	at org.apache.iceberg.io.FilterIterator.hasNext(FilterIterator.java:49)
	at org.apache.iceberg.io.FilterIterator.advance(FilterIterator.java:64)
	at org.apache.iceberg.io.FilterIterator.hasNext(FilterIterator.java:49)
	at org.apache.iceberg.io.CloseableIterable$7$1.hasNext(CloseableIterable.java:214)
	at org.apache.iceberg.io.CloseableIterable$ConcatCloseableIterable$ConcatCloseableIterator.hasNext(CloseableIterable.java:276)
	at org.apache.iceberg.spark.source.SparkPartitioningAwareScan.tasks(SparkPartitioningAwareScan.java:185)
	at org.apache.iceberg.spark.source.SparkPartitioningAwareScan.taskGroups(SparkPartitioningAwareScan.java:210)
	at org.apache.iceberg.spark.source.SparkPartitioningAwareScan.outputPartitioning(SparkPartitioningAwareScan.java:113)
	at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioningAndOrdering$$anonfun$partitioning$1.applyOrElse(V2ScanPartitioningAndOrdering.scala:45)
	at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioningAndOrdering$$anonfun$partitioning$1.applyOrElse(V2ScanPartitioningAndOrdering.scala:43)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:491)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:107)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:491)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:37)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:360)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:356)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:37)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:37)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:496)
	at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1264)
	at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1263)
	at org.apache.spark.sql.catalyst.plans.logical.Filter.mapChildren(basicLogicalOperators.scala:335)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:496)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:37)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:360)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:356)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:37)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:37)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:496)
	at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1264)
	at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1263)
	at org.apache.spark.sql.catalyst.plans.logical.Project.mapChildren(basicLogicalOperators.scala:73)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:496)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:37)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:360)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:356)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:37)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:37)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:496)
	at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1264)
	at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1263)
	at org.apache.spark.sql.catalyst.plans.logical.Sort.mapChildren(basicLogicalOperators.scala:903)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:496)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:37)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:360)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:356)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:37)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:37)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:496)
	at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1264)
	at org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1263)
	at org.apache.spark.sql.catalyst.plans.logical.Project.mapChildren(basicLogicalOperators.scala:73)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:496)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:37)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:360)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:356)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:37)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:37)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:467)
	at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioningAndOrdering$.partitioning(V2ScanPartitioningAndOrdering.scala:43)
	at org.apache.spark.sql.execution.datasources.v2.V2ScanPartitioningAndOrdering$.$anonfun$apply$1(V2ScanPartitioningAndOrdering.scala:36)

Change-Id: I6f402fe1e5bf65e51d79010efeb64e2c549e1d5f
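A likely explanation for the failure above: `java.nio.ByteBuffer` defaults to big-endian, while the stack trace shows `SerializedMetadata.from` rejecting anything that isn't little-endian. A minimal sketch of the kind of fix involved; `ByteOrderCheck` and `toLittleEndian` are illustrative names, not code from this PR:

```java
// Sketch of the suspected issue: ByteBuffer.wrap() yields a BIG_ENDIAN
// buffer by default, but the variant serialization path rejects anything
// that is not LITTLE_ENDIAN. The order must be set explicitly before the
// buffer is handed to the variant deserializer.
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteOrderCheck {
  public static ByteBuffer toLittleEndian(byte[] boundBytes) {
    // Setting the order does not copy or modify the bytes; it only
    // changes how multi-byte values are read from the buffer.
    return ByteBuffer.wrap(boundBytes).order(ByteOrder.LITTLE_ENDIAN);
  }
}
```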