[SPARK-56521][SQL] Support PartitionPredicate in runtime filters #55382
szehon-ho wants to merge 8 commits into apache:master
Conversation
Force-pushed from c9c87bd to 939e128
…ters.nonEmpty, simplify partPredicates
Force-pushed from 939e128 to 0ba0445
…panion object
Extract runtime filter pushing logic from filteredPartitions into a companion object method with a pattern match guard, removing the asInstanceOf cast.
Move the runtime filter pushing logic from the BatchScanExec companion object to PushDownUtils, co-locating it with the related partition predicate helpers.
Add a pushedPredicates() API to SupportsRuntimeV2Filtering, mirroring SupportsPushDownV2Filters. Use it in pushRuntimeFilters to exclude already-pushed predicates from the second pass and to determine whether replanning is needed.
…pushedPredicates dedup, and comprehensive tests
- Use pushedPredicates() to avoid deriving PartitionPredicates from runtime filters whose V2 translation was already accepted in the first filter() pass, preventing duplicate pushdown.
- Gate PartitionPredicate candidates on filterAttributes(), consistent with PartitionPruning's planning-time check, using ExprId-based AttributeSet.subsetOf comparison.
- Reorganize test suite into 12 numbered cases (with subcases) covering all combinations of DPP/scalar, translated/untranslatable, accepted/rejected, partition/data column, and filterAttributes.
- Add configurable test table properties (accept-v2-predicates, filter-attributes) for targeted scenario testing.
cloud-fan
left a comment
I think there's a regression here that will affect every existing SupportsRuntimeV2Filtering / SupportsRuntimeFiltering implementation. The PR description says "existing connectors are unaffected", but reading pushRuntimeFilters and BatchScanExec.filteredPartitions together, the new code path decides whether to re-plan partitions based on pushedPredicates().nonEmpty — and pushedPredicates() has a default that returns an empty array. Existing connectors don't override it, so filtered is false, scan.toBatch.planInputPartitions() is never re-called, and inputPartitions (lazily evaluated before filter()) is what BatchScanExec scans. The connector's filter() side-effect is effectively dropped. Details inline on F1.
Existing DPP V2 test suites don't catch this because they check runtimeFilters on BatchScanExec and final query answers — post-scan FilterExec preserves correctness. Partition pruning effectiveness is not asserted for V2. Case 11 in the new suite looks like it covers this, but the assertPushedPartitionPredicates helper falls through to Seq.empty for any scan class other than the new test table, so its == 0 assertion is trivially true (F5).
Other findings inline are smaller: a redundant/weaker runtime filterAttributes re-check (F2), a naming inconsistency with the sibling static interface (F3), a missing contract detail in the class Javadoc (F4), a scan-local conf leak in the test (F7), a Scaladoc imprecision (F6), and a minor double-call (F8).
    filterableScan.pushedPredicates().nonEmpty
[BLOCKING] The old BatchScanExec always re-called scan.toBatch.planInputPartitions() whenever filter() had been invoked; this return value is meant to replace that signal. But pushedPredicates() is a newly added default-returning-empty method. Every existing SupportsRuntimeV2Filtering / SupportsRuntimeFiltering implementation that doesn't override it will have this return false, causing BatchScanExec.filteredPartitions to fall into the else branch and use the pre-filter inputPartitions — the connector's filter() side effect (partitions pruned in its internal state) is then invisible to Spark. This breaks runtime partition pruning for every existing V2 runtime-filter implementation (including the in-repo InMemoryBatchScan and InMemoryV2FilterBatchScan).
The signal should be "was filter() actually called", not "did the scan self-report". Something like:
    var filterCalled = false
    if (filtersToTranslated.nonEmpty) {
      filterableScan.filter(filtersToTranslated.values.toArray)
      filterCalled = true
    }
    if (filterableScan.supportsIterativeFiltering()) {
      // ...
      if (partPredicates.nonEmpty) {
        filterableScan.filter(partPredicates.toArray)
        filterCalled = true
      }
    }
    filterCalled

Please also add a regression test: a scan that doesn't override pushedPredicates() and asserts its filter()-driven partition pruning still takes effect (e.g., via partition count on the resulting batch).
I think this is overstated; there's actually an API 'supportsIterativeFiltering' that defaults to false, so it should not affect existing connectors.
The rule is now: if the connector returns 'true', it needs to maintain pushedPredicates().
That being said, I did add a test for such a connector (one that overrides supportsIterativeFiltering=true but doesn't implement pushedPredicates()); the expected behavior is that it then gets duplicate predicates in the second round.
    val pushed = filterableScan.pushedPredicates().toSet
    val candidates = runtimeFilters.filter { f =>
      !filtersToTranslated.get(f).exists(pushed.contains) &&
        f.references.subsetOf(filterAttrs)
These candidates are already constrained at planning time: DPP filters in PartitionPruning.scala:82-89 require resExp.references.subsetOf(filterAttrs) via V2ExpressionUtils.resolveRefs; scalar subquery filters in DataSourceV2Strategy.scala:168-173 require f.references.subsetOf(relation.runtimeFilterAttrs) — both using the proper resolver. The runtime reconstruction here uses r.fieldNames.head + output.find(resolver), which drops multi-part paths for nested partition fields and is redundant with planning-time filtering. Can we drop this re-check and rely on the planning-time filters? If it's defense-in-depth, please use V2ExpressionUtils.resolveRefs for consistency with the planning-time path.
Done. Extracted V2ExpressionUtils.resolveAttributeRefs and used it in both PartitionPruning and PushDownUtils for consistency.
    *
    * @since 4.2.0
    */
    default boolean supportsIterativeFiltering() {
SupportsPushDownV2Filters already has supportsIterativePushdown() for the same concept. Two names for one capability is a consistency hazard for connectors that implement both interfaces. Worth aligning — supportsIterativeFiltering matches the local filter() verb, supportsIterativePushdown matches the sibling interface. Happy either way, just ideally one name.
Yeah, to add to the naming context: for runtime filters the API is 'filter', so there's no mention of pushdown.
So supportsIterativePushdown may not make sense. But I'm open as well. cc @aokolnychyi
Actually, never mind, it does make sense since the other new method is 'pushedPredicates'; I changed the name to match now.
    * and only one of them should be implemented by the data sources.
    *
    * <p>
    * <b>Iterative filtering:</b> When {@link #supportsIterativeFiltering()} returns true,
The description says filter() "may be called multiple times" but doesn't state the call order. The implementation pushes translated V2 predicates first and PartitionPredicate instances second, and the in-repo InMemoryEnhancedRuntimePartitionFilterBatchScan already relies on that ordering. Worth documenting explicitly so implementations know what to expect in each call.
Done. Updated Javadoc to document the two-pass call order and that the second pass excludes filters already accepted via pushedPredicates().
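For readers following along, the documented order can be sketched as the sequence of calls an opted-in scan observes. This is a minimal illustration, not the Javadoc text or Spark's actual driver code: the concrete predicate is an arbitrary placeholder built with the public Predicate/Expressions APIs, and the method names assume the defaults added by this PR (supportsIterativeFiltering was later renamed to supportsIterativePushdown).

```scala
import org.apache.spark.sql.connector.expressions.{Expression, Expressions}
import org.apache.spark.sql.connector.expressions.filter.Predicate
import org.apache.spark.sql.connector.read.SupportsRuntimeV2Filtering

// Sketch of the two-pass call order described in this thread, from the scan's
// point of view. What Spark actually pushes depends on the query's runtime filters.
def illustrateCallOrder(scan: SupportsRuntimeV2Filtering): Unit = {
  // Pass 1: runtime filters that could be translated to V2 predicates.
  val translated = new Predicate("IS_NOT_NULL", Array[Expression](Expressions.column("part")))
  scan.filter(Array(translated))

  // Spark reads pushedPredicates() to learn which pass-1 predicates the scan
  // accepted; accepted ones are excluded from PartitionPredicate derivation.
  val accepted = scan.pushedPredicates().toSet

  // Pass 2: PartitionPredicate instances derived from the remaining runtime
  // filters (gated by filterAttributes()). Not constructed here, since the
  // PartitionPredicate API itself is not shown in this thread.
  println(s"predicates accepted after pass 1: ${accepted.size}")
}
```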
    checkAnswer(df, Row(3, 3))

    assertHasRuntimeFilters(df)
    assertPushedPartitionPredicates(df, 0)
This assertion is effectively a no-op here. getPushedPartitionPredicates pattern-matches specifically on InMemoryEnhancedRuntimePartitionFilterBatchScan; for any other scan class (case 11 uses InMemoryV2FilterBatchScan) it returns Seq.empty. So assertPushedPartitionPredicates(df, 0) is trivially true regardless of what actually happened — it would pass even if a PartitionPredicate did get pushed. More reliable options: track raw filter() argument types on the test scan and assert no PartitionPredicate was seen, or add a variant of the new test table that returns supportsIterativeFiltering=false so the existing helper can still inspect it.
Done. Case 11 now uses the enhanced catalog with supports-iterative-filtering=false table property, so the assertPushedPartitionPredicates helper can properly inspect the scan.
    * Only runtime filters that were not already translated are used to derive PartitionPredicates
    * in the second pass, avoiding duplicate pushdown.
The code filter is !filtersToTranslated.get(f).exists(pushed.contains) — i.e., exclude filters whose translation was already accepted (present in pushedPredicates()). Translated-but-rejected filters are still candidates. The inline comment on line 161-163 already states this correctly.
Suggested change:

    - * Only runtime filters that were not already translated are used to derive PartitionPredicates
    - * in the second pass, avoiding duplicate pushdown.
    + * Only runtime filters whose translated form was not already accepted by the data source in
    + * the first pass are used to derive PartitionPredicates in the second pass, avoiding duplicate
    + * pushdown.
Done. Updated scaladoc to clarify "accepted" vs "rejected".
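To make the accepted/rejected/untranslatable distinction concrete, here is a toy restatement of the quoted candidate rule. Strings stand in for the real expressions and predicates, and the map and set are stand-ins for filtersToTranslated and pushedPredicates().

```scala
// Toy model of the candidate rule quoted above: keys are runtime filter
// expressions, values are their V2 translations.
val filtersToTranslated = Map("f_accepted" -> "p1", "f_rejected" -> "p2")
val pushed = Set("p1") // only p1 was accepted by the source in the first pass

def isCandidate(f: String): Boolean =
  !filtersToTranslated.get(f).exists(pushed.contains)

assert(isCandidate("f_untranslatable")) // never translated        -> candidate for PartitionPredicate
assert(isCandidate("f_rejected"))       // translated but rejected -> candidate for PartitionPredicate
assert(!isCandidate("f_accepted"))      // translated and accepted -> excluded, no duplicate pushdown
```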
    test("case 11: supportsIterativeFiltering is false -> no PartitionPredicate") {
      val baseCatalog = "testv2filterNoIterative"
      spark.conf.set(s"spark.sql.catalog.$baseCatalog",
This conf isn't unset; only spark.sessionState.catalogManager.reset() runs in after. Consider wrapping with withSQLConf (or unsetting at the end) so the conf is cleaned up between tests.
Done. Wrapped with withSQLConf.
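For reference, the cleanup pattern looks roughly like the following inside a suite that mixes in Spark's SQL test utilities. The catalog name follows the snippet above; the catalog implementation class here is a placeholder rather than the enhanced catalog used by the suite.

```scala
// Sketch only: withSQLConf (from SQLTestUtils/SQLHelper) sets the conf for the
// duration of the block and restores the previous value afterwards, even if the
// body throws, so the scan-local catalog conf no longer leaks between tests.
val baseCatalog = "testv2filterNoIterative"
withSQLConf(
    s"spark.sql.catalog.$baseCatalog" ->
      classOf[org.apache.spark.sql.connector.catalog.InMemoryTableCatalog].getName) {
  // ... create the table with the supports-iterative-filtering=false property and
  // run the case 11 assertions here ...
}
```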
    if (filterableScan.supportsIterativeFiltering()) {
      val filterAttrs = AttributeSet(filterableScan.filterAttributes()
        .flatMap(r => output.find(a => SQLConf.get.resolver(a.name, r.fieldNames.head))))
      val pushed = filterableScan.pushedPredicates().toSet
Minor — pushedPredicates() is invoked twice (here and again on line 180 for the return signal). For a connector with a non-trivial implementation, one call would do:
    val before = filterableScan.pushedPredicates().toSet
    // ... second-pass logic uses `before` ...
    // at the end, reuse `filterCalled` (from the F1 suggestion) instead of re-querying pushedPredicates
Done. Refactored to val approach — pushedPredicates() is now called only once, and the return value uses translatedFiltersPushed || partPredicatesPushed.
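Putting the two changes together (the filter-called return signal from F1 and the single pushedPredicates() read from F8), the refactored method has roughly the shape below. This is a sketch, not the actual PushDownUtils code: the PartitionPredicate derivation is abstracted into a function parameter, and supportsIterativeFiltering() is the name used at this point in the review (later renamed to supportsIterativePushdown).

```scala
import org.apache.spark.sql.catalyst.expressions.{AttributeSet, Expression}
import org.apache.spark.sql.connector.expressions.filter.Predicate
import org.apache.spark.sql.connector.read.SupportsRuntimeV2Filtering

// Sketch of the refactored pushRuntimeFilters shape discussed in this thread.
// pushedPredicates() is read once, and the return value reflects whether filter()
// was actually invoked rather than what the scan self-reports.
def pushRuntimeFiltersSketch(
    scan: SupportsRuntimeV2Filtering,
    runtimeFilters: Seq[Expression],
    filtersToTranslated: Map[Expression, Predicate],
    filterAttrs: AttributeSet,
    derivePartitionPredicate: Expression => Option[Predicate]): Boolean = {
  val translatedFiltersPushed = filtersToTranslated.nonEmpty
  if (translatedFiltersPushed) {
    scan.filter(filtersToTranslated.values.toArray) // first pass: translated V2 predicates
  }

  var partPredicatesPushed = false
  if (scan.supportsIterativeFiltering()) {
    val pushed = scan.pushedPredicates().toSet // single read, reused for the candidate filter
    val partPredicates = runtimeFilters
      .filter(f => !filtersToTranslated.get(f).exists(pushed.contains) &&
        f.references.subsetOf(filterAttrs))
      .flatMap(derivePartitionPredicate)
    if (partPredicates.nonEmpty) {
      scan.filter(partPredicates.toArray) // second pass: PartitionPredicate instances
      partPredicatesPushed = true
    }
  }

  // "filter() was actually called" -> BatchScanExec re-plans partitions even for
  // connectors that keep the default (empty) pushedPredicates().
  translatedFiltersPushed || partPredicatesPushed
}
```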
…test improvements
- Return translatedFiltersPushed || partPredicatesPushed instead of pushedPredicates().nonEmpty, so filter() side effects are visible even if the connector does not override pushedPredicates().
- Extract V2ExpressionUtils.resolveAttributeRefs to share resolution logic between PartitionPruning and PushDownUtils.
- Clarify SupportsRuntimeV2Filtering javadoc: document two-pass call order and that the second pass excludes already-accepted filters.
- Refactor case 11 to use the enhanced catalog with supports-iterative-filtering=false property and withSQLConf.
- Add regression test for buggy connector that omits first-pass filters from pushedPredicates().
- Code tidying in InMemoryEnhancedRuntimePartitionFilterTable.
cloud-fan
left a comment
Re-review of 6b07f10: 7 addressed, 1 remaining, 0 new.
Addressed:
- F1 (BLOCKING): pushRuntimeFilters now returns translatedFiltersPushed || partPredicatesPushed — a direct signal of whether filter() was invoked, independent of pushedPredicates(). Added a regression test (pushedPredicates() omits first-pass filters -> second round still prunes) that covers the buggy-connector case.
- F2: the runtime filterAttributes re-check now uses the new shared V2ExpressionUtils.resolveAttributeRefs helper, so PushDownUtils and PartitionPruning use the same resolver.
- F4: class-level Javadoc on SupportsRuntimeV2Filtering now documents the two-pass call order (translated V2 predicates first, PartitionPredicate second).
- F5: case 11 now uses the enhanced catalog with a new supports-iterative-filtering=false table property, so assertPushedPartitionPredicates actually inspects the scan instead of trivially matching Seq.empty.
- F6: scaladoc on pushRuntimeFilters now distinguishes "not already accepted" from "not already translated".
- F7: case 11 now wraps its scan-local conf in withSQLConf for proper cleanup.
- F8: pushedPredicates() is invoked once per pushRuntimeFilters call (via a val that is reused for both the candidate filter and the return signal).

Remaining:
- F3: supportsIterativeFiltering vs. the sibling SupportsPushDownV2Filters#supportsIterativePushdown naming is still open pending @aokolnychyi's input (#55382 (comment)). No blocker from my side either way — just one name.

New: none.
…tivePushdown
Align with SupportsPushDownV2Filters.supportsIterativePushdown() naming.
cloud-fan
left a comment
Re-review of f426fbc: 1 addressed, 0 remaining, 0 new — F3 is now resolved.
Addressed this round:
- F3: supportsIterativeFiltering renamed to supportsIterativePushdown, aligning with SupportsPushDownV2Filters#supportsIterativePushdown(). Test property key and test case names updated consistently; no stale references remain.
All eight findings from the original review are now resolved. No new concerns on my side.
thanks. merging to master!
What changes were proposed in this pull request?
This PR introduces PartitionPredicate support in runtime filters for DataSource V2 scans. Currently, PartitionPredicate is only used in static filter pushdown and metadata-only delete paths. This extends the same mechanism to runtime filters (Dynamic Partition Pruning and scalar subqueries).

Changes:
- SupportsRuntimeV2Filtering: Added supportsIterativeFiltering() and pushedPredicates() default methods. When a scan returns true from supportsIterativeFiltering(), Spark may call filter() multiple times — first with translated V2 predicates, then with PartitionPredicate instances derived from runtime filters. The pushedPredicates() method (mirroring SupportsPushDownV2Filters) allows Spark to determine which predicates were already accepted in the first pass, avoiding duplicate pushdown.
- BatchScanExec: After the existing runtime filter pushdown, if the scan supports iterative filtering, derives PartitionPredicate instances from DPP expressions and literalized scalar subqueries and pushes them in a second filter() call.
- PushDownUtils: Refactored pushRuntimeFilters() to track which runtime filter expressions were translated to V2 predicates. Uses pushedPredicates() to exclude filters already accepted in the first pass from PartitionPredicate derivation. Candidates are further gated by filterAttributes() — only runtime filters whose referenced columns are declared in filterAttributes() are eligible for PartitionPredicate derivation, consistent with PartitionPruning's planning-time check.

Why are the changes needed?
Runtime filters (DPP and scalar subqueries) currently push V2 predicates to connectors, but connectors have no way to receive partition-level predicates with evaluable functions. PartitionPredicate wraps a Catalyst expression that connectors can evaluate directly against partition keys, enabling more efficient partition pruning at runtime without needing to translate expressions into the connector's native predicate format.

The pushedPredicates() method is needed to prevent the same logical filter from being pushed twice — once as a translated V2 predicate and again as a PartitionPredicate. The filterAttributes() gate ensures that only filters on declared filterable columns are considered, aligning runtime behavior with the static planning-time checks in PartitionPruning.

This is a sub-task of the DSV2 Enhanced Partition Stats Filtering umbrella (SPARK-55596).
Does this PR introduce any user-facing change?
Yes. Connectors implementing SupportsRuntimeV2Filtering can now:
- Override supportsIterativeFiltering() to return true and receive PartitionPredicate instances via filter() during runtime filtering.
- Override pushedPredicates() to report which predicates were accepted, so Spark avoids redundant pushdown.

This is an opt-in API addition; existing connectors are unaffected.
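To illustrate the opt-in from a connector's side, a minimal sketch might look like the following. It assumes the new default methods described in this PR (using the supportsIterativeFiltering name from the description; the final code renames it to supportsIterativePushdown); the class name, column names, and the accept-everything policy are made up, and a real connector would prune its partition metadata inside filter().

```scala
import org.apache.spark.sql.connector.expressions.{Expressions, NamedReference}
import org.apache.spark.sql.connector.expressions.filter.Predicate
import org.apache.spark.sql.connector.read.SupportsRuntimeV2Filtering
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

// Hypothetical connector scan opting into the two-pass runtime filtering protocol.
class IterativeRuntimeFilterScan extends SupportsRuntimeV2Filtering {
  private var accepted: Array[Predicate] = Array.empty

  override def readSchema(): StructType =
    StructType(Seq(StructField("id", IntegerType), StructField("part", IntegerType)))

  // Runtime filters referencing only these columns are eligible for pushdown.
  override def filterAttributes(): Array[NamedReference] =
    Array(Expressions.column("part"))

  // Opt in to the second, PartitionPredicate-carrying filter() call.
  override def supportsIterativeFiltering(): Boolean = true

  override def filter(predicates: Array[Predicate]): Unit = {
    // Called first with translated V2 predicates, then with PartitionPredicate
    // instances derived from runtime filters not accepted in the first call.
    // Prune partition metadata here; this sketch just records what was applied.
    accepted ++= predicates
  }

  // Lets Spark skip deriving PartitionPredicates for filters already applied above;
  // keeping the default (empty) implementation would mean duplicate predicates in
  // the second pass, as discussed earlier in this review.
  override def pushedPredicates(): Array[Predicate] = accepted
}
```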
How was this patch tested?
Added DataSourceV2EnhancedRuntimePartitionFilterSuite with 12 numbered test cases (and subcases) covering all combinations of: DPP vs. scalar subquery, translated vs. untranslatable, accepted vs. rejected, partition vs. data column, and filterAttributes().

Supporting test infrastructure: InMemoryEnhancedRuntimePartitionFilterTable (with configurable accept-v2-predicates and filter-attributes table properties) and InMemoryTableEnhancedRuntimePartitionFilterCatalog.

Was this patch authored or co-authored using generative AI tooling?
Generated-by: Cursor (Claude Opus 4)