feat(rust): Vortex Phase 2 — AExpr-direct filter pushdown (stacked on #2)#1
feat(rust): Vortex Phase 2 — AExpr-direct filter pushdown (stacked on #2)#1lwwmanning wants to merge 62 commits into
Conversation
|
The uncompressed lib size after this PR is 59.4950 MB. |
1 similar comment
|
The uncompressed lib size after this PR is 59.4950 MB. |
|
The uncompressed lib size after this PR is 59.5098 MB. |
4 similar comments
|
The uncompressed lib size after this PR is 59.5098 MB. |
|
The uncompressed lib size after this PR is 59.5098 MB. |
|
The uncompressed lib size after this PR is 59.5098 MB. |
|
The uncompressed lib size after this PR is 59.5098 MB. |
|
The uncompressed lib size after this PR is 59.3440 MB. |
|
The uncompressed lib size after this PR is 59.5096 MB. |
1 similar comment
|
The uncompressed lib size after this PR is 59.5096 MB. |
|
The uncompressed lib size after this PR is 59.5094 MB. |
|
The uncompressed lib size after this PR is 59.5093 MB. |
1 similar comment
|
The uncompressed lib size after this PR is 59.5093 MB. |
|
The uncompressed lib size after this PR is 59.5095 MB. |
4 similar comments
|
The uncompressed lib size after this PR is 59.5095 MB. |
|
The uncompressed lib size after this PR is 59.5095 MB. |
|
The uncompressed lib size after this PR is 59.5095 MB. |
|
The uncompressed lib size after this PR is 59.5095 MB. |
…O + EqValidity None tests Cycle-1 gauntlet (preset=pr-2, fresh+correctness) accepted PR-2.1 with no must-fix but flagged 5 should-fix items. This commit addresses 2 of them; the others (3 deferred) are tracked in the Deferred work section in the follow-up plan commit. 1. **Bitwise-vs-logical operator caveat** (cycle-1 correctness should-fix #1+#2; cycle-1 fresh should-fix F-003): Polars `Operator::And`/`Or` and `IRBooleanFunction::Not` are bitwise-OR-logical (work on integer columns too — see boolean.rs:49 `// Also bitwise negate`). Vortex's `and`/`or`/`not` are boolean-only. For embedded integer-bitwise sub-trees like `(col_int & 1) > 0`, the convertor would emit semantically wrong Vortex expressions. Polars's own column_expr.rs: 245-247 guards with `dtype.is_bool()`. Added a load-bearing TODO comment at the doc-block of `aexpr_to_vortex_expression` explaining the constraint + naming PR-2.2's wire-up site as the schema-gated mitigation point. Option (a) at the call site (skip pushdown if any And/Or/Not operand is non-bool) is cheaper than threading `&Schema` through the convertor. 2. **Missing None-returning tests for `EqValidity`/`NotEqValidity`** (cycle-1 fresh should-fix F-002): the explicit `return None` arms at the match for these operators had no test coverage. Added two ~10-LoC tests parallel to `unsupported_xor_returns_none`. Test count 21 → 23. Verified: `cargo test -p polars-plan --features vortex,is_between,is_in vortex_convertor` → 23 passed, 0 failed.
3b7260d to
d89b7a5
Compare
Consolidates the amend-completion + Step 2.3-step-2 awaiting-review transitions for PR-2.7 (implementation commits 065df50 + 0e2c26f landed ahead of this plan update — minor ordering drift from the canonical inner-loop sequence; substance unchanged). Current State now reflects: - status: awaiting-review - current_pr: PR-2.7, pr_index: 8 (post-amend, PR-2.7 is the 8th in Phase 2) - planning_sub_flow: null (amend sub-flow closed) - last_phase_end_verdict: null (reset per amend semantics; next phase-end review fires as cycle 2 of Phase 2 after PR-2.7 + PR-2.8 complete) - phase_entry_sha unchanged (93643dd) so cumulative-phase reviews still span original PR-2.0..PR-2.6 + amend PR-2.7 + PR-2.8 (cherry picked from commit 8639dd5)
Both reviewers (fresh + correctness) REJECT with high confidence. Three must-fix correctness gates of the same bug class as PR-2.3 cycle-1 CAST cross-kind and PR-2.4 cycle-2 comparison pairwise-PType: (1) is_between arm lacks pairwise-PType gate (vortex_convertor.rs:319-340); (2) Ternary arm lacks THEN/ELSE pairwise-dtype gate (vortex_convertor.rs:563-572); (3) StringExpr arm lacks Utf8 input gate (vortex_convertor.rs:471-510). All three violate always-SAFE-fallback contract. Severity dedupe on Ternary disagreement = HIGHEST per spec (fresh: must-fix, correctness: should-fix with reduced-reach rationale). 6 should-fix + 2 nit also flagged. Entering Step 2.4 fix-application loop.
<details><summary>gauntlet Synthesizer Output (schema_version: 1)</summary>
```json
{
"schema_version": 1,
"preset": "pr-2",
"lenses_used": ["fresh", "correctness"],
"review_count": 2,
"unified_findings": [
{"severity": "must-fix", "kind": "bug", "file_line": "crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs:319-340", "description": "is_between arm lacks pairwise-PType gate between col and bounds. The arm calls Vortex's gt/gt_eq/lt/lt_eq builders DIRECTLY (not via AExpr::BinaryExpr recursion), so the BinaryExpr arm's pairwise-PType gate is NEVER invoked. The comment at L312-318 falsely claims 'schema-gated only transitively' via the BinaryExpr arm. Cross-PType bounds (e.g., Int32 col with Int64 literal bounds when TYPE_COERCION is off) will vortex_bail! at scan-time in Vortex's Binary::return_dtype (vortex-array/src/scalar_fn/fns/binary/mod.rs:130-136) with 'Cannot compare different DTypes', violating the always-SAFE-fallback contract. Same bug class as PR-2.3 cycle-1 CAST cross-kind and PR-2.4 cycle-2 comparison pairwise-PType must-fixes. Highly reachable: column dtypes are diverse and Rust literal types easily mismatch.", "recommended_fix": "Add pairwise gate BEFORE building comparisons: `let s = schema?; let col_dt = resolve_inner_dtype(input[0].node(), arena, s)?; let lo_dt = resolve_inner_dtype(input[1].node(), arena, s)?; let hi_dt = resolve_inner_dtype(input[2].node(), arena, s)?; if col_dt != lo_dt || col_dt != hi_dt { return None; }`. Mirror the precedent in Plus / comparison gates for the type_coercion-off path. Add a unit test `shape_is_between_cross_ptype_returns_none` (col=Int32, lo/hi=Int64). After the gate lands, rewrite or delete the misleading transitivity comment.", "found_by": ["fresh", "correctness"]},
{"severity": "must-fix", "kind": "bug", "file_line": "crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs:563-572", "description": "Ternary arm lacks THEN/ELSE pairwise-dtype gate. Comment at L556-562 falsely claims 'Vortex's CaseWhen is type-checked at builder time' — inspection of vortex-array/src/expr/expression.rs:45-62 shows try_new only validates arity; the dtype unification fires in CaseWhen::return_dtype at scan/pruning time (vortex-array/src/scalar_fn/fns/case_when.rs:185-191) with 'CaseWhen THEN and ELSE dtypes must match'. Cross-dtype truthy/falsy (e.g., nested Ternary with col_int32 vs col_int64) recurse independently and bypass the gate. Same bug class as PR-2.2 cycle-1 Plus must-fix; violates always-SAFE-fallback contract. Practical reach reduced for a Ternary at the filter root (Polars validates it must be Boolean-typed), but nested Ternary inside a complex predicate when TYPE_COERCION is off remains reachable.", "recommended_fix": "Mirror Plus arm gate BEFORE recursing: `let s = schema?; let then_dt = resolve_inner_dtype(*truthy, arena, s)?; let else_dt = resolve_inner_dtype(*falsy, arena, s)?; if then_dt != else_dt { return None; }`. resolve_inner_dtype currently has no Ternary arm — recursion through it returns None which is the safe fallback. Fix the false comment to accurately describe the explicit gate. Add unit test `shape_ternary_cross_dtype_returns_none`.", "found_by": ["fresh", "correctness"]},
{"severity": "must-fix", "kind": "bug", "file_line": "crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs:471-510", "description": "StringExpr arm lacks Utf8 input gate on input[0]. Vortex's Like.return_dtype vortex_bail!s at scan-time if input dtype isn't Utf8 (vortex-array/src/scalar_fn/fns/like/mod.rs:124-132). If input[0] is non-String (e.g., a struct field access producing Int32, or a Cast whose source isn't String), pushdown emits a Vortex `like(<Int32-typed-expr>, <utf8-pattern>)` that fails at scan-time. Violates always-SAFE-fallback contract — same class as PR-2.3 CAST gate.", "recommended_fix": "Add Utf8 gate: `let s = schema?; let col_dt = resolve_inner_dtype(input[0].node(), arena, s)?; if col_dt != DataType::String { return None; }` BEFORE building the `like` call. Add unit test `shape_starts_with_on_int_column_returns_none`.", "found_by": ["fresh"]},
{"severity": "should-fix", "kind": "missing-acceptance", "file_line": "py-polars/tests/unit/io/test_vortex.py:330-489", "description": "Acceptance criterion (c) NOT MET: no test verifies pushdown actually engages. All 7 e2e tests assert only post-decode correctness, which the residual-reapply preserves regardless. A regression where the convertor returns None for all 6 new shapes would silently fall back to residual and still pass every test.", "recommended_fix": "Add at least one engagement assertion per shape — either inspect `pl.scan_vortex(...).filter(...).explain()` for the pushed Vortex expression, use display_tree() to verify the pushdown path, or POLARS_VERBOSE + caplog. Alternative: Rust integration test mirroring shape_plus_arithmetic_structural's structural-assertion discipline.", "found_by": ["fresh", "correctness"]},
{"severity": "should-fix", "kind": "coverage", "file_line": "crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs:1889-1947", "description": "is_in arm has ZERO unit tests. Test-module comment claims 'covered by e2e Python tests; arena-level Series literal construction is awkward' but the e2e tests don't cover: (a) nulls_equal=true + haystack-contains-null (regression of this gate would silently push a narrower filter); (b) per-scalar conversion failure; (c) empty haystack. is_between IS unit-tested, contradicting the 'awkward' rationale.", "recommended_fix": "Add unit tests: shape_is_in_positive (using Series::from), shape_is_in_nulls_equal_with_null_returns_none, shape_is_in_empty_haystack_returns_none, shape_is_in_per_scalar_conversion_failure_returns_none.", "found_by": ["fresh", "correctness"]},
{"severity": "should-fix", "kind": "coverage", "file_line": "crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs:1898-2089", "description": "Coverage gaps that would catch the must-fix bugs: (a) is_between cross-PType bounds; (b) Ternary cross-dtype; (c) StringExpr on non-String column; (d) is_between Right/None ClosedInterval variants (only Both/Left covered both at unit and e2e level); (e) bytes_to_like_literal wildcard test only covers '%', missing '_' and '\\\\'.", "recommended_fix": "Add tests alongside the must-fix gate work: shape_is_between_cross_ptype_returns_none, shape_ternary_cross_dtype_returns_none, shape_starts_with_on_int_column_returns_none, shape_is_between_right_inclusive, shape_is_between_none_inclusive, bytes_to_like_literal arm tests for '_' and '\\\\'.", "found_by": ["fresh", "correctness"]},
{"severity": "should-fix", "kind": "doc-quality", "file_line": "crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs:27-64", "description": "Module-level doc-comment stale post-PR-2.7. L27 still says '(PR-2.1 foundation + PR-2.2 + PR-2.3 + PR-2.4 extensions)'. L29 still says '16 shapes' — should be 22. Shape table (L33-50) missing 6 new shapes. 'What this module does NOT cover yet' paragraph (L52-64) explicitly lists 'Ternary' (L61) — now covered.", "recommended_fix": "Update doc to include PR-2.7 shapes in the table, bump '16 shapes' to '22 shapes', remove 'Ternary' from the not-covered list, remove 'is_between'/'is_in'/'str.starts_with'/'str.ends_with' caveats.", "found_by": ["fresh"]},
{"severity": "should-fix", "kind": "convention", "file_line": "crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs:311-318", "description": "Factually wrong comment: 'Schema-gated only transitively: the recursive aexpr_to_vortex_expression calls inherit the existing comparison pairwise-PType gate (BinaryExpr arm)'. False — the arm builds gt/gt_eq DIRECTLY via Vortex builders, never re-entering the AExpr::BinaryExpr arm. Misleads readers and is design root cause of must-fix #1.", "recommended_fix": "After adding the explicit pairwise gate, rewrite the comment to describe the explicit gate (not a transitive one). Or delete the misleading claim.", "found_by": ["fresh", "correctness"]},
{"severity": "should-fix", "kind": "convention", "file_line": "crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs:556-562", "description": "Comment falsely claims 'Vortex's CaseWhen is type-checked at builder time' AND 'recursion's existing CAST/Plus pairwise-PType gates protect against mismatched arms'. Both FALSE: case_when() builder only validates arity; inner Plus/CAST gates protect their own subtrees, not Ternary truthy-vs-falsy mismatch. Root cause of must-fix #2.", "recommended_fix": "Rewrite to describe the added gate. Cite case_when.rs:185-191 as the bail-at-scan-time site.", "found_by": ["fresh", "correctness"]},
{"severity": "nit", "kind": "doc-quality", "file_line": "crates/polars-vortex/README.md:240", "description": "is_in README entry mentions only the nulls_equal=true+nulls refuse path; doesn't mention the per-scalar conversion refuse (e.g., Duration scalars in haystack fail polars_scalar_to_vortex).", "recommended_fix": "Append 'or any haystack scalar fails Polars→Vortex conversion' to the is_in row's path-description cell.", "found_by": ["fresh"]},
{"severity": "nit", "kind": "encapsulation", "file_line": "crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs:855-875", "description": "bytes_to_like_literal is private; only '%' branch tested indirectly. Other branches ('_', '\\\\', invalid UTF-8) are dead from a test standpoint.", "recommended_fix": "Either make pub(super) for direct unit testing, or add three more arm-level negative tests using each unsafe byte: '_', '\\\\', and a needle that's not valid UTF-8.", "found_by": ["fresh"]}
],
"disagreements": [
{"topic": "Ternary cross-dtype gate severity (must-fix vs should-fix)", "positions": [{"lens": "fresh", "position": "must-fix — same bug class as PR-2.3 cycle-1 CAST cross-kind and PR-2.2 cycle-1 Plus must-fixes. Violates always-SAFE-fallback contract. Should be held to the same bar as the other pairwise-dtype gates."}, {"lens": "correctness", "position": "should-fix — same bug class but with reduced practical reach. A Ternary at the filter root must be Boolean-typed (Polars validates) so the cross-dtype hazard is gated to nested Ternary inside a complex predicate when TYPE_COERCION is off."}], "synthesizer_call": "Hold at must-fix per spec rule (severity dedupe = HIGHEST) and per the existing precedent pattern (Plus/comparison gates were must-fix even though TYPE_COERCION often handles cross-PType in practice). Reduced reach is a mitigation argument, not a category change. Correctness's reachability rationale captured for implementer awareness."}
],
"dropped_re_flags": [],
"executive_summary": "Both reviewers independently converged on REJECT with high confidence. Three must-fix correctness gates of the same bug class as PR-2.3 cycle-1 CAST cross-kind and PR-2.4 cycle-2 comparison pairwise-PType: (1) is_between arm lacks pairwise-PType gate; (2) Ternary arm lacks THEN/ELSE pairwise-dtype gate; (3) StringExpr arm lacks Utf8 input gate. All three violate the always-SAFE-fallback contract that PR-2.6's Option B->A cutover rests on. Fixes are 4-6 lines each, mirroring the Plus arm gate precedent. Two of the three are accompanied by misleading comments that mask the design errors and need rewriting alongside the gate fixes. Severity disagreement on Ternary resolved per spec to HIGHEST = must-fix. Coverage and engagement-assertion gaps reinforce the must-fix work: acceptance criterion (c) is NOT met; is_in arm has ZERO unit tests despite the 'awkward' rationale; is_between Right/None ClosedInterval variants are unexercised; cross-PType refusal tests for the three must-fix gates are missing. Verified design strengths: bytes_to_like_literal escape logic preserves legacy parity; ClosedInterval mapping correct; match arm order correct; cfg-gating consistent; is_in's haystack-null refuse decision sound; multi-byte UTF-8 needles handled cleanly. Total estimated work: 12-25 LoC + ~10 unit tests + doc updates.",
"overall": "reject",
"must_fix_count": 3,
"should_fix_count": 6,
"nit_count": 2,
"review_cycles_this_invocation": 1
}
```
</details>
(cherry picked from commit a094497)
…ype gate (review finding) The is_between arm constructs Vortex gt/gt_eq/lt/lt_eq builders DIRECTLY rather than re-entering AExpr::BinaryExpr, so the BinaryExpr arm's pairwise-PType gate would NEVER fire transitively. Cross-PType bounds (Int32 col with Int64 literal bounds when TYPE_COERCION is off) would vortex_bail! at scan-time in Vortex's Binary::return_dtype with 'Cannot compare different DTypes', violating the always-SAFE-fallback contract. Same bug class as PR-2.3 cycle-1 CAST cross-kind and PR-2.4 cycle-2 comparison pairwise-PType must-fixes. Add an explicit pre-recursion pairwise-PType gate (4 lines) mirroring the BinaryExpr arm's pattern. Rewrite the misleading 'Schema-gated only transitively' comment to describe the explicit gate. Add 4 unit tests (shape_is_between_right_inclusive, shape_is_between_none_exclusive, shape_is_between_cross_ptype_returns_none, shape_is_between_without_schema_returns_none) which together close the cycle-1 should-fix #6d ClosedInterval coverage gap AND the cycle-1 should-fix #6a cross-PType test gap. (cherry picked from commit 9098452)
… gate) (cherry picked from commit d15b12e)
…wise-dtype gate (review finding) The Ternary arm fed truthy/falsy into Vortex's case_when builder without verifying they share a dtype. case_when's try_new builder only validates arity (vortex-array/src/expr/expression.rs:45-62); the dtype unification check fires later in CaseWhen::return_dtype (vortex-array/src/scalar_fn/fns/case_when.rs:185-191) which vortex_bail!s with 'CaseWhen THEN and ELSE dtypes must match' at scan/pruning time. Cross-dtype truthy/falsy (nested Ternary with col_int32 vs col_int64 when TYPE_COERCION is off) bypassed the inner arms' CAST/Plus gates, violating the always-SAFE-fallback contract. Same bug class as PR-2.2 cycle-1 Plus must-fix. Reachability: a Ternary at the filter root must be Boolean-typed (Polars validates), so the cross-dtype hazard is gated to nested Ternary inside a complex predicate. Held to must-fix per existing precedent (Plus / comparison gates) — always-SAFE-fallback contract is binary. Add an explicit pre-recursion THEN/ELSE pairwise-dtype gate (4 lines) mirroring the Plus arm pattern. Rewrite the misleading 'CaseWhen is type-checked at builder time' + 'CAST/Plus gates protect mismatched arms' comments to accurately describe the explicit gate. Add 2 unit tests (shape_ternary_cross_dtype_returns_none, shape_ternary_without_schema_returns_none). (cherry picked from commit 85ca567)
(cherry picked from commit 65d1e15)
…gate (review finding)
Vortex's Like kernel (vortex-array/src/scalar_fn/fns/like/mod.rs:124-132
return_dtype) vortex_bail!s with 'Cannot apply like to non-Utf8 input' at
scan-time if input[0] isn't Utf8. The StringExpr arm fed input[0] directly
into the recursive convertor without verifying the column's dtype, so a
non-String input (struct field access producing Int32, Cast whose source
isn't String) would silently push a like(<non-utf8>, <utf8-pattern>) that
fails at scan-time, violating the always-SAFE-fallback contract. Same bug
class as PR-2.3 cycle-1 CAST cross-kind must-fix.
Add an explicit Utf8 input gate (3 lines) BEFORE any needle extraction or
pattern construction work. Update existing 4 StringExpr unit tests
(shape_starts_with_positive / _wildcard_in_needle / shape_ends_with_positive
/ shape_contains_literal_{true,false}) to pass schema (the gate now consults
schema?). Add 4 new unit tests: shape_starts_with_on_int_column_returns_none
(must-fix #3 regression), shape_starts_with_without_schema_returns_none
(no-schema refuse), shape_starts_with_underscore_in_needle_returns_none and
shape_starts_with_backslash_in_needle_returns_none (close cycle-1 should-fix
#6e and nit pola-rs#11 coverage gap).
Also adds schema_s_string() and schema_s_string_i_int32() test helpers.
(cherry picked from commit b2b9659)
… Utf8 gate) — all 3 must-fix done; should-fix sweep next (cherry picked from commit 8397a3a)
- vortex_convertor.rs module-level doc-comment refresh (should-fix pola-rs#7): bumps shape count 16 -> 22, adds 6 PR-2.7 shapes to the table with their Vortex builder + gate notes, removes Ternary from the 'not covered yet' list, updates PR-2.5 reference to reflect the Deferred-work slip. - 5 structural-assertion unit tests for the new shapes (should-fix #4 — acceptance criterion (c)): shape_is_between_structural, shape_starts_with_structural, shape_ends_with_structural, shape_contains_literal_true_structural, shape_ternary_structural. Each asserts the produced Vortex Expression's Display contains the expected builder anchor (and/like/case_when) plus the column/literal substrings. Catches paste-swap bugs the prior .is_some()-only tests would miss (mirrors the shape_plus_arithmetic_structural discipline from PR-2.2 cycle-1). is_in structural omitted — its Series-literal arena construction needs SpecialEq<Series> infrastructure not yet wired into the test helpers; e2e Python test covers correctness. - README.md is_in entry refinement (cycle-1 nit pola-rs#10): document the per-scalar-conversion refuse path (Duration scalars in haystack fail polars_scalar_to_vortex) alongside the existing nulls_equal+nulls refuse-path note. (cherry picked from commit 9473106)
…Ternary arm (review finding) The cycle-1 must-fix #2 fix added a THEN/ELSE pairwise-dtype gate to the Ternary convertor arm, calling resolve_inner_dtype(*truthy)? to verify match. But resolve_inner_dtype had no Ternary arm — fell to the catchall _=>None — so any NESTED Ternary truthy/falsy would unconditionally return None, ?-propagating to refuse pushdown even when inner dtypes match. Fail-closed-safe but a real coverage regression vs the pre-gate baseline. Add an AExpr::Ternary arm to resolve_inner_dtype that recurses on truthy+falsy and returns Some(t) when t==e, mirroring the Plus arm's self-contained verification pattern at L726-730. Walk truthy + falsy, require equality (Vortex's case_when builder requires THEN and ELSE share a return_dtype), unresolvable shapes fall through to None. Add a regression test shape_ternary_nested_matching_dtypes_pushes that constructs a 2-level nested Ternary (outer.truthy = inner Ternary) and asserts pushdown engages. Before this fix the test would have asserted .is_some() but produced None. (cherry picked from commit fa89c9b)
…/is_in imports (review finding) PR-2.7 added 5 new imports (AnyValue, VortexScalar, like, or_collect, and the fn bytes_to_like_literal) that are used ONLY inside arms gated on strings or is_in, but the imports themselves were ungated. Under `cargo build -p polars-plan --features vortex --no-default-features` this produced 5 fresh unused-import warnings; the umbrella CI build uses --all-features so the warnings didn't fail loud. Gate each import + the fn definition matching the existing pattern (e.g., the cycle-1 PR-2.7 already gated IRStringFunction on `#[cfg(feature = "strings")]` — same idiom for the underlying types the StringExpr arm uses). Split AnyValue from DataType in the prelude line (DataType is used by every arm — must stay ungated; AnyValue is used only in the StringExpr needle extraction). Split like/or_collect out of the multi-import on the vortex::expr line and gate each on strings/is_in respectively. Verified clean: cargo check -p polars-plan --features vortex (no warnings on vortex_convertor.rs), with all PR-2.7 features (--features vortex,is_between,is_in,regex,strings,dtype-struct, same), and umbrella cargo check -p polars --features vortex,cloud,parquet,dtype-full. 88 unit tests pass. (cherry picked from commit d457a13)
…hen structural tests (review finding) Cycle-1 added 5 structural-assertion tests (shape_is_between_structural, shape_starts_with_structural, shape_ends_with_structural, shape_contains_literal_true_structural, shape_ternary_structural) to close cycle-1 acceptance criterion (c) but cycle-2 reviewers flagged three with loose assertions that don't catch specific paste-swap classes: #3 shape_is_between_structural: vacuous negative anchor '!s.contains("eq(") && !s.contains("==")'. Vortex's binary Display uses single-char '=' ('(x = 10)'), never 'eq(' or '=='; both substrings are absent in correct AND eq-paste-swap outputs. Replace with 's.contains(">") && s.contains("<")' (an eq-paste-swap '(x = 10) AND (x = 20)' has neither '>' nor '<'). #4 shape_starts_with_structural + shape_ends_with_structural: don't catch starts↔ends paste-swap. Both produce 'like'+needle+'%' Display strings differing only in '%' position. Strengthen with joint substring (s.contains("prefix%")) + negative anchor (!s.contains("%prefix")). Mirror for ends_with. pola-rs#5 shape_ternary_structural: doesn't catch truthy↔falsy paste-swap. case_when(cond, b, a) still satisfies s.contains('a') && s.contains('b') because both column refs remain present. Switch truthy/falsy from cols 'a' / 'b' to distinct literals 777 / 888 + assert positional ordering (t_pos < f_pos in the Display string). All 88 tests pass (5 structural tests strengthened; shape_plus_arithmetic_structural unchanged). (cherry picked from commit a9918a2)
…+ nit pola-rs#7 — README gate notes + is_in empty-haystack comment (review finding) - README pushdown table: add the three schema-gate notes that the cycle-1 fix-commits added to the convertor module-doc table but didn't propagate to README. is_between → '(schema-gated pairwise-PType)'. starts_with / ends_with / contains → '(schema-gated Utf8 input)'. Ternary → '(schema-gated THEN/ELSE pairwise-dtype)'. Closes the H1 cross-reference drift between convertor source and README documented in cycle-2 finding pola-rs#6. - vortex_convertor.rs is_in arm: add a 6-line comment before or_collect(terms) acknowledging that 'or_collect(vec![])' returns None for an empty haystack → arm returns None. Polars's post-decode residual reapply handles the all-false 'is_in([])' semantic correctly, so this is a missed optimization (we could pre-empt with lit(false)) — not a bug. Closes cycle-2 finding nit pola-rs#7. All 88 unit tests still pass. (cherry picked from commit aebb302)
…e Deferred entry Polars's is_in uses TotalOrd (NaN==NaN); Vortex's eq uses IEEE 754 (NaN!=NaN). This divergence affects ALL float-comparison convertor arms (foundation eq/not_eq/lt/lt_eq/gt/gt_eq since PR-2.1 + PR-2.7's new is_in/is_between). The legacy SpecializedColumnPredicate::EqualOneOf path had the same divergence; PR-2.7 just amplifies the surface. Add a Deferred work entry capturing the divergence with two mitigation paths: (a) refuse pushdown when any float literal is NaN (~10 LoC, conservative); (b) accept the divergence and document as known limitation. Not blocking — residual reapply preserves correctness. Tracked so a future read doesn't assume float pushdown is fully isomorphic with Polars semantics. (cherry picked from commit 5e68b4f)
… ready for cycle 3 (cherry picked from commit 71540df)
Three-cycle inner-loop gauntlet history: - Cycle 1 REJECT (3 must-fix + 6 should-fix + 2 nit) — schema-gate gaps in the same bug class as PR-2.3 / PR-2.4 cycle-1 pairwise-PType must-fixes. - Cycle 2 ACCEPT (0 must-fix + 6 should-fix + 2 nit) — gates correctly implemented; user picked maximum-thoroughness apply-all-polish-inline. - Cycle 3 ACCEPT (0 must-fix + 0 should-fix + 1 nit) — polish verified clean; only finding is pre-existing CastOptions unused-import (PR-2.3 era, out of PR-2.7 scope). 17 PR-work commits 065df50..71540df (feat: convertor arms; docs: e2e tests + README; plan: amend complete + awaiting + cycle 1 atomic; fix×3 + decrement×3 must-fix gates; docs: cycle-1 should-fix sweep; fix×3 cycle-2 should-fix; docs + plan: cycle-2 polish + NaN Deferred + cycle-2 polish landed). 25 new convertor unit tests + 7 new e2e Python tests. 88 total convertor tests pass. Current State advances: status: executing, current_pr: PR-2.8, pr_index: 9, per-PR counters reset. Phase 2 phase-end review (cycle 2) will fire after PR-2.8 completes. Append Implementation status entry capturing scope shipped, tests added, 3-cycle review history, confidence, deferred items (1 new: Float NaN semantic divergence), and 3 surprises during implementation (H4 self- reinforcement validated, H2 new-prose internal edge case validated, prior_fix_commit_sha attention block worked as designed). (cherry picked from commit 024341d)
…olumn per-column predicate split
The PR-2.2 cycle-1 must-fix M1 conservative refuse at lower_ir.rs:801-815
gave up on convertor pushdown ENTIRELY when ANY of hive_parts /
unified_scan_args.row_index / unified_scan_args.include_file_paths was set.
Correctness held via the multi-scan PARTIAL_FILTER reapply but the
file-column parts of a mixed predicate missed Vortex zone pruning.
PR-2.8 turns the all-or-nothing guard into a per-column split mirroring
polars-mem-engine/scan_predicate/functions::create_scan_predicate's
hive_predicate extraction:
1. Add helper aexpr_file_minterms_to_vortex_expression in
vortex_convertor.rs (~30 LoC). Walks top-level conjuncts via
MintermIter, filters to file-only minterms (no leaf-column in
virtual_cols), converts each via aexpr_to_vortex_expression, and
AND-collects. Refuses pushdown if no file-only minterms remain.
Strictly-better than aexpr_to_vortex_expression on the whole tree:
partial conversion is now possible (a == 1 AND unsupported_op(b)
pushes a == 1 instead of refusing the whole thing).
2. Refactor lower_ir.rs FileScanIR::Vortex arm to build the
virtual_cols set from hive_parts.schema().iter_names() +
row_index.name + include_file_paths, then call the new helper
unconditionally instead of gating on hive/row_index/file_paths
presence.
3. Add 5 unit tests covering: all file-only AND-collects all;
all virtual returns None; partial pushes file part only;
top-level OR with virtual operand refuses (single minterm
references both); unsupported subtree dropped in partial push.
4. Add e2e Python test test_scan_with_hive_and_file_col_mixed_filter
exercising (pl.col('x') > 5) & (pl.col('year') == 2024) over a
hive-partitioned vortex dataset. Verifies correctness; a regression
mis-classifying a hive col as file would surface as a Vortex
ComputeError 'missing column year'.
Resolves Deferred work entry 'Virtual-column-partitioned Vortex scans
don't benefit from AExpr convertor pushdown' (PR-2.2 cycle-1 must-fix
M1 + cycle-2 C2-001).
Verification:
cargo test -p polars-plan --features vortex,is_between,is_in,regex,strings,dtype-struct vortex_convertor
→ 93 passed (88 prior + 5 new minterms_* tests)
cargo check -p polars-stream --features vortex,cloud → clean
cargo check -p polars --features vortex,cloud,parquet,dtype-full → clean
(cherry picked from commit 9d9469f)
(cherry picked from commit 86c00f8)
PR-2.8 cycle 1 gauntlet (2-vote pr-2 preset: fresh + correctness)
returned accept with 0 must-fix, 3 should-fix on test rigor:
1. Three new minterms_* unit tests (all_file_only_collects_all,
partial_pushes_file_part_only, unsupported_subtree_dropped_in_
partial_push) asserted only .is_some(). Paste-swap test: swapping
the bodies of any two would leave both passing — a regression
ignoring virtual_cols entirely (pushing both file + virtual
conjuncts to Vortex) or inverting polarity (keeping virtual,
dropping file) would not be caught.
2. e2e coverage for partial-push: only hive-col virtual was
exercised; row_index and include_file_paths virtuals went through
the same lower_ir.rs branch but had no e2e test.
Fixes:
1. Tighten the 3 unit tests with the joint-substring + negative-anchor
structural assertion pattern established in PR-2.7 cycle 2
(shape_starts_with_structural, shape_ends_with_structural).
Changed literals to unique discriminators (42 kept-side, 99 for
the second file conjunct, 9999 for dropped virtual, 888 for dropped
unsupported subtree). Each tightened test now formats the produced
Vortex Expression and asserts both (a) the kept-side literal is
present and (b) the dropped-side literal is absent.
2. Add 2 e2e tests:
- test_scan_with_row_index_and_file_col_mixed_filter — row_index
virtual + (x > 5) & (ri > 10) filter
- test_scan_with_include_file_paths_and_file_col_mixed_filter —
include_file_paths virtual + (x > 5) & src.str.contains('a')
filter across two files
Each new e2e test verifies result correctness; a regression
mis-classifying the virtual col as a file column would surface as
a Vortex ComputeError on missing column.
Carry-forward (cycle 1 deferred / nits):
- e2e pushdown-engagement verification (.explain() / metrics
inspection) remains tied to the POLARS_VORTEX_VERIFY_PUSHDOWN
Deferred entry (no change in this commit).
- virtual_cols hoist outside the and_then closure at lower_ir.rs:807
— true nit, deferred.
Verification:
cargo test -p polars-plan --features vortex,is_between,is_in,regex,strings,dtype-struct minterms_
→ 5 passed (all five PR-2.8 minterms_* tests)
cargo check -p polars --features vortex,cloud,parquet,dtype-full,strings → clean
(cherry picked from commit f55e00b)
…criminator
PR-2.8 cycle 2 gauntlet (2-vote pr-2 preset) rejected with 1 must-fix
caught by BOTH reviewers (fresh + correctness):
test_scan_with_include_file_paths_and_file_col_mixed_filter used
pl.col("src").str.contains("a") as the discriminator between
a.vortex and b.vortex. include_file_paths populates `src` with the
FULL path (per ScanSourceRef::to_include_path_name at
crates/polars-plan/src/dsl/scan_sources.rs, no basename
normalization). pytest's tmp_path fixture generates a path like
/var/folders/.../pytest-of-USER/pytest-N/
test_scan_with_include_file_paths_and_file_col_mixed_filter0/
{a,b}.vortex — the test name itself contains multiple 'a' chars,
so str.contains("a") evaluates TRUE for BOTH a.vortex and b.vortex.
The predicate degenerates to `x > 5` alone, returning 5 rows
instead of the expected 2; assertions fail in CI.
Fix: anchor the discriminator on the basename suffix via
str.ends_with("a.vortex"). Cross-platform, robust against any
tmp_path naming, and clearly intentional. Updated docstring to note
that `src` is the FULL path. Updated final assertion to use
.endswith() (matches the new discriminator's mental model).
The remaining structural assertions on the 3 minterms_* unit tests
(literals 42 / 99 / 9999 / 888 as unique discriminators) are validated
against Vortex's display format (cycle 2 correctness reviewer
confirmed: i32 literals render as `42i32`, no operator or connector
contains digits, no false-positive risk).
Verification:
cargo check -p polars --features vortex,cloud,parquet,dtype-full,strings → clean
cargo test -p polars-plan --features vortex,is_between,is_in,regex,strings,dtype-struct minterms_
→ 5 passed
python3 -c "import ast; ast.parse(open('py-polars/tests/unit/io/test_vortex.py').read())" → OK
(cherry picked from commit c9fd818)
…phase-end review PR-2.8 cycle 3 ACCEPTED by 2-vote pr-2 gauntlet (0 must-fix, 0 should-fix, 0 nit; both lenses clean). Plan updates: 1. Current State: status → phase-boundary; current_pr → null; phase_end_cycle → 2; last_commit → c9fd818; subagent_invocations_this_pr → 6; review_cycles_this_pr → 3 2. Implementation status: added PR-2.8 entry with 4 commits, 3-cycle review history, 2 lessons-learned bullets (e2e discriminator anti-pattern + paste-swap-vulnerability spans both unit + e2e layers) 3. Deferred work: 'Virtual-column-partitioned Vortex scans don't benefit from AExpr convertor pushdown' RESOLVED-in-PR-2.8 with strikethrough convention Phase 2 amend (PR-2.7 + PR-2.8) complete. Phase 2 exit criteria status: - (a) 4-vote phase-end review accepts cycle 2 — PENDING (next: this firing) - (b) §5 pushdown coverage rows implemented/deferred — SATISFIED (cycle 1 ACCEPT, plus PR-2.7 ports 6 cutover-lost shapes + PR-2.8 closes virtual-column gap) - (c) Pushdown engagement via display_tree / POLARS_VERBOSE — PARTIAL (structural assertions on unit tests; e2e engagement-verification via POLARS_VORTEX_VERIFY_PUSHDOWN remains deferred) - (d) Build/test suite green — SATISFIED locally; CI verification pending push - (e) 10 cycle-3 should-fix carry-forward items — SATISFIED in PR-2.0 - (f) Cutover-lost shapes push down — SATISFIED in PR-2.7 - (g) Virtual-column scans push down file-col part — SATISFIED in PR-2.8 Ready for cumulative Phase 2 cycle-2 4-vote phase-end gauntlet (phase-4 preset: spec + correctness + maint + arch lenses) on the cumulative Phase 2 diff (vortex-integration-phase-1..HEAD). (cherry picked from commit 78732e5)
Phase 2 cycle 2 4-vote phase-end gauntlet (phase-4 preset) returned reject with 3 must-fix (1 spec + 2 maint) + 4 should-fix (1 each across spec / correctness / maint / arch lenses). Must-fix items addressed: 1. (spec) Two Deferred work entries 'PR-2.6 cutover-lost pushdown shapes' and 'Two additional §5-row shapes not yet in the AExpr-direct convertor' were not strikethrough/RESOLVED per PR-2.7 plan-row acceptance criterion (f). Both entries now marked with strikethrough + 'RESOLVED in PR-2.7 (commits: ...)' annotation, mirroring the PR-2.8 entry's format. 2. (maint) test_scan_with_hive_partitioning_and_filter docstring described the all-or-nothing virtual-column refuse guard at lower_ir.rs:780-791 (deleted in PR-2.8) and the legacy polars_to_vortex_predicate fallback (deleted in PR-2.6). Rewrote docstring to describe the current PR-2.8 per-minterm split mechanism via aexpr_file_minterms_to_vortex_expression. 3. (maint) test_scan_with_row_index_and_filter docstring similarly stale — described the 'cycle-2 hive-only guard extended to row_index' refuse path that no longer exists. Rewrote to describe the per-minterm split. Should-fix items addressed: 4. (maint SF1) polars-vortex/src/read/predicate.rs module header listed 4 shapes as 'Deferred work items the AExpr-direct convertor does NOT yet handle' — but PR-2.7 ported all 4 plus 2 more. Updated the 'Coverage parity' section to enumerate the 22 shapes now supported as of PR-2.7 + PR-2.8, plus the residual list (non-Strict CAST, cross-kind CAST, regex contains, temporal extracts, Float16). 5. (maint SF2) polars-vortex/README.md inner architecture table called Column(name) gate 'virtual-column refuse in lower_ir' (now per-minterm drop in helper), listed Ternary in the refused-shapes prose (contradicted the outer Pushdown coverage table at line 242 which shows Ternary shipped), and described the lower_ir.rs guard as still refusing-whole-predicate. Updated the table to enumerate the 6 PR-2.7 shapes + the PR-2.8 per-minterm-drop convention; rewrote the prose below the table to describe the strictly-better-than-all-or-nothing behavior. 6. (correctness SF) No cross-PR unit test combined a PR-2.7 shape with a PR-2.8 virtual_col minterm. Added minterms_is_between_with_virtual_col_pushes_is_between_only — exercises the leaf-walk-through-Function-args invariant (aexpr_to_leaf_names_iter must descend through Function args via children_rev to find file vs virtual leaves) and asserts the kept-side bounds (10, 20) are present while the dropped-side literal (9999) is absent. Defends against a regression in leaf-walking. 7. (arch SF) Added 8 single-line `// SCHEMA-GATE: <kind>` markers at each pattern-match arm that directly constructs a Vortex builder (BinaryExpr, is_between, is_in, Function::Boolean (unary), StructField, StringExpr, Cast, Ternary). Greppable via `rg '// SCHEMA-GATE:'`. Module-level docstring extended with a 'SCHEMA-GATE convention' section listing all current gates — codifies the recurring discipline so a fresh arm-adder either replicates the gate or explicitly justifies why no gate is needed. Should-fix items deferred (per spec reviewer's Option (b) recommendation): 8. (spec SF) is_in arena-level structural unit test deferred — adding the test requires constructing a LiteralValue::Scalar(Scalar::new( DataType::List(Box::new(Int32)), AnyValue::List(series))) haystack node, which needs spinning up polars-core Series + AnyValue::List shell (~20-30 LoC of test setup vs ~10 LoC for other shapes). e2e coverage at test_scan_with_is_in_filter protects user-visible correctness. Documented as a Deferred work entry tracking this defense-in-depth gap. Also fixed the misleading comment at vortex_convertor.rs:2046-2058 that claimed is_between also lacked unit coverage (it does have shape_is_between_structural). Plan updates: - 2 Deferred entries strikethrough/RESOLVED in PR-2.7 - 1 new Deferred entry: 'is_in arena-level structural unit test' Verification: cargo check -p polars --features vortex,cloud,parquet,dtype-full,strings → clean cargo test -p polars-plan --lib --features vortex,is_between,is_in,regex,strings,dtype-struct vortex_convertor → 94 passed (was 93; +1 cross-PR test) python3 -c "import ast; ast.parse(open(...test_vortex.py))" → OK (cherry picked from commit 0a1475b)
Phase 2 cycle 3 4-vote phase-end gauntlet (phase-4 preset: spec + correctness + maint + arch lenses) ACCEPTED with 0 must-fix, 0 should-fix, 0 nit across all 4 lenses. Cycle 2 reject had 3 must-fix (1 spec Deferred-entry markup + 2 maint stale e2e docstrings) + 4 should-fix; the sweep commit 0a1475b applied all 7 (3 MF + 4 SF) in a single 5-file diff: - 2 Deferred entries strikethrough/RESOLVED in PR-2.7 with commit refs - 2 stale e2e test docstrings rewritten to describe PR-2.8 mechanism - polars-vortex/src/read/predicate.rs module header refreshed - polars-vortex/README.md inner architecture table refreshed - New cross-PR unit test (PR-2.7 shape × PR-2.8 minterm helper) - 8 SCHEMA-GATE markers + module-doc convention section codifying the recurring schema-gate discipline across PR-2.2/.3/.4/.7 - 1 new Deferred entry: is_in arena-level structural test (per spec reviewer Option (b) — e2e coverage protects user-visible correctness; defense-in-depth gap tracked for Phase 3) Current State updates: - status: phase-boundary (was: same, but now ready for Step 3.4 gate) - current_phase: Phase 2 COMPLETE — awaiting Step 3.4 user gate - phase_end_cycle: 3, phase_end_reject_cycles: 1 - last_phase_end_verdict: accept - last_commit: 0a1475b - subagent_invocations_total: 60 (+9 since cycle 2: 4 cycle 2 reviewers + 4 cycle 3 reviewers + 1 cycle 2 implicit synthesis-skip) Phase 2 exit criteria status (all SATISFIED): - (a) 4-vote phase-end accepts cycle 3 (cycle 1 + 3 both accepted; cycle 2 rejected and was resolved in 0a1475b) - (b) §5 pushdown coverage rows implemented or documented as deferred - (c) Pushdown engagement via structural assertion at unit-test layer (e2e engagement-verification via POLARS_VORTEX_VERIFY_PUSHDOWN remains in Deferred work) - (d) Build/test suite still green (94 unit + 13 e2e tests) - (e) PR-2.0 cycle-3 should-fix carry-forward items resolved - (f) Cutover-lost shapes (PR-2.7) push down (6 shapes shipped) - (g) Virtual-column-partitioned scans (PR-2.8) push down file-col part Phase 2 cumulative: 78 commits on vortex-integration branch since the vortex-integration-phase-1 base (Phase 1 stack tip). Phase 2 amend (PR-2.7 + PR-2.8 + cycle 2 sweep) totals 30 commits. Awaiting Step 3.4 user gate before proceeding to Phase 3 (PR-3.1 file-stats + PR-3.2 multi-file + PR-3.3 nested-type coverage). (cherry picked from commit aa701b9)
…ase 2 cycle 3 The cycle 3 sweep commit 0a1475b introduced new docstrings (3 e2e test rewrites, 2 PR-2.8 e2e tests) + module-level convention doc + 8 SCHEMA-GATE markers + 1 cross-PR test that triggered: 1. ruff: 17 errors (14 D205 'blank line required between summary and description' + 2 W505 'doc line too long' + 1 D301 'use r""" for backslashes' + 1 D209 + 1 RUF100). Of these, 3 were auto-fixable via ruff --fix; 11 required manual docstring restructuring (collapse multi-line summary to single line + blank-line separator + add r""" prefix for `\` in docstring at the wildcard-negative-path test). Many were pre-existing in test_vortex.py from earlier PRs; accumulated unfixed across prior cycles. 2. rustfmt: import-order normalization in vortex_convertor.rs after the new use crate::plans::aexpr::MintermIter line (cycle 1 PR-2.8) shifted neighboring import ordering. cargo fmt applied cleanly. Verification: uv tool run ruff check py-polars/ → All checks passed cargo fmt --check → clean cargo check -p polars --features vortex,cloud,parquet,dtype-full,strings → clean cargo test -p polars-plan --lib --features ... vortex_convertor → 94 passed (cherry picked from commit 6a8dc83)
…r-table refresh (cherry picked from commit 7b0f670)
…e 1++ 2026-05-19 PR-stack restructure: this branch (vortex-integration) is now **Phase 2 only** — AExpr-direct convertor + Option B→A cutover (PR-2.1 through PR-2.8 amend, 60 commits). PR-2.0 cleanups and Phase 3 work (file-stats, multi-file tests) moved to the Phase 1++ branch (vortex-integration-phase-1). User-facing PR split: - PR #2 (Phase 1++): complete robust Vortex integration foundation (read/write + legacy pushdown + file-stats + multi-file tests + PR-2.0 cleanups + bench harness + remaining Phase 1++ work yet to ship: nested-type + small-int + engagement infra + extended benches) - PR #1 (Phase 2): pure perf follow-on — replaces legacy SpecializedColumnPredicate pushdown with AExpr-direct convertor; measured speedup visible against the Phase 1++ baseline benches Backup at backup-vortex-integration-pre-restack preserves the pre-restack tip (7b0f670) for audit. Phase 2 cycle-3 4-vote phase-end ACCEPTED verdict is preserved through the rebase. PR #1 base needs to be retargeted to vortex-integration-phase-1 (currently points at the old tip). Plan rewrite + full Implementation status update deferred to next session.
4c0f5b1 to
1399146
Compare
|
Closed and reopened as a fresh PR to match phase ordering (foundation = lower-numbered PR). Same branch (vortex-integration), same content, no other changes — see the new PR for the active Phase 2 review. |
Replaces the
SpecializedColumnPredicate-based filter pushdown shipped in #2 with an AExpr-direct convertor that walks Polars'Arena<AExpr>at IR-build time inlower_ir.rs::FileScanIR::Vortex, where the arena is still in scope. The convertor handles a strictly broader shape set: scalar comparisons, boolean combinators, null checks, struct field access, numericcol + 1, same-kindCASTunderStrictoptions,is_between/is_in/str.starts_with/str.ends_with/str.contains(literal=True)/Ternary, and arbitrary multi-column predicates composed from the above. Schema gates on every direct-builder arm protect the always-SAFE-fallback contract — Vortex'schecked_add,cast, comparison, and struct-field builders each error at scan-time on certain inputs (cross-kind CAST, cross-PType arithmetic, missing struct fields), and the gates refuse pushdown so the residual filter handles those cases. When pushdown is refused for any shape, the multi-scan layer reapplies the original predicate post-decode viaPARTIAL_FILTER— correctness is invariant, the visible effect is wall-clock.Stacked on #2; review that PR first. The
vortex_scan/filter_arithmeticCriterion bench in #2 is the load-bearing measurement:col + 1 == Nfalls back to "decode everything then filter" on #2 alone and pushes down to Vortex's zone pruner here. Capture aphase-1baseline on #2's tip withcargo bench -p polars --features vortex,cloud,parquet,dtype-full,strings --bench io_vortex -- --save-baseline phase-1, then re-run with--baseline phase-1on this branch to see the delta.🤖 Generated with Claude Code