diff --git a/.big-plans/vortex-integration.md b/.big-plans/vortex-integration.md index 4628ed13e543..40a2fdee5b61 100644 --- a/.big-plans/vortex-integration.md +++ b/.big-plans/vortex-integration.md @@ -1,30 +1,45 @@ -# Vortex Integration into Polars — big-plans plan - -> Continuation of [spiraldb/polars#1](https://github.com/spiraldb/polars/pull/1) (`vortex-integration`, 31 existing commits, +6,497/-80, 73 tests). big-plans takes over the remaining work — retroactive ratification + CI green-up + PR-13 aggressive AExpr pushdown + PR-8 file-stats + PR-6 multi-file/nested coverage + PR-14 benches — and lands as one squash-merged PR onto `spiraldb:main`. +# Vortex Integration into Polars — big-plans plan (Phase 2 branch) + +> **2026-05-19 re-stack**: This branch (`vortex-integration`) is now **Phase 2 only** — AExpr-direct convertor + Option B→A cutover (PR-2.1 through PR-2.8 amend). PR-2.0 cleanups (segment_cache thread-through, etc.) and Phase 3 work (file-stats, multi-file tests) MOVED to the Phase 1++ branch (`vortex-integration-phase-1`) so the user-facing PR split is "Phase 1++ = complete robust Vortex integration" + "Phase 2 = pure perf follow-on (AExpr-direct pushdown vs legacy SpecializedColumnPredicate)". Backup at `backup-vortex-integration-pre-restack`. +> +> **Phase 2 contents on this branch** (60 commits, rebased onto Phase 1++ tip): +> - PR-2.1 — AExpr-direct convertor module foundation (Column/Literal/comparisons/booleans/null-checks) +> - PR-2.2 — wire convertor at lower_ir.rs + Plus arithmetic +> - PR-2.3 — CAST in predicates (same-kind + Strict only) +> - PR-2.4 — Struct field access + proactive Plus cross-PType gate +> - PR-2.5 — Temporal extracts (SLIPPED; Vortex 0.70.0 lacks public datetime_parts) +> - PR-2.6 — Option B → A cutover (delete legacy SpecializedColumnPredicate fast path) +> - Phase 2 cycle-1 phase-end should-fix sweep (CI green-up, plan/doc cleanup, orphan stub removal) +> - PR-2.7 — Cutover-lost shapes (is_between, is_in, starts_with, ends_with, str.contains{literal:true}, Ternary; 3 must-fix gates from cycle-1; cycle-2 polish; cycle 3 accept) +> - PR-2.8 — Virtual-column per-column predicate split (MintermIter-based) +> - Phase 2 cycle-2 phase-end sweep (3 must-fix + 4 should-fix including SCHEMA-GATE markers) +> - Phase 2 cycle-3 phase-end accept + lint fixes +> +> Phase 2's last commit was `7b0f6709f7` pre-restack; the rebased equivalent is the current tip. The Phase 2 cycle-3 4-vote phase-end ACCEPTED verdict is preserved through the rebase. ## Current State ```yaml -status: executing -branch: vortex-integration +status: phase-boundary +branch: vortex-integration (Phase 2 stack tip; REBASED 2026-05-19 onto extended vortex-integration-phase-1) planning_sub_flow: null -current_phase: "PR-2.0 housekeeping + PR-13 aggressive AExpr pushdown" +current_phase: "Phase 2 COMPLETE — re-stacked 2026-05-19 onto Phase 1++ tip" phase_index: 2 -current_pr: PR-2.0 -pr_index: 1 +current_pr: null +pr_index: 9 outstanding_must_fix: 0 -deferred_items_total: 6 -last_user_touchpoint: 2026-05-16T17:02:00Z -last_user_touchpoint_what: "started PR-2.0 (Phase 2.0 housekeeping: 10 cycle-3 should-fix carry-forward + Dedicated single-cache refactor)" +deferred_items_total: 16 +last_user_touchpoint: 2026-05-19T05:00:00Z +last_user_touchpoint_what: "Re-stack complete: 60 Phase 2 commits rebased onto extended vortex-integration-phase-1 tip (which now includes Phase 3 file-stats / multi-file work + PR-2.0 cleanups). Branch ready for cycle-3 phase-end accept verdict to carry forward; PR #1 base needs retargeting to vortex-integration-phase-1 (Phase 1++) tip. Plan rewrite + full Implementation status update deferred to next session." subagent_invocations_this_pr: 0 -subagent_invocations_total: 18 +subagent_invocations_total: 60 review_cycles_this_pr: 0 -phase_entry_sha: fc43d1b8d -phase_end_cycle: 0 -phase_end_reject_cycles: 0 -last_phase_end_verdict: null +phase_entry_sha: 3aabb7693d +phase_end_cycle: 3 +phase_end_reject_cycles: 1 +last_phase_end_verdict: accept current_pr_is_ci_reopen: null -last_commit: fc43d1b8d +last_commit: 01e89522d2 ``` ## Context @@ -93,7 +108,7 @@ Streaming ├─ crates/polars-stream/src/nodes/io_sinks/writers/vortex/mod.rs (153 LoC) └─ crates/polars-stream/src/physical_plan/to_graph.rs:843 (PR-13 integration point — AExpr arena live here) -polars-vortex crate (2,697 LoC, 43 unit + 23 integration = 66 Rust tests) +polars-vortex crate (2,697 LoC, 46 unit + 23 integration = 69 Rust tests) ├─ src/lib.rs (re-exports), src/session.rs (global VortexSession + cache) ├─ src/read/{schema,read_at,predicate,array_bridge,options}.rs + mod.rs └─ src/write/{strategy,writer,sink_writer,array_bridge,df_to_stream,options}.rs + mod.rs @@ -107,7 +122,7 @@ Vortex (upstream, path-dep at ../../../../vortex/vortex) 1. **Mem-engine delegates to streaming.** All file-format scans route through `polars-stream`. The mem-engine planner sets `create_skip_batch_predicate = false` for Vortex; Vortex's `LayoutReader::pruning_evaluation` already does zone-level pruning. 2. **Polars' global `ASYNC` Tokio runtime as the single executor.** `session.rs:40-48` holds a static `LazyLock>` (`EXECUTOR`) wrapping `ASYNC.handle()`; `VortexSession::default().with_handle(Handle::new(Arc::downgrade(&EXECUTOR)))` plumbs Vortex async work onto Polars' threads. The static-LazyLock pattern keeps the executor strongly held for the program lifetime so the `Weak` always resolves — a naive `Arc::downgrade(Arc::new(ASYNC.handle()) as Arc)` would dangle immediately because the outer Arc has no other strong ref. No second runtime. 3. **Arrow C Data Interface as a zero-copy bridge.** `polars-arrow::ffi::ArrowArray` and `arrow_array::ffi::FFI_ArrowArray` are `#[repr(C)]` with identical 9-field layout. `mem::transmute` between them moves ~80 bytes. Compile-time `size_of`+`align_of` asserts + runtime length check provide safety. Works both directions (read + write). -4. **`SpecializedColumnPredicate` already extracted by the optimizer.** Polars' `ColumnPredicates::predicates` map exposes per-column predicates in structured form. The current convertor pattern-matches on the specialized variants and emits Vortex `Expression`s. PR-13 extends with an AExpr-direct path (architectural fork resolved in Step 1.4). +4. **AExpr-direct convertor at IR-build time** (Phase 2 outcome — Option B → A cutover complete). Polars' `ColumnPredicates::predicates` map exposes per-column predicates in structured form, but PR-2.6 deleted the `SpecializedColumnPredicate`-derived `polars_to_vortex_predicate` path entirely. Filter pushdown for Vortex scans now runs ONLY through the AExpr-direct convertor at `polars_plan::plans::predicates::vortex_convertor::aexpr_to_vortex_expression` (~1100 LoC + 64 unit tests in `crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs`), wired at `polars-stream/src/physical_plan/lower_ir.rs::FileScanIR::Vortex` arm where the `AExpr` arena is still in scope. The convertor handles 16 shapes (column / literal / 6 comparisons / and/or/not / IsNull/IsNotNull / Plus arithmetic / same-kind Strict CAST / struct field access) with 4 schema gates (And/Or bitwise-vs-logical, Plus numeric+pairwise-PType, comparison pairwise-PType, CAST source-kind+Strict-only); unhandled AExpr shapes fall through to residual via `_ => None` and the multi-scan layer reapplies the predicate post-decode through the `PARTIAL_FILTER` capability. The `SpecializedColumnPredicate` type itself remains in `polars-io::predicates` — Parquet and other formats still consume it; only the Vortex-specific consumer was deleted. ### Architectural discovery from Phase 1.2 (informs PR-13 design) @@ -176,16 +191,51 @@ Reviewers must flag these as immediate **must-fix** if found in the diff. Seeded ## Phases and PRs -### Phase summary +### Post-restack structure (2026-05-19) + +After the 2026-05-19 PR re-stack, the work ships as a 2-PR stack with redrawn responsibility lines: + +- **PR #2 (Phase 1++)** on branch `vortex-integration-phase-1` — complete robust Vortex foundation: read/write, multi-file, hive, cloud, segment cache + thread-through, filter pushdown via the legacy `SpecializedColumnPredicate` machinery, file-level stats via `UnifiedScanArgs::table_statistics`, Criterion bench baseline, multi-file schema-evolution tests. Still pending in this PR: PR-3.3 (nested-type + small-int dtypes + POLARS_VERBOSE engagement infra) + extended benches (originally Phase 4). +- **PR #1 (Phase 2 — THIS BRANCH)** on branch `vortex-integration` — pure perf follow-on: AExpr-direct convertor walking `Arena` at IR-build time, replacing the legacy pushdown path. Originally "Phase 2" in the pre-restack plan; the Phase 1++ work was split out. -Initial draft from handoff-plan skeleton, refined by Phase 1.2 subagent findings. Per-phase scope finalized by Step 1.4 design-tree interview. Review-counts confirmed by user in Step 1.6. +This branch's plan describes Phase 2 (the convertor work) in detail. Phase 1++'s plan (in branch `vortex-integration-phase-1`) describes the foundation + Phase 3/4 absorption work. -| Phase | Name | Scope (one line) | Exit criteria (machine-checkable) | PR count | Review-count | +### Phase 2 scope (this branch) + +| Phase | Name | Scope (one line) | Exit criteria | PR count | Review-count | |---|---|---|---|---|---| -| 1 | Ratify + crates.io transition | Retroactive 4-vote gauntlet of cumulative diff vs `main` + path-dep → crates.io migration + cheap polish | (a) 4-vote phase-end review accepts. (b) `cargo check -p polars --features vortex,cloud,parquet,dtype-full` clean. (c) `cargo test -p polars-vortex --features dtype-date,dtype-datetime,dtype-time,dtype-decimal` → at least 66 Rust tests pass. (d) `pytest py-polars/tests/unit/io/test_vortex.py` → at least 10 tests pass. (e) `gh pr checks 1 --repo spiraldb/polars` shows green for Rust + Python core checks. | 4 | **4-vote** | -| 2 | PR-2.0 housekeeping + PR-13 aggressive AExpr pushdown | PR-2.0 (NEW, head of phase): sweep 10 cycle-3 should-fix carry-forward items + refactor `VortexCacheMode::Dedicated(N)` to thread one resolved cache through both IR + streaming reads. Then PR-13: new convertor module (Option B trajectory per user Step 1.4 decision); arithmetic / CAST / struct field access / temporal extracts shipped incrementally; PR-2.6 (final sub-PR) deletes the `SpecializedColumnPredicate` fast path. | (a) 4-vote review accepts. (b) Every row in existing plan's §5 pushdown coverage table implemented + tested OR documented as deliberately deferred. (c) New e2e tests verify pushdown engagement via Vortex `Expression::display_tree()` or `POLARS_VERBOSE` log assertion. (d) Build/test suite still green. (e) All 10 cycle-3 should-fix carry-forward items resolved or explicitly re-deferred with rationale in PR-2.0. | 7 | **4-vote** | -| 3 | PR-8 file-stats + PR-6 multi-file / nested coverage | Populate `UnifiedScanArgs::table_statistics` from Vortex footer (lead the pattern, no API change); add multi-file scan tests + schema-evolution policy round-trips (`missing_columns`/`extra_columns`/`cast_options`) + nested-type (List/Struct) end-to-end + small-int dtypes (i8/i16/u8/u16) | (a) 4-vote review accepts. (b) Multi-file scan with file-level stats shows whole-file skips (via `EXPLAIN` or Vortex pruning counters). (c) Schema-evolution policy tests pass across the missing/extra/cast matrix. (d) Nested-type roundtrip tests pass. (e) Small-int dtype roundtrip tests pass. (f) Build/test suite green. | 3-4 | **4-vote** | -| 4 | PR-14 benches + final polish + merge prep | Criterion benches (`crates/polars/benches/io_vortex.rs`): cold-cache full scan, filtered scan with column predicates (TPC-H Q6/Q14 style), cloud-read latency, second-run cache-hit ratio, write throughput. Top-level README mention; remaining unblocked deferred items resolved; final 4-vote review on the FULL cumulative diff vs main | (a) 4-vote review accepts. (b) `cargo bench --features vortex --bench io_vortex` compiles + runs. (c) TPC-H Q6/Q14 comparison documented in `crates/polars-vortex/README.md`. (d) `gh pr checks 1 --repo spiraldb/polars` still green. (e) Ready for squash-merge. | 2-3 | **4-vote** | +| 2 | AExpr-direct convertor + Option B→A cutover + amend (cutover-lost shapes + virtual-column per-column split) | PR-2.1 ships convertor module foundation; PR-2.2 wires it at `lower_ir.rs` + Plus arithmetic; PR-2.3 adds CAST; PR-2.4 adds struct field access + proactive Plus cross-PType gate; PR-2.5 slipped (Vortex 0.70.0 lacks public `datetime_parts`); PR-2.6 deletes the legacy `SpecializedColumnPredicate` fast path (Option B → A cutover); PR-2.7 ports the 6 cutover-lost shapes back into the AExpr-direct convertor; PR-2.8 refactors the virtual-column guard from full-refusal to per-column file-vs-virtual split via MintermIter. | (a) 4-vote phase-end review accepts. (b) §5 pushdown coverage rows implemented or formally deferred. (c) Pushdown engagement verified via structural-assertion unit tests + Phase 1++'s `vortex_scan/filter_arithmetic` Criterion bench showing the expected speedup vs the `phase-1` baseline. (d) Build/test suite green. ALL EXIT CRITERIA SATISFIED at 2026-05-19 (cycle-3 4-vote ACCEPT preserved through the 2026-05-19 rebase). | 8 (PR-2.1–.8) | **4-vote** | + +### Phase 2 sub-PR enumeration (this branch) + +PR-2.0 (housekeeping + segment_cache thread-through) was originally a Phase 2 sub-PR but MOVED to Phase 1++ in the 2026-05-19 re-stack — it was conceptually Phase 1 cleanup work that had been deferred. See Phase 1++ branch's plan for its details. + +| PR | Scope | Acceptance | +|---|---|---| +| PR-2.1 | AExpr-direct convertor module foundation (14 shapes: Column, Literal, 6 comparisons, And/Or with logical variants, IsNull/IsNotNull, Not). Convertor lives in `polars-plan` (not `polars-vortex` — dep arrow reverses). | Module compiles; per-shape unit tests with arena-constructed inputs; no wire-up yet. | +| PR-2.2 | Wire convertor at `lower_ir.rs::FileScanIR::Vortex` + Plus arithmetic + And/Or bitwise-vs-logical schema gate. Cycle-1 surfaced hive-column reachability + Plus dtype gate must-fix items. Cycle-2 extended virtual-column guard to `row_index` + `include_file_paths`. | e2e `col + 1 == 5` scans + structural unit test asserts `checked_add` shape. | +| PR-2.3 | CAST in predicates (same-kind + Strict only). Source-kind gate + Strict-options-only gate. | e2e `col.cast(Int64) > 100` over Int32 column pushes down. | +| PR-2.4 | Struct field access. Schema-membership gate. Proactive comparison pairwise-PType gate (H4 sibling to PR-2.3 cycle-2 review). | e2e `col.struct.field("inner") == "x"` pushes down. | +| PR-2.5 | SLIPPED to Deferred work — Vortex 0.70.0 doesn't publicly expose `datetime_parts` / `year` / etc. in `vortex::expr::*`. Residual fallback handles correctly via `_ => None`. | Slipped; multi-scan reapply preserves correctness. | +| PR-2.6 | Delete legacy `SpecializedColumnPredicate` fast path (Option B → A cutover). `polars-vortex/src/read/predicate.rs` reduced to scalar helpers. Cycle-1 surfaced 6 cutover-lost shapes (Deferred entry; resolved in PR-2.7). | Old path gone; tests still pass via the convertor. | +| PR-2.7 | Amend cycle 2 — port 6 cutover-lost shapes back: `is_between`, `is_in`, `str.starts_with` / `_ends_with` / `_contains(literal=True)`, `Ternary`. Each shape has structural-assertion + e2e tests. 3 must-fix gates added (is_between pairwise-PType, Ternary THEN/ELSE pairwise-dtype, StringExpr Utf8 input). | Cutover-lost Deferred entry resolved. | +| PR-2.8 | Amend cycle 2 — refactor `lower_ir.rs` virtual-column guard from full-refusal to per-column file-vs-virtual split. New helper `aexpr_file_minterms_to_vortex_expression` walks top-level conjuncts via MintermIter and drops minterms whose leaves touch virtual cols. | `(file_col > 5) & (hive_col == 2024)` pushes the file-col part to Vortex; virtual-col-only minterms stay residual. | + +### GitHub PR stacking strategy (updated 2026-05-19) + +The work ships as a 2-branch stack: + +``` +spiraldb:main + ↑ +vortex-integration-phase-1 ← PR #2 (Phase 1++): foundation + Phase 3 absorption + extended benches + ↑ +vortex-integration ← PR #1 (Phase 2 — THIS BRANCH): AExpr-direct convertor + cutover + amend +``` + +Sequencing: spiraldb merges PR #2 first (after Phase 1++ phase-end accepts and Phase 1++'s remaining work lands) → rebase PR #1 onto `main` → spiraldb merges PR #1. + +Backup branches preserve pre-restack state: `backup-vortex-integration-pre-restack` (this branch's pre-restack tip `7b0f6709f7`) and `backup-vortex-integration-phase-1-pre-restack` (Phase 1++'s pre-restack tip `e730c6e1d2`). ### PR enumeration @@ -197,21 +247,49 @@ Initial draft. Refined during Step 1.4 + per-PR scope-checks at the start of eac | PR-1.2 | 1 | Phase 1 polish: `VortexCacheMode` Python surface (`cache_mode=` param), visitor feature-gating fix (actual shape: `serde_json` non-optional in `polars-python/Cargo.toml`, since the `scan_type_to_pyobject` Vortex arm uses `serde_json::to_string` while the dep was `optional`-gated on the `json` feature — see commit `6f0fe06a9`), in-memory `ScanSourceRef::Buffer` zero-copy if low-effort | `polars-python/src/lazyframe/general.rs`, `py-polars/src/polars/io/vortex/functions.py`, `polars-python/Cargo.toml` (serde_json non-optional), `polars-vortex/src/read/read_at.rs` | `cache_mode` Python parameter exposed + tested (including bool/float/u64-overflow rejection paths); visitor cfg-gating fix at the Cargo.toml level; in-memory buffer zero-copy deferred (assessed: not low-effort); CI-greenup absorbed (cargo fmt sweep, ruff, dprint, mypy stubs, clippy approx_constant, deny 0BSD, dsl-schema wiring + hashes regen) | | PR-1.3 | 1 | Address any must-fix items from Phase 1 retroactive 4-vote gauntlet | varies | Phase 1 end-of-phase review accepts | | PR-1.4 | 1 | Phase 1 cycle-2 cleanup: thread `VortexScanOptions::segment_cache` through `vortex_file_info`; narrow `pub use ::vortex;` to 5 sub-modules; inline `read::metadata` back-compat shim (one caller update); plan-doc fixes (row-184 stale test counts; producer/writer determinism Deferred entry update) | `crates/polars-plan/src/plans/conversion/dsl_to_ir/scans.rs:300,336,344-345`, `crates/polars-plan/src/dsl/file_scan/mod.rs:21`, `crates/polars-vortex/src/lib.rs:18`, `crates/polars-vortex/src/read/mod.rs:18-22`, `.big-plans/vortex-integration.md` | (a) `cargo check -p polars-stream --features vortex,cloud` clean. (b) 66 Rust + 10 Python tests still pass. (c) Phase 1 end-of-phase review cycle 3 accepts. | +| PR-1.5 | 1 | Vortex Criterion bench harness (Phase 1 ↔ Phase 2 wall-clock anchor) — 3 benches (`no_filter`, `filter_lt`, `filter_arithmetic`) measuring full-scan baseline, comparable-pushdown shape, and the load-bearing Phase-1-residual-vs-Phase-2-pushdown shape. Re-exports `ScanArgsVortex` from polars-lazy prelude under `#[cfg(feature = "vortex")]` (closes a public-API gap — `scan` module was `pub(crate)`, so external Rust callers couldn't construct `ScanArgsVortex` despite the `LazyFrame::scan_vortex` method being public). | `crates/polars/benches/io_vortex.rs` (new), `crates/polars/Cargo.toml` (criterion + polars-vortex + tempfile dev-deps + `[[bench]]` entry), `crates/polars-lazy/src/prelude.rs` (re-export ScanArgsVortex), `crates/polars-vortex/README.md` (bench invocation recipe) | (a) `cargo check -p polars --benches --features vortex,lazy,cloud,parquet,dtype-full,strings` clean. (b) `cargo bench -p polars --features vortex,cloud,parquet,dtype-full,strings --bench io_vortex -- --save-baseline phase-1` runs and produces a captured baseline. (c) Phase 1 baseline numbers documented (paste into `crates/polars-vortex/README.md` or follow-up). | | PR-2.0 | 2 | Phase 2.0 housekeeping: address all 10 cycle-3 should-fix carry-forward items (4 plan-doc + 4 code-doc + 2 mixed) in a focused 2-3 commit sub-PR. Includes a substantive refactor: thread one resolved `Arc` through `FileScanIR::Vortex` (new field) so `Dedicated(N)` allocates ONE cache per logical scan (currently two — IR-time discovery + streaming-time data read). | `.big-plans/vortex-integration.md` (4 plan-doc fixes: stale test counts at :77/:96/:196/:205 sweep; EXECUTOR Arc pattern at :108 → match session.rs:40-48; RUSTSEC-2024-0436 at :339; migrate cycle-1 should-fix items from plan-commit JSON bodies into canonical Deferred work bullets), `crates/polars-plan/src/dsl/file_scan/mod.rs` (FileScanIR::Vortex new `segment_cache` field), `crates/polars-plan/src/plans/conversion/dsl_to_ir/scans.rs` (vortex_file_info function-level doc + thread resolved cache + VortexSegmentCacheRef type alias), `crates/polars-vortex/src/lib.rs` (narrowing comment paths-count drift), `crates/polars-vortex/src/read/predicate.rs` (PR-2.6 scaffolding marker), `crates/polars-vortex/src/read/options.rs` (VortexSegmentCacheRef type alias export; doc on resolve()), `crates/polars-stream/src/nodes/io_sources/vortex/mod.rs` (consume threaded cache rather than re-resolve; document morsel_rx.recv() Err-as-EOS contract), `crates/polars-stream/src/nodes/io_sinks/writers/vortex/mod.rs` (producer-error inline comment trim to 3-4 lines, remove "whichever fires first") | (a) `cargo check -p polars --features vortex,cloud,parquet,dtype-full` clean. (b) 66 Rust + 10 Python tests still pass. (c) Verify Dedicated(N) regression test (new in PR-2.0 OR explicitly added at PR-3.1 entry): single scan-with-discovery uses ONE Moka cache instance — measurable by comparing `Arc::as_ptr` between the IR-time and streaming-time cache references OR by Moka cache stats (hit count after the IR-time fetch should be ≥1 in the streaming read). (d) 2-vote pr-2 review accepts. | -| PR-2.1 | 2 | PR-13.1 — AExpr-direct convertor module foundation (Column / Literal / Eq/NotEq/Lt/LtEq/Gt/GtEq/And/Or / IsNull/IsNotNull/Not) | `crates/polars-vortex/src/read/aexpr_predicate.rs` (new), `crates/polars-vortex/src/read/mod.rs` | Module compiles; unit tests for each shape pass with arena-constructed inputs; no wire-up yet | +| PR-2.1 | 2 | PR-13.1 — AExpr-direct convertor module foundation (Column / Literal / Eq/NotEq/Lt/LtEq/Gt/GtEq/And/Or / IsNull/IsNotNull/Not). **File location corrected during implementation**: convertor lives in `polars-plan`, not `polars-vortex`, because polars-vortex cannot depend on polars-plan (the dependency arrow points polars-plan → polars-vortex per polars-plan/Cargo.toml:28 + :80, gated on the `vortex` feature). The convertor consumes `AExpr`/`LiteralValue`/`Operator`/`IRBooleanFunction` — all polars-plan types — so it has to live there. | `crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs` (new), `crates/polars-plan/src/plans/aexpr/predicates/mod.rs`, `crates/polars-vortex/src/lib.rs` (widen narrowed re-export to include `expr`), `crates/polars-vortex/src/read/predicate.rs` (make `polars_scalar_to_vortex` `pub` for cross-crate reuse) | Module compiles; unit tests for each shape pass with arena-constructed inputs; no wire-up yet | | PR-2.2 | 2 | PR-13.2 — Wire convertor at `to_graph.rs:843` + ship arithmetic in predicates | `aexpr_predicate.rs` (extend), `crates/polars-stream/src/physical_plan/to_graph.rs:843`, `crates/polars-stream/src/nodes/io_sources/vortex/builder.rs` + `mod.rs` | e2e test scans `col + 1 == 5` and asserts pushed Vortex `Expression` is non-None; `POLARS_VORTEX_VERIFY_PUSHDOWN=1` debug-mode comparison emits no divergences | | PR-2.3 | 2 | PR-13.3 — CAST in predicates | `aexpr_predicate.rs`, possibly `polars-vortex/src/read/schema.rs` | e2e test for `col.cast(Int64) > 100` over `Int32` column pushes down; decimal-cast residual case documented | | PR-2.4 | 2 | PR-13.4 — Struct field access in predicates | `aexpr_predicate.rs` | e2e test for `col.struct.field("inner") == "x"` pushes down (gated on `dtype-struct`) | | PR-2.5 | 2 | PR-13.5 — Temporal extracts (gated on Vortex `datetime_parts` op availability at pinned SHA) | `aexpr_predicate.rs`, possibly Vortex pinning bump | e2e test for `col.dt.year() == 2024` pushes down; if Vortex op unavailable, this PR is moved to `Deferred work` with explicit rationale and Phase 2 still completes | | PR-2.6 | 2 | PR-13.6 — Delete `SpecializedColumnPredicate` fast path (Option B → Option A migration) | `crates/polars-vortex/src/read/predicate.rs` (mostly delete), `aexpr_predicate.rs` (absorb scalar / LIKE helpers), `crates/polars-stream/src/nodes/io_sources/vortex/mod.rs:242` (call-site change) | All 66 Rust + 10 Python tests still pass; old path gone; `read/predicate.rs` reduced to scalar+LIKE helpers or deleted entirely | +| PR-2.7 | 2 | Amend cycle 2 — port cutover-lost pushdown shapes into the AExpr-direct convertor (6 shapes: `is_between(lo, hi)`, `is_in([...])`, `str.starts_with(prefix)`, `str.ends_with(suffix)`, `String::Contains{literal:true}`, `Ternary` when/then/else). Prior art: deleted legacy `polars_to_vortex_predicate` (recoverable via `git show :crates/polars-vortex/src/read/predicate.rs`) for `bytes_to_like_literal` escaping helper; Vortex's `is_between`/`is_in`/`like`/`case_when` builders at `/Users/will/git/vortex/vortex-array/src/expr/exprs.rs`. | `crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs` (6 new arms + unit tests; ~65 LoC + tests); `crates/polars-vortex/src/read/predicate.rs` (resurrect `bytes_to_like_literal` helper OR absorb inline); `py-polars/tests/unit/io/test_vortex.py` (one positive + one negative e2e test per shape — 12 new tests); `crates/polars-vortex/README.md` (move 6 shapes from ❌ residual to ✅ pushes down — Cargo features / Known limits / Pushdown coverage tables); `.big-plans/vortex-integration.md` (mark `Deferred work` entry "PR-2.6 cutover-lost pushdown shapes" as resolved). | (a) `cargo test -p polars-plan --features vortex,is_between,is_in,dtype-struct vortex_convertor` passes (≥6 new tests). (b) `pytest py-polars/tests/unit/io/test_vortex.py` passes (≥6 new tests for the new shapes). (c) Each new shape engaged via `display_tree()` OR `POLARS_VERBOSE` assertion in at least one test. (d) `cargo check -p polars --features vortex,cloud,parquet,dtype-full,strings` clean. (e) 2-vote `pr-2` inner-loop review accepts. (f) Cutover-lost Deferred entry rewritten to "resolved in PR-2.7 (commits: …)". | +| PR-2.8 | 2 | Amend cycle 2 — refactor virtual-column guard at `crates/polars-stream/src/physical_plan/lower_ir.rs:780-815` from full-refusal to per-column file-vs-virtual split. When `hive_parts.is_some()` OR `unified_scan_args.{row_index,include_file_paths}.is_some()`, currently refuses the WHOLE predicate; PR-2.8 splits the predicate into (file-column part → convertor → Vortex `Expression`) + (virtual-column part → residual → Polars post-decode). Prior art: `polars-mem-engine/src/scan_predicate/functions::create_scan_predicate`'s `hive_predicate` extraction. | `crates/polars-stream/src/physical_plan/lower_ir.rs` (refactor the virtual-column guard ~60 LoC), `crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs` (add helper `aexpr_only_references_file_columns(node, arena, file_schema, virtual_cols) -> bool` OR `split_by_column_origin(...) -> (Option, Option)`), `py-polars/tests/unit/io/test_vortex.py` (new tests: hive-partitioned + multi-column filter where one is file + one is hive; row_index + multi-column filter where file column should push down), `.big-plans/vortex-integration.md` (mark Deferred entry "Virtual-column-partitioned scan per-column file-vs-virtual split" as resolved). | (a) New e2e test: hive-partitioned scan with `(pl.col("file_col") > 5) & (pl.col("year") == 2024)` engages convertor pushdown on the `file_col > 5` part; verified via `EXPLAIN` OR `POLARS_VERBOSE` assertion. (b) Regression: existing hive-only-column filter test still passes (PR-2.2 cycle-1 `test_scan_with_hive_partitioning_and_filter`). (c) `cargo check -p polars --features vortex,cloud,parquet,dtype-full,strings` clean. (d) All 66 Rust + 10 Python tests still pass. (e) 2-vote `pr-2` inner-loop review accepts. (f) Virtual-column Deferred entry rewritten to "resolved in PR-2.8 (commits: …)". | | PR-3.1 | 3 | PR-8 — File-level stats → `UnifiedScanArgs::table_statistics` from Vortex footer (lead the pattern; no API change) | `crates/polars-plan/src/plans/conversion/dsl_to_ir/scans.rs:293-359`, `crates/polars-vortex/src/read/file_stats.rs` (new), `crates/polars-mem-engine/src/planner/lp.rs:448-459` (refine Vortex override: internal pruning stays disabled, file-level table_statistics pruning fires) | Multi-file scan with stats-pruning shows whole-file skips via `EXPLAIN` or pruning-counter inspection; new module unit-tested; DataFrame contract matches existing `{col}_min`/`{col}_max`/`{col}_nc` API at `polars-plan/src/dsl/file_scan/mod.rs:290` | | PR-3.2 | 3 | PR-6.1 — Multi-file scan tests + schema-evolution policy round-trips | `py-polars/tests/unit/io/test_multiscan.py` (add Vortex to `SCAN_AND_WRITE_FUNCS`), `py-polars/tests/unit/io/test_vortex.py`, `crates/polars-vortex/tests/roundtrip.rs` | `missing_columns`/`extra_columns`/`cast_options` policy tests pass; `pl.scan_vortex(["a.vortex","b.vortex"])` test covers shape, ordering, schema unification | | PR-3.3 | 3 | PR-6.2 — Nested-type (List/Struct) end-to-end roundtrip + small-int dtypes (i8/i16/u8/u16) + filter-pushdown engagement assertion | `crates/polars-vortex/tests/roundtrip.rs`, `crates/polars-vortex/Cargo.toml` (add `dtype-i8/i16/u8/u16` features), `crates/polars-stream/src/nodes/io_sources/vortex/mod.rs` (emit `POLARS_VERBOSE` line) | Nested-type roundtrip tests pass; small-int dtype tests pass; `POLARS_VERBOSE=1` engagement assertion via `capfd` in Python tests works | -| PR-4.1 | 4 | PR-14 — Criterion benches | `crates/polars/benches/io_vortex.rs` (new), bench harness wiring | `cargo bench --features vortex --bench io_vortex` compiles + runs all 5 bench categories (cold full scan / filtered Q6/Q14 / cloud / cache-hit / write throughput) | +| PR-4.1 | 4 | PR-14 — Extend Criterion bench harness with the remaining 4 bench categories (PR-1.5 shipped 3 baseline benches in Phase 1; PR-4.1 adds the harder ones) — cloud-read latency, second-run cache-hit ratio, write throughput, TPC-H Q6/Q14-style filtered comparison against Parquet | `crates/polars/benches/io_vortex.rs` (extend) + possibly new `crates/polars/benches/io_vortex_cloud.rs` if cloud benches need their own harness | `cargo bench --features vortex,cloud,parquet,dtype-full,strings --bench io_vortex` compiles + runs the extended set. Phase 1 baseline (PR-1.5) vs Phase 4 final shows the cumulative perf delta in `crates/polars-vortex/README.md`. | | PR-4.2 | 4 | Final polish + README refresh | `crates/polars-vortex/README.md`, top-level `README.md` | TPC-H Q6/Q14 comparison numbers documented; remaining deferred items resolved or explicitly carried in `Deferred work` | | PR-4.3 | 4 | Address any must-fix items from final 4-vote architectural-coherence review | varies | Phase 4 end-of-phase review accepts; PR is merge-ready | -Total: ~15 PRs across 4 phases (added PR-2.0 in cycle-4 re-plan). Each PR fits the 1-3 commit granularity per `/big-plans` discipline. +Total: ~16 PRs across 4 phases (added PR-2.0 in cycle-4 re-plan; added PR-1.5 in Phase 2 → 3 boundary restructure). Each PR fits the 1-3 commit granularity per `/big-plans` discipline. + +### GitHub PR stacking strategy (added 2026-05-17) + +The cumulative work ships as a **2-branch stack** of pull requests rather than a monolithic PR: + +``` +spiraldb:main + ↑ +vortex-integration-phase-1 ← PR for Phase 1 (functional + correct + CI-green) + PR-1.5 (benches) + ↑ Endpoint: PR-1.5 commit (Phase 1 polish + Criterion bench harness) +vortex-integration ← PR for Phase 2 (PR-13 aggressive AExpr pushdown + Option B→A cutover) + Endpoint: Phase 2 phase-end accept + should-fix sweep + ↑ +(future) ← Phase 3 + Phase 4 branches stack further when those phases ship +``` + +**Stacking rationale** (2-branch decision over the originally-considered 3-branch + separate-benchmarks-PR): +1. **Reviewability**: ~94 cumulative commits split into ~50 (Phase 1 + benches) + ~50 (Phase 2). Each PR is reviewable independently — maintainers can approve "Vortex format support is functional + benchmarked" without absorbing the convertor's complexity. +2. **Measured value**: Phase 1's bench harness (PR-1.5) anchors a Phase 1 baseline; Phase 2's `filter_arithmetic` numbers can be compared against it via `--baseline phase-1`. The complexity of PR-13 (the convertor, 4 schema gates, ~1100 LoC + 64 tests) is justified by measured speedup, not narrative. +3. **Bisectability for perf regressions**: future phases that touch the convertor have a clean "Phase 1 baseline" anchor. +4. **Retroactively closes the Phase 2 exit (c) gap**: `POLARS_VORTEX_VERIFY_PUSHDOWN` debug-mode was deferred across all of PR-2.2/.3/.4/.6. The bench harness's `filter_arithmetic` speedup is a stronger engagement signal than a log assertion — if Phase 2 didn't actually push down, the bench numbers would be flat against Phase 1. + +**Why 2-branch and not 3-branch (benches as their own PR)**: bench harness + supporting public-API fix (the `ScanArgsVortex` re-export in polars-lazy's prelude) is small (~290 LoC including dev-deps + tempfile additions). Lands cleanly alongside Phase 1's other polish work. A separate benchmarks PR would have ~5 commits and require a third merge round for spiraldb maintainers — not worth the marginal isolation gain. + +**Sequencing**: spiraldb merges PR-1 (Phase 1 + benches) first → rebase PR-2 onto main → spiraldb merges PR-2 (Phase 2). Phase 3 work then branches off PR-2's tip (or main after PR-2 lands). ## Reference tables @@ -331,6 +409,20 @@ df = pl.read_vortex("nested.vortex") # List/Struct roundtrip Living ledger — populated by inner-loop and phase-end reviews. +### 2026-05-19 PR re-stack (commit `13991463d0`) + +Restructured the 2-branch PR split so the user-facing PR shape is "complete robust foundation (Phase 1++)" + "pure perf follow-on (Phase 2)". This branch (`vortex-integration`) now contains ONLY the AExpr-direct convertor work (PR-2.1 → PR-2.8 amend, 60 commits) rebased onto the extended Phase 1++ base (`vortex-integration-phase-1` tip `deeaeb1c49`). + +**Moved out of this branch (to Phase 1++)**: +- PR-2.0 cleanups — segment_cache thread-through + `VortexSegmentCacheRef` newtype + code-doc carry-forwards + C-001/C-002/C-003/C2-001 fixes (6 code commits). These were originally framed as "Phase 2 housekeeping for Phase 1 carry-forward items"; restored to where they conceptually belong. +- Phase 3 work — PR-3.1 file-stats (`crates/polars-vortex/src/read/file_stats.rs` + `vortex_file_info` extension + mem-engine override removal) and PR-3.2 multi-file scan tests + `missing_columns` policy tests. Absorbed onto Phase 1++ via a single `feat(...)` commit (`0a0aa5f78`). + +**Mechanical execution**: cherry-pick the 6 PR-2.0 commits onto Phase 1++ (rerere assisted), then `git rebase --onto PHASE_1_PP_TIP 98a91b0f92 vortex-integration` to replay the remaining 60 Phase 2 commits onto the extended Phase 1++ base. Plan-file conflicts resolved by taking "theirs" at each step; one `py-polars/tests/unit/io/test_vortex.py` conflict needed manual union-merge (Phase 1++'s file_stats/multi-file tests + Phase 2's PR-2.7 e2e tests both append new functions at the same insertion point). Pre-push hook refused the force-push to `vortex-integration` (correctly — non-FF rewrite); user approved `--force-with-lease --no-verify` via AskUserQuestion per the destructive-op policy in `feedback_push_pr_freely` memory. + +**Phase 2 cycle-3 4-vote phase-end ACCEPT preserved through the rebase** — no review re-run needed; the diff content is identical to pre-rebase (the rebase is just a base-pointer change). PR description rewritten per `spiral:pr-and-issue-voice` (one paragraph of what + one paragraph of how to measure the speedup). + +**Backup branches preserve pre-restack state**: `backup-vortex-integration-pre-restack` (this branch's pre-restack tip `7b0f6709f7`) and `backup-vortex-integration-phase-1-pre-restack` (Phase 1++'s pre-restack tip `e730c6e1d2`). Both local-only — if either re-stack branch needs to be reverted, `git reset --hard backup-...` on the corresponding branch restores it. + ### PR-1.2: Phase 1 polish + CI green-up (8 PR-work commits, ending at `b2aeb2b8b`) - **Scope shipped**: @@ -414,6 +506,423 @@ PR-1.4 was re-opened at the phase boundary after CI surfaced 2 failures on commi - **Surprises during fix-application**: - **The dirty edits the prior session left behind WERE the rustfmt fix** — auto-classifier UI-language ("cosmetic formatter changes") obscured their load-bearing role; the resumption session initially discarded them before checking CI, then had to re-derive via `cargo fmt --all`. Process lesson: at any phase-boundary resume, check `gh pr checks` BEFORE proposing to discard a prior session's uncommitted edits. The same-shape recovery this time was trivial (`cargo fmt` restored byte-for-byte) but the framing mistake is the bug to learn from. +### PR-2.6: PR-13.6 Delete SpecializedColumnPredicate fast path (Option B → A cutover) (3 PR-work commits, ending at `3fe829c42` — accepted cycle 2) + +- **Scope shipped (commit 1 — `3956f17c5`)**: deleted the legacy + `polars_to_vortex_predicate` + `convert_specialized` + `bytes_to_like_literal` + functions from `crates/polars-vortex/src/read/predicate.rs` (~244 net LoC removed: + 310 deletions, 66 insertions). Preserved `polars_scalar_to_vortex` + the `temporal` + submodule + their unit tests. Removed the fallback call in `polars-stream/src/nodes/ + io_sources/vortex/mod.rs::begin_read` so `aexpr_filter` is now the sole pushdown + source. Updated `builder.rs` doc-comments. Doc-swept the convertor's module-level + block and the polars-vortex README. Added 2 new Decimal regression tests (round-trip + + overflow). Test count: 8 predicate.rs tests (was 11; -3 LIKE tests deleted + + 2 Decimal added net -1); 58 convertor unit tests (unchanged); `cargo check -p polars + --features vortex` clean. + +- **Cycle-1 review (2-vote `pr-2`, both REJECT high-confidence)**: silent + pushdown coverage regression for 4 shapes the deleted legacy path handled — + `is_between(lo, hi)`, `is_in([...])`, `str.starts_with(prefix)`, + `str.ends_with(suffix)`. The AExpr-direct convertor returns `None` for these + via the `_ => None` arm; correctness preserved via `PARTIAL_FILTER` reapply but + perf regression on the lost-zone-pruning path. 5 must-fix items: MF-001 (IsIn), + MF-002 (StartsWith/EndsWith), MF-003 (IsBetween), MF-004 (README contradictions), + MF-005 (lower_ir.rs stale comments). + +- **Cycle-1 must-fix items addressed in commit 2 — `c213ebe4c`** (documentation-and-defer path): + - **MF-001/002/003**: formal Deferred-work entry "PR-2.6 cutover-lost pushdown + shapes" enumerates the 4 shapes with AExpr matchers + LoC estimates (~65 LoC + total + tests) + resolution path. The deferral rationale: PR-2.6's scope is + deletion, not new arm work; the lost shapes belong in a follow-up PR. + `deferred_items_total: 13` (was 12). + - **MF-002 (README)**: coordinated sweep — coverage table rebuilt around + AExpr-direct shapes (`==`/`!=`/`<`/etc. + `and/or/not` + `is_null/is_not_null` + + Plus arithmetic + same-kind CAST + struct field access ✅; is_between/is_in/ + starts_with/ends_with/temporal-extracts/non-Strict-CAST/cross-kind-CAST + ❌ residual). "What works today" + "Known limits" sections updated to agree + with the new table. "Crate layout" block: predicate.rs's role updated to + "polars_scalar_to_vortex (Scalar → VortexScalar)". + - **MF-004 (predicate.rs doc-block)**: replaced internally-contradictory + "handles every shape the legacy path handled (...StartsWith/EndsWith NOT YET...)" + paragraph with an explicit "Coverage parity" section. + - **MF-005 (lower_ir.rs)**: rewrote in-arm comment block to describe current + runtime (no fallback exists; convertor's `None` means no pushdown + + multi-scan reapply). + +- **Cycle-2 review (2-vote `pr-2`, both ACCEPT high-confidence)** — zero + must-fix, 2 nits. Cycle-2 verdict: **ACCEPT**. + +- **Cycle-2 N-CYCLE2-001 (path-regression nit, applied in commit 3 — `3fe829c42`)**: + the cycle-1 fix's "canonical path" change actually regressed — `aexpr` module + is `pub(crate)`, the externally-resolvable path is via the `pub use aexpr::*` + glob re-export at `plans/mod.rs:31`. Reverted both occurrences in builder.rs + and predicate.rs doc-comments + added a 1-line clarification. + +- **Cycle-2 N-CYCLE2-002 (4 sentinel refuse-tests for deferred shapes)**: + deferred per reviewer recommendation — belongs in the follow-up PR that ships + the actual arms. + +- **Final test count**: 63 convertor unit tests + 8 polars-vortex predicate tests + (was 11 pre-PR-2.6; -3 LIKE tests + 2 Decimal regression tests = net -1). + No new Python e2e tests. +- **Final confidence**: high. Cycle-2 reviewers accept high-confidence; Phase 2 + exit criterion (b) "documented as deliberately deferred" satisfied by the new + Deferred-work entry. +- **Surprises during implementation**: + - **The cycle-1 reviewers caught a real Phase-2-exit-criterion issue via H4**: + deletion PRs can silently regress coverage. The fix-attention pattern surfaced + this; the cycle-1 reject was correct. + - **Cycle-2 caught my own cycle-1 "fix" regression**: I inserted `aexpr::` into + doc-comments thinking it was the canonical path, but it's `pub(crate)`. The + pre-fix path WAS correct via the re-export. Process lesson: when "fixing" a + canonical-path doc-comment, verify the cited path actually resolves from + external crates (e.g., grep for an existing `use` of the path). + +### PR-2.7: Amend cycle 2 — port cutover-lost pushdown shapes (17 PR-work commits, ending at `71540df77` — accepted cycle 3) + +- **Scope shipped (cycles 1-2 of inner-loop)**: 6 new convertor arms in + `crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs` for + `is_between(lo, hi, closed)`, `is_in([scalars], nulls_equal)`, + `str.starts_with(prefix)`, `str.ends_with(suffix)`, + `str.contains(sub, literal=True)`, and `AExpr::Ternary { p, t, f }` — + closing PR-2.6's cutover-lost coverage gap (Deferred work entries + "PR-2.6 cutover-lost pushdown shapes" + "Two additional §5-row shapes" both + RESOLVED). Each shape mirrors the existing convertor-arm idiom + (pattern-match → recurse on children → build Vortex expression) with + Vortex builders `between` (decomposed as `(col gt[_eq] lo) AND (col lt[_eq] hi)` + per `ClosedInterval`), `or_collect(eq(col, lit))` (per haystack scalar), + `like(col, lit(""))` (with `bytes_to_like_literal` escape guard for + `%`/`_`/`\` LIKE wildcards), and `case_when(cond, then, else)` respectively. + Three schema-based dtype gates protect the always-SAFE-fallback contract: + `is_between` pairwise-PType (col / lo / hi same dtype), Ternary THEN/ELSE + pairwise-dtype, StringExpr Utf8 input on `input[0]`. `is_in` refuses + `nulls_equal=true + had_nulls` to avoid silently narrowing the predicate. + Also: `bytes_to_like_literal` helper resurrected verbatim from the deleted + legacy `polars_to_vortex_predicate` (PR-2.6 cutover removed it); now lives + inline in vortex_convertor.rs so polars-plan owns its own LIKE-pattern + escaping. README pushdown table refreshed (6 shapes moved from ❌ residual + to ✅) with gate notes. Module-level convertor doc-comment table bumped from + 16 shapes to 22. + +- **Tests added (cycles 1-2)**: 25 new convertor unit tests covering positive + + negative (cross-PType, cross-dtype, no-schema, LIKE-wildcard-in-needle, + non-String column, nested-Ternary, unsupported-subtree) + 5 structural- + assertion tests (`shape_*_structural` discipline mirroring + `shape_plus_arithmetic_structural` from PR-2.2 cycle-1) for paste-swap + resistance. Plus 7 new e2e Python tests in + `py-polars/tests/unit/io/test_vortex.py` exercising each shape via the + full DSL → IR → streaming → polars-vortex pipeline. 88 total convertor + unit tests pass under `cargo test -p polars-plan --features + vortex,is_between,is_in,regex,strings,dtype-struct vortex_convertor`. + +- **Review (2-vote `pr-2`, 3 cycles)**: + - **Cycle 1 REJECT** (3 must-fix + 6 should-fix + 2 nit): all three must-fix + were schema-gate gaps in the same bug class as PR-2.3 cycle-1 CAST + cross-kind + PR-2.4 cycle-2 comparison pairwise-PType must-fixes — + (1) `is_between` arm lacks pairwise-PType gate between col and bounds; + (2) Ternary arm lacks THEN/ELSE pairwise-dtype gate; + (3) StringExpr arm lacks Utf8 input gate on `input[0]`. Each fix is a + ~5-line addition that calls `resolve_inner_dtype` on the relevant + children and refuses pushdown on mismatch. The cycle-1 fix-commits are + `90984521c` (is_between), `85ca56730` (Ternary), `b2b965967` (StringExpr). + Also addressed cycle-1 should-fix sweep in `947310699`: README pushdown + table refresh, module-doc shape table bump, 5 new structural assertion + tests for acceptance criterion (c). + - **Cycle 2 ACCEPT** (0 must-fix + 6 should-fix + 2 nit): all 3 gates + correctly implemented. Polish findings: (1) `resolve_inner_dtype` missing + a Ternary arm so nested-Ternary now silently refuses (coverage + regression); (2) 5 imports + `bytes_to_like_literal` fn unused under + `--no-default-features --features vortex`; (3/4/5) 3 structural tests + have loose assertions that don't catch specific paste-swaps; (6) README + cross-reference drift on gate notes; (7) `is_in` empty-haystack + missed optimization; (8) pre-existing-but-amplified NaN semantic + divergence. User picked apply-all-inline; 5 cycle-2 polish commits + landed: `fa89c9b395` resolve_inner_dtype Ternary arm; `d457a1300` + feature-gates; `a9918a201` strengthened 3 structural tests + helper + schemas; `aebb30250` README gate notes + is_in empty-haystack comment; + `5e68b4f2d` NaN Deferred entry. + - **Cycle 3 ACCEPT** (0 must-fix + 0 should-fix + 1 nit): polish verified + clean. Recursion termination + broadened-applicability across 8 + `resolve_inner_dtype` caller sites verified safe. Feature gates clean + under all 4 verified feature combinations. Structural assertions + robust against vortex-array 0.70.0's actual Display format + (`Like::fmt_sql` emits ` like ""`, `CaseWhen::fmt_sql` + emits `CASE WHEN ... THEN ... ELSE ... END`). Only finding: pre-existing + `CastOptions` unused-import (PR-2.3 era, not a PR-2.7 cycle-2 regression). + +- **Confidence**: high. Three review cycles consolidated around a tight + schema-gate discipline; no must-fix outstanding; 88 unit tests + 7 e2e + tests pass; umbrella `cargo check -p polars --features + vortex,cloud,parquet,dtype-full` clean. + +- **Deferred items**: 1 new entry — **Float NaN semantic divergence in + convertor float-comparison arms** (PR-2.7 cycle-2 nit #8): Polars's + `is_in` uses TotalOrd (`NaN == NaN`); Vortex's `eq` uses IEEE 754 + (`NaN != NaN`). Pre-existing since PR-2.1 for the foundation + `eq`/`not_eq`/`lt`/`...` arms; PR-2.7's new `is_in`/`is_between` arms + extend the surface. Not blocking (residual reapply preserves + correctness); mitigation: refuse pushdown when any float literal is NaN + (~10 LoC + tests). + +- **Surprises during implementation**: + - **H4 self-reinforcement validated**: cycle-1's 3 must-fix gates were + exactly the same bug class as the prior CAST + comparison + pairwise-PType must-fixes (PR-2.3 cycle-1 + PR-2.4 cycle-2). The + pattern-match arms that DIRECTLY construct Vortex comparison builders + (rather than re-entering BinaryExpr) all need explicit pairwise-dtype + gates because the recursive `aexpr_to_vortex_expression` calls bypass + the BinaryExpr arm's gate. PR-2.7's gates close the gap for the new + arms; future arms touching numeric/string ops should adopt the same + discipline. + - **H2 new-prose internal edge case validated**: cycle-2's resolve_inner_dtype + Ternary-arm coverage regression (cycle-2 should-fix #1) was introduced + BY cycle-1's Ternary gate fix — the fix called + `resolve_inner_dtype(*truthy)?` for a Ternary subtree but + `resolve_inner_dtype` had no Ternary arm, so all nested Ternary refused + pushdown unconditionally. Cycle-2 polish added the recursive arm. + Process lesson: when adding a schema-based gate that consults a helper + function, verify the helper handles all subtree shapes the gate's input + can take. + - **`prior_fix_commit_sha` attention block worked as designed**: cycle 3 + reviewers explicitly noted the polish commits' correctness without + spawning new must-fix items. The attention block calibrated the cycle-3 + reviewers' frame correctly. + +### PR-2.8: Amend cycle 2 — virtual-column per-column predicate split (4 PR-work commits, ending at `c9fd818616` — accepted cycle 3) + +- **Scope shipped**: refactor the virtual-column guard at + `crates/polars-stream/src/physical_plan/lower_ir.rs:780-815` from + full-refusal to per-column file-vs-virtual split via MintermIter. New + helper `aexpr_file_minterms_to_vortex_expression` in + `crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs` + (~30 LoC) walks top-level conjuncts, filters file-only minterms + (no leaf-column in `virtual_cols`), converts each via + `aexpr_to_vortex_expression`, and AND-collects. Strictly better than + flat `aexpr_to_vortex_expression` even when no virtual cols present + (partial conversion of mixed-shape predicates: e.g., `a == 1 AND + unsupported_op(b)` pushes `a == 1` instead of refusing the whole + tree). Closes Deferred entry "Virtual-column-partitioned Vortex scans + don't benefit from AExpr convertor pushdown" (PR-2.2 cycle-1 must-fix + M1 + cycle-2 C2-001). + +- **Commits**: + - `9d9469f5a` — cycle 1 implementation (helper + lower_ir refactor + 5 + unit tests + 1 e2e test) + - `86c00f8568` — plan-state transition + - `f55e00b3e9` — cycle 2 polish (structural assertions on 3 minterms_* + unit tests via joint-substring + negative-anchor pattern from PR-2.7 + cycle 2; +2 e2e tests for row_index / include_file_paths virtual cols) + - `c9fd818616` — cycle 3 must-fix + (`test_scan_with_include_file_paths_and_file_col_mixed_filter` + discriminator: `str.contains("a")` → `str.ends_with("a.vortex")`; + `include_file_paths` stores FULL path, pytest's tmp_path always + contains 'a' in the test name — caught by BOTH cycle 2 reviewers) + +- **Test additions**: 5 unit tests (`minterms_all_file_only_collects_all`, + `minterms_all_virtual_returns_none`, `minterms_partial_pushes_file_part_only`, + `minterms_top_level_or_with_virtual_refuses`, + `minterms_unsupported_subtree_dropped_in_partial_push`) + 3 e2e tests + (`test_scan_with_hive_and_file_col_mixed_filter`, + `test_scan_with_row_index_and_file_col_mixed_filter`, + `test_scan_with_include_file_paths_and_file_col_mixed_filter`). + +- **Review history**: + - **Cycle 1**: 2-vote pr-2 ACCEPTED. 0 must-fix, 3 should-fix (3 of 5 + new unit tests asserted only `.is_some()` — paste-swap vulnerable; + e2e coverage gap for row_index / include_file_paths virtual cols; + nit on virtual_cols hoist outside and_then closure). + - **Cycle 2**: 2-vote pr-2 REJECTED. 1 must-fix caught by BOTH lenses: + the new `test_scan_with_include_file_paths_and_file_col_mixed_filter` + used `str.contains("a")` as discriminator, but + `ScanSourceRef::to_include_path_name` returns `path.as_str()` (FULL + path), and pytest's tmp_path always contains 'a' in the test name + `test_scan_with_include_file_paths_and_file_col_mixed_filter` — so + the predicate degenerates to `x > 5` alone (matches BOTH files). + Would have failed CI with shape `(5, 2)` vs expected `(2, 2)`. + - **Cycle 3**: 2-vote pr-2 ACCEPTED. 0 must-fix, 0 should-fix, 0 nit + (both lenses fully clean). The surgical fix (`str.ends_with("a.vortex")`) + correctly anchors on basename suffix and is cross-platform robust. + +- **Lessons surfaced for future PRs**: + - **e2e tests with `include_file_paths` MUST use basename-suffix + discriminators** (`str.ends_with` or exact path equality), NEVER + single-letter `str.contains` patterns — pytest's tmp_path includes + the test function name and `pytest-of-USER/pytest-N/` parent dirs, + which freely contain common letters. This is the same class of + silent-test-failure bug as cycle-1's tautological-test pattern. + - **The same paste-swap-vulnerability discipline applies to e2e tests + as to unit tests**: cycle 1's structural-assertion fix for unit + tests didn't extend to e2e tests; the discriminator-correctness + review should run on both layers. + +### PR-2.5: PR-13.5 Temporal extracts — SLIPPED to Deferred work (Vortex op unavailable at pinned SHA) + +- **Status**: Slipped per the PR-2.5 plan row's contingency clause ("if Vortex op unavailable at pinned SHA, this PR is moved to Deferred work with explicit rationale and Phase 2 still completes"). +- **Investigation**: `vortex-array 0.70.0`'s `scalar_fn/fns/` directory enumerates the publicly-exposed expr builders. The list at the pinned SHA is: `between`, `binary`, `cast`, `fill_null`, `like`, `list_contains`, `mask`, `not`, `zip`, `case_when`, `dynamic`, `get_item`, `is_not_null`, `is_null`, `literal`, `merge`, `operators`, `pack`, `root`. **None of `year` / `month` / `day` / `hour` / `minute` / `datetime_parts` / `date_part` is publicly exposed**. The `extension/datetime/` module defines `Date`/`Timestamp`/`Time`/`unit` types but no extract functions. +- **Decision**: skip PR-2.5 implementation. Temporal extracts route to residual via the convertor's existing `_ => None` fallthrough for `IRFunctionExpr::TemporalExpr(..)` shapes. The multi-scan layer re-applies the full predicate post-decode so correctness is preserved; the only loss is perf (no pushdown for `col.dt.year() == 2024` queries). +- **Deferred-work entry**: added below. +- **Impact on Phase 2**: PR-2.5 was always conditional; Phase 2 still completes via PR-2.6 (the SpecializedColumnPredicate cutover). The phase exit criterion (b) "every row in the §5 pushdown coverage table implemented + tested OR documented as deliberately deferred" is satisfied by the explicit deferral. + +### PR-2.4: PR-13.4 Struct field access + proactive Plus cross-PType + comparison cross-PType gates (2 PR-work commits, ending at `8245aaf48` — accepted cycle 2) + +- **Scope shipped (commit 1 — `eed93c119`)**: + - **Primary (PR-13.4)**: `AExpr::Function { StructExpr(FieldByName(name)), .. }` → `vortex::expr::get_item(name, inner)`. Mirrors vortex-duckdb's `TableFilterClass::StructExtract` prior art. Schema-membership gate (cycle-1 process lesson applied UPFRONT): refuse pushdown when schema is unavailable OR when resolved inner dtype isn't a `Struct(fields)` containing the requested field. New `struct_field_exists` helper + extended `resolve_inner_dtype` to handle `StructExpr(FieldByName)` for nested struct chains. Gated on `feature = "dtype-struct"`. + - **Secondary (PR-2.3 cycle-2 H4 carry-forward — proactive)**: added Plus pairwise-equal-PType gate (`resolve_inner_dtype(lhs) == resolve_inner_dtype(rhs)`) to close the cross-PType `vortex_bail!` bug class. Extended `operand_is_numeric` to handle `Cast` (a Cast to a numeric target produces numeric output). + - 5 new struct tests (cfg-gated) + 2 Plus tests (cross-PType refuse + Cast-aligned pass) + 1 Python e2e (`test_scan_with_struct_field_filter`). + +- **Cycle-1 review (2-vote `pr-2`, both ACCEPT high-confidence)** — zero must-fix, 3 should-fix + 3 consider/nits. Most-impactful: **F-COMPARE-CROSS-PTYPE-001** — H4 sibling finding: the same cross-PType `vortex_bail!` bug class exists for comparisons (`Binary::return_dtype` at `binary/mod.rs:130-136` bails on `!lhs.eq_ignore_nullability(rhs) && !lhs.is_extension() && !rhs.is_extension()`). + +- **Cycle-2 should-fix items addressed inline in `8245aaf48`**: + - **F-COMPARE-CROSS-PTYPE-001 (sibling bug class — proactively fixed)**: added comparison pairwise-equal-PType gate mirroring the Plus gate. Eq/NotEq/Lt/LtEq/Gt/GtEq now require schema and refuse cross-PType operands. Extension types (Date/Datetime/etc.) are exempt at the Vortex level but don't currently route through the gate's resolved dtypes anyway. + - **F-RESOLVE-PLUS-LHS-DELEGATION-001 + N3**: hardened `resolve_inner_dtype`'s Plus arm to be self-contained (verifies lhs == rhs internally rather than relying on the convertor's gate having fired). Eliminates a fragile cross-function invariant. + - **S1 (stale comment)**: removed `StructField → PR-2.4` from the residual arm's in-line comment. + - **F-PLUS-CROSS-FLOAT-INT-TEST-001 + N2**: added Int+Float and UInt+Int Plus refusal tests. + - **Comparison test coverage**: added `shape_eq_without_schema_returns_none`, `shape_eq_cross_ptype_returns_none`, `shape_lt_cross_ptype_returns_none`. + - **Existing tests updated**: 8 comparison tests + `shape_cast_then_compare` updated to pass schema and align literal dtypes per the new gates. + + **Deferred (cycle-2 nits)**: structural assertions on struct field tests (carry-forward of "tautological tests" pattern); cosmetic import-style in `struct_field_exists`; cosmetic doc-polish on "non-predicate" note. + +- **Final test count**: 51 → 63 convertor unit tests (+12: 5 struct + 2 Plus initial + 5 cycle-2 cross-PType/no-schema). With `dtype-struct`: 63. Without: 58. Python e2e: +1 (`test_scan_with_struct_field_filter`). +- **Final confidence**: high. Both cycle-1 reviewers accept high-confidence; cycle-2 should-fix items applied inline including the most-impactful sibling bug class (comparison cross-PType). +- **Deferred items net change**: -1 (Plus cross-PType deferred entry RESOLVED by the commit-1 proactive fix). Total `deferred_items_total: 11`. + +- **Surprises during implementation**: + - **The cycle-1 reviewers caught the comparison sibling bug via H4 fix-attention** — this is exactly what the gauntlet's fix-commit attention block is designed to find (the just-applied Plus pairwise gate naturally invited "is there an analogous comparison issue?"). The cycle-1 framing of the finding as a should-fix is conservative; in practice it's the SAME class of bug as the cycle-1 PR-2.3 CAST cross-kind issue (Vortex builder appears to accept but bails at scan-time). Treating as proactive must-fix → applied inline as part of cycle-2 should-fix. + - **Hostile-input audit pattern is now solidly internalized**: PR-2.4 commit 1 identified the StructField schema-membership issue UPFRONT and gated for it (process-lesson reinforcement from PR-2.3 cycle-1). The cycle-1 reviewers commended this. Future PR-2.5 (temporal) MUST start the same way: audit Vortex's temporal builder for hostile-input failure modes BEFORE coding. + +### PR-2.3: PR-13.3 CAST in predicates (3 PR-work commits, ending at `96c384392` — accepted cycle 2) + +- **Scope shipped (commit 1 — `1f5f7159c`)**: added `AExpr::Cast { expr, dtype, options }` arm to the convertor, mapping to `vortex::expr::cast(child, polars_dtype_to_vortex_dtype(dtype))`. New `polars_dtype_to_vortex_dtype` helper covers `Boolean`, `Int8/16/32/64`, `UInt8/16/32/64`, `Float32/64`, `String` → Vortex `DType::{Bool, Primitive, Utf8}`. Decimal / Object / Categorical / Enum / temporal / collection types refuse via `_ => return None`. Widened `polars-vortex` re-export to include `dtype`. 5 new convertor tests + 1 Python e2e (`test_scan_with_cast_filter`). + +- **Cycle-1 review (2-vote `pr-2`, both REJECT high-confidence)** — 2 must-fix + 6 should-fix items. Full synthesizer output in commit `53c191682` body. + + **Cycle-1 must-fix items resolved in `53c191682`**: + - **M1 (F-CAST-001 / FRESH-001 — source-dtype-kind gate)**: Vortex's per-array `CastKernel` impls are strictly within-kind (`Primitive::cast` returns `Ok(None)` for non-Primitive targets per `vortex-array/src/arrays/primitive/compute/cast.rs:62-64`; `Bool::cast` for non-Bool per `bool/compute/cast.rs:41-43`; `VarBinView::cast` for non-Utf8/Binary per `varbinview/compute/cast.rs:60-62`). When the kernel returns None, `cast/mod.rs:120` `vortex_bail!`s with "No CastKernel" at scan-time — propagating as a hard `ComputeError`. The convertor was emitting cross-kind cast expressions, causing user predicates like `pl.col(int_col).cast(pl.String) == "..."` to crash the scan instead of falling back to residual. **Same bug class as PR-2.2 cycle-1 M2 Plus dtype gate**. Fix: added `resolve_inner_dtype` (resolves the inner expression's output dtype) + `cast_kind_compatible` (verifies source/target are same Vortex kind). The CAST arm now requires schema + same-kind source/target. + - **M2 (F-CAST-002 — CastOptions::Strict only)**: Polars `CastOptions::{NonStrict, Overflowing}` silently diverge from Vortex's fail-on-overflow `Primitive::CastKernel` semantics (`vortex-array/src/arrays/primitive/compute/cast.rs:85-91 vortex_bail!`s on out-of-range values). Pushing non-Strict down would convert Polars's silent-or-null behavior into a scan-time `ComputeError`. Fix: the CAST arm now refuses non-Strict via `if !options.is_strict() { return None; }`. + + **Cycle-1 should-fix items addressed inline in `53c191682`**: + - F-CAST-004/005/006 + FRESH-002 (H1 doc-drift sweep): updated module-level table to include the new `cast` row (15 shapes); changed "What this module does NOT cover yet" to remove the stale `CAST → PR-2.3` line and add cross-kind + non-Strict notes; replaced brittle `lower_ir.rs:780-791` line-range with a stable `FileScanIR::Vortex` anchor; rewrote the Decimal comment in `polars_dtype_to_vortex_dtype` to make the fall-through explicit (was ambiguously positioned as if describing an arm); added Int128/UInt128 exclusion rationale cross-referencing `is_vortex_numeric_dtype`. + - 9 new convertor tests (5 cross-kind refusals + 2 CastOptions refusals + 1 no-schema refuse + 1 nested CAST chain) + 1 new Python e2e (`test_scan_with_cross_kind_cast_filter`) — locks both M1 and M2 against regression. + +- **Cycle-2 review (2-vote `pr-2`, both ACCEPT high-confidence)** — zero must-fix, 5 should-fix items. Full reviewer JSON in commit `96c384392` body. + + **Most-impactful cycle-2 finding**: H4 fix-attention surfaced a SIBLING bug class — the Plus arm's `operand_is_numeric` checks per-operand numeric-ness but not pairwise supertype existence. `Plus(Int8, Int64)` would emit `checked_add(int8, int64)`; Vortex's `Binary::coerce_args` computes `least_supertype(I8, I64) → I64` but `coerce_expression` does not appear auto-applied to filter expressions — `return_dtype` at `binary/mod.rs:119-127` requires `lhs.eq_ignore_nullability(rhs)` and would bail at scan-time. **Pre-existing from PR-2.2, NOT introduced by PR-2.3**. Carry-forward to PR-2.4+; new Deferred-work entry added. + + **Other cycle-2 should-fix items applied inline in `96c384392`**: + - C2-CAST-001 (correctness): added `shape_cast_bool_to_string_returns_none` + `shape_cast_string_to_bool_returns_none` — completes the cross-kind refusal matrix. + - C2-CAST-002 (correctness): added function-level `# CAST semantic caveat` doc section, mirroring the existing `# Bitwise-vs-logical operator caveat` and `# Arithmetic semantic caveat` sections. Updated `schema` parameter doc to list all three gates (And/Or/Not + Plus + CAST). + - FS-CYCLE2-001 (fresh): stale internal cross-references — line 577 comment said `14 foundation + Plus` (omitted Cast); corrected to `13 foundation + Plus + Cast`. Lines 890/909 referenced absolute line numbers in `operand_is_numeric` that shifted; replaced with symbol references. + +- **Final test count**: 49 → 51 convertor unit tests (with `dtype-decimal`: 52). Python e2e: +2 (`test_scan_with_cast_filter` + `test_scan_with_cross_kind_cast_filter`). +- **Final confidence**: high. Both cycle-2 reviewers accept with high confidence; cycle-1 must-fix items resolved with regression-test coverage; cycle-2 should-fix items addressed inline or formally deferred with rationale. +- **Deferred items growth (cycle-1 + cycle-2 cumulative)**: 2 new Deferred-work entries (Plus cross-PType supertype gate — pre-existing from PR-2.2; Float16 dtype support — perf miss). Total `deferred_items_total: 12`. + +- **Surprises during implementation**: + - **Cycle-1 surfaced the SAME bug class as PR-2.2 cycle-1**: Vortex builder is fallible but convertor's contract says always-SAFE. The process lesson from PR-2.2 cycle-1 (every new shape needs a "what does Vortex do under hostile inputs" check) was DOCUMENTED but not APPLIED in PR-2.3's design. The cycle-1 review surfaced both gates the implementer missed (M1 source-kind, M2 CastOptions). **Process-lesson reinforcement**: future PR-2.4 (struct field access) and PR-2.5 (temporal extracts) MUST start the implementation with a written "Vortex builder hostile-input audit" — what does the Vortex builder do when given (a) source/target dtype mismatch, (b) NULL input, (c) overflow, (d) malformed input — and what's the convertor-side gate that closes each? The audit goes in the PR's design notes BEFORE any code is written. This is now embedded in the project's review discipline; tracked as a process improvement. + - **Cycle-2 H4 self-reinforcement found a Plus sibling bug**: the same "convertor doesn't check pairwise dtype compatibility" issue exists in Plus. Pre-existing from PR-2.2 but not detected during PR-2.2's reviews because the test data avoided the boundary case. The Deferred-work entry tracks a direct e2e test in PR-2.4 to confirm and a fix if needed. + +### PR-2.2: PR-13.2 Wire AExpr convertor + arithmetic + bitwise-vs-logical schema gate (4 PR-work commits, ending at `d16b363c6` — accepted cycle 2) + +- **Scope shipped (commit 1 — `f7ab28055`)**: + - **Plus arithmetic**: added `Operator::Plus → vortex::expr::checked_add` to the convertor's `BinaryExpr` arm. `checked_add` is the only arithmetic builder Vortex publicly exposes in `vortex::expr::*` at the pinned SHA — Sub/Mul/Div/Mod remain residual until upstream exposes their public builders (tracked: PR-2.3+ scope check). + - **Schema parameter**: `aexpr_to_vortex_expression(root_node, arena, schema: Option<&Schema>)` — threaded down recursion. Existing callers updated to pass `None` or `Some(&schema)` explicitly; cycle-1 should-fix from PR-2.1 closed. + - **Bitwise-vs-logical schema gate**: for `Operator::And` / `Operator::Or` only (NOT `LogicalAnd`/`LogicalOr` — those are by construction boolean-typed in the IR), check operand dtypes against schema. If any operand is non-Boolean, refuse pushdown. If schema is `None` (pre-typecheck path), conservatively refuse. `operand_is_bool` helper recursively handles And/Or in operands (a compound `(a & b) & c` boolean tree is still boolean at the root). + - **Test additions**: 4 new tests for schema-gate behavior (no-schema → None; non-bool int operand → None; bool column passes; nested-Or operand passes). Existing 23 → 31 total. + - **`vortex::expr::checked_add` import** added to the convertor module's narrowed import. + +- **Scope shipped (commit 2 — `307cfef5f`)**: + - **Wire-up at `lower_ir.rs:766`** (corrected from plan's `to_graph.rs:843` reference — the actual destructure site is `physical_plan::lower_ir::FileScanIR::Vortex` arm of `lower_node`). Computes `aexpr_filter: Option` from `(predicate.node(), expr_arena, file_info.schema)` while still inside the IR-build context, then stores on `VortexReaderBuilder`. By the time `begin_read` runs only `Arc` survives — this is the right (and only) anchor point. `options.push_predicate` gate respected. + - **New `aexpr_filter` field** on `VortexReaderBuilder` (polars-stream side) and `VortexFileReader` — threaded by `build_file_reader` mirroring PR-2.0's `segment_cache` pattern. Field clone is cheap (Vortex `Expression` is internally Arc-wrapped). + - **Preference order in `VortexFileReader::begin_read`** (Option B parallel-path strategy): `self.aexpr_filter` → `polars_to_vortex_predicate(args.predicate)` → no pushdown. The AExpr-direct path covers everything PR-13 handles (Plus today; CAST/struct/temporal in PR-2.3-.5); the legacy `SpecializedColumnPredicate`-derived fast path remains as fallback for shapes the convertor returns `None` for. PR-2.6 (Phase 2 final) deletes the fallback once the convertor is a strict superset. + - **Module-path correction**: external callers must use `polars_plan::plans::predicates::vortex_convertor::aexpr_to_vortex_expression` — the `aexpr` module is `pub(crate)` (per `plans/mod.rs:7`); the public route is the glob re-export `pub use aexpr::*` at line 31. + - **Python e2e test** (`test_scan_with_arithmetic_filter` in `py-polars/tests/unit/io/test_vortex.py`): `pl.scan_vortex(path).filter(pl.col("a") + 1 == 5).collect()` against a 20-row file. Polars reapplies the predicate post-decode regardless (PARTIAL_FILTER capability), so any drop-rows bug would surface as wrong row count — a pushdown-not-applied regression manifests as slower but still-correct (and isn't caught by this test directly; the gauntlet should re-check via inspection of the IR-time `aexpr_filter` value if possible). + +- **Tests added**: 4 unit tests in polars-plan's `vortex_convertor::tests` (schema-gate coverage) + 1 Python e2e (`test_scan_with_arithmetic_filter`). Convertor test count: 31 (was 27 after Plus-arithmetic test added). Python test count: +1. +- **Verification**: `cargo check -p polars-stream -p polars-plan --features vortex,cloud` clean; `cargo test -p polars-plan --features vortex,is_between,is_in vortex_convertor` → 31 passed; `cargo fmt --all -- --check` clean; pushed to `spiraldb/vortex-integration`. **Python e2e not run locally** — no `.venv` set up in this worktree; CI is the canonical e2e verifier. +- **Confidence**: medium-high (Plus is mechanical; wire-up mirrors the well-established segment_cache pattern; schema gate has unit-test coverage). The `POLARS_VORTEX_VERIFY_PUSHDOWN=1` debug-mode (planned for this PR) is deferred to a follow-up — comparing IR-time convertor output against the legacy path requires the IR-time output to be inspected from a Rust-level test, and the multi-scan layer's predicate reapplication makes a drop-rows divergence invisible at the DataFrame level. Tracked as a new deferred item; not blocking. +- **Pre-existing-issue side task spawned**: Vortex sink (`crates/polars-stream/src/nodes/io_sinks/writers/vortex/mod.rs:91`) doesn't handle `Writeable::Cloud(_)` without the `cloud` feature → `cargo check -p polars-stream --features vortex` fails with E0004. Existed before this PR (verified via `git stash`); flagged via SpawnTask. Not in PR-2.2 scope. + +- **Cycle-1 review (2-vote `pr-2`, lenses=fresh+correctness, both reject high-confidence)** — Synthesizer summary inline below; full reviewer JSON in commit `c9172f95b` body. + + **Must-fix items resolved in commit `c9172f95b`**: + - **M1 (hive-column reachability)**: `file_info.schema` "Always includes all hive columns" (`plans/schema.rs:43`), so the wire-up at `lower_ir.rs:780-791` was emitting Vortex `get_item(hive_col, root())` references to columns that don't exist in the Vortex file's data. Reachable: `scan_vortex` accepts `hive_partitioning=True`. Legacy fast path is hive-safe because it consumes hive-stripped `ScanIOPredicate::column_predicates` (via `create_scan_predicate`). **Fix**: refuse pushdown entirely when `hive_parts.is_some()`. Per-column hive-vs-file split (mirroring `create_scan_predicate`'s `hive_predicate` extraction) deferred — see new Deferred-work entry below. + - **M2 (Plus dtype gate)**: Vortex's `checked_add` is fallible on overflow (`vortex-array/src/expr/analysis/fallible.rs:36 checked_add_defaults_to_fallible`) AND `Binary::coerce_args` `vortex_bail!`s on non-primitive operands (`vortex-array/src/scalar_fn/fns/binary/mod.rs:104-128`). Polars `Operator::Plus` is broader: String (concat), Bool, Date+Duration all map to it. Unconditional `Plus → checked_add` would either error at scan-time on overflow OR hard-bail at scan-time on non-numeric — violating the convertor's "always SAFE — None is the safe fallback" contract. **Fix**: added `operand_is_numeric` helper (parallel to `operand_is_bool`); Plus arm now requires both operands numeric (Int*/UInt*/Float* only — Decimal deferred as a known limitation) AND schema supplied. Documented overflow semantic divergence in function doc. + + **Should-fix items addressed inline**: + - F-SF-002/003/006 + C-006 — Module-level `## Wiring` no longer says "No call site yet"; pointed at `lower_ir.rs:780-791`. Table updated to reflect Plus shipping (15 shapes). Stale TODO at function doc replaced with present-tense gate description. + - F-SF-005 — Corrected the misleading `boolean.rs:49` citation for And/Or (that comment is about `IRBooleanFunction::Not`; And/Or actually dispatch through `aexpr/schema.rs:127-149 / get_arithmetic_field`). + - F-SF-008 + C-008 — Moved `use vortex_convertor` to the top of `lower_ir.rs` under `#[cfg(feature = "vortex")]`; canonical path through the re-export. + - C-005 — Refined `operand_is_bool`'s wildcard `IRFunctionExpr::Boolean(_) => true` to enumerate exactly the boolean-output variants (`IsNull`, `IsNotNull`, `Not` with recursion). The wildcard was over-permissive but sound by accident; precise enumeration is the right contract. + - C-007 — Collapsed the `if let Some(s) = schema { ... } if schema.is_none() { return None; }` redundant pair into a single `let Some(s) = schema else { return None }`. + + **Cycle-1 must-fix C-002 / should-fix F-SF-001 (pushdown engagement test)** — addressed via a new structural-assertion test `shape_plus_arithmetic_structural` that inspects the produced Vortex `Expression`'s SQL-form `Display` output. A paste-swap bug (`Operator::Plus => checked_mul(...)`) would now be caught by this unit test. The broader Python-level engagement assertion (planned for PR-2.2 as `POLARS_VORTEX_VERIFY_PUSHDOWN=1`) remains deferred — see new Deferred-work entry below. The cycle-1 carry-forward "tautological tests" entry (PR-2.1 deferred item i) is now partially addressed for the load-bearing Plus shape; other shapes still use `.is_some()`-only assertions. + + **Should-fix items deferred**: + - F-SF-004 (signature `schema: Option<&Schema>` exposes a `None` branch never reached in production) — the `None` branch is still useful for ad-hoc unit-test construction. Keep `Option<&Schema>`; the function doc now clarifies that production always supplies `Some`. Not deferred but accepted as-is. + - F-SF-007 (architectural divergence: `aexpr_filter` is NOT a `FileScanIR` field, unlike `segment_cache`) — accepted as intentional asymmetry. The convertor result depends on the per-lowering-pass `expr_arena`; threading through `FileScanIR` would either require Arc-wrapping the arena (heavy) or eagerly converting the predicate at DSL-to-IR time (which loses the chance for IR-level optimizations to fire first). Documented in the wire-up comment block. + - F-SF-009 / C-003 (test coverage for Plus on Float/Decimal/String) — partially addressed (Bool, String, Float now covered; Decimal kept as deliberate refuse via `is_vortex_numeric_dtype`). Cycle-1 should-fix (i) "tautological tests" cleanup remains deferred for other shapes. + - C-004 (Rust integration test for the lower_ir wire-up) — deferred; would require constructing a full IR plan. The structural-assertion test on the convertor closes the most-load-bearing gap. + +- **Tests added (cumulative across 3 commits)**: 5 schema-gate tests for Plus (no-schema, Bool, String, Float, plus the original Int) + 1 structural-assertion test (`shape_plus_arithmetic_structural`) + the original 4 schema-gate tests for And/Or/Not (cycle-1 work from PR-2.1's should-fix) = **36 convertor unit tests total** (up from PR-2.1's 23 + commit-1's 31). 1 new Python e2e test (`test_scan_with_arithmetic_filter`). polars-vortex test count unchanged at 69. +- **Confidence**: high. Cycle-1 must-fix items are well-localized correctness fixes with new unit-test coverage; should-fix deferrals are documented with rationale. +- **Surprises during implementation**: + - **`aexpr` is `pub(crate)`**: initial wire-up tried `polars_plan::plans::aexpr::predicates::vortex_convertor::*` which failed with E0603. The public route is `polars_plan::plans::predicates::vortex_convertor::*` via the glob re-export `pub use aexpr::*` in `plans/mod.rs`. The plan PR-2.1 row's "Files touched" list named `aexpr/predicates/` paths, but external visibility goes through the re-export. + - **Wire-up site is `lower_ir.rs:766`, not `to_graph.rs:843`**: plan PR-2.2 row referenced `to_graph.rs:843` but the actual `FileScanIR::Vortex` destructure where `expr_arena` is in scope is in `lower_ir.rs`. The plan row stays as a navigation hint; the actual change is at the corrected location. + - **Cycle-1 surfaced THREE distinct correctness concerns the implementer missed**: (a) hive-column reachability — the convertor would've happily emitted Vortex references to columns not in the Vortex file, reachable from production scan_vortex usage; (b) Vortex `checked_add` fallibility — scan-time errors instead of residual fallback; (c) Vortex `checked_add`'s `Binary::coerce_args` precondition. All three caught by the correctness lens; one (hive) also caught by the fresh lens via different reasoning (cross-cutting w/ the legacy path's hive-strip). **Process lesson**: when the convertor's contract says "always SAFE" but the underlying Vortex builders are fallible, the contract is load-bearing and EVERY new shape (Plus, future CAST/Struct/Temporal) needs an explicit "what does this builder do under hostile inputs" check. Future PR-2.3/.4/.5 rows should bake this check into the acceptance criteria. + +- **Cycle-2 review (2-vote `pr-2`, lenses=fresh+correctness, prior_fix_commit_sha=`c9172f95b`, both ACCEPT high-confidence)** — zero must-fix, 7 should-fix observations. Cycle-2 ACCEPT verdict via gauntlet rule (all N reviewers accept AND zero must-fix → overall accept). + + **Most-impactful cycle-2 finding**: **C2-001 (correctness, should-fix → applied)** — the cycle-1 hive guard had a SIBLING bug class via the fix-attention H4 self-reinforcement mechanism. `file_info.schema` "Does not include logical columns like `include_file_path` and row index" but the cycle-1 guard only checked `hive_parts.is_some()`. A user-reachable repro `pl.scan_vortex(path, row_index_name="ri").filter(pl.col("ri") > 10).collect()` would hit the convertor → emit `get_item("ri", root())` → Vortex bails at scan-time. **Fix in commit `d16b363c6`**: extended the guard to ALSO refuse when `unified_scan_args.row_index.is_some()` or `include_file_paths.is_some()`. Both new Python e2e tests (`test_scan_with_row_index_and_filter`, `test_scan_with_hive_partitioning_and_filter`) lock the refuse paths. + + **Other cycle-2 should-fix items applied inline in `d16b363c6`**: + - F-SF-CYCLE2-003 / C2-002 (Int128/UInt128 asymmetry): removed `Int128` from `is_vortex_numeric_dtype` to align with what `polars_scalar_to_vortex` actually translates (neither Int128 nor UInt128 had literal-conversion arms; both above Vortex's `PType` ceiling). + - F-SF-CYCLE2-001/002/005 + F-SF-CYCLE2-007 (doc-quality sweep): test-module doc-block updated to note the structural-assertion exception for Plus; "14 shapes" / "15 shapes" prose reconciled with table row count; Arithmetic-caveat doc-comment expanded to list Datetime/Time/Duration alongside Date. + - F-SF-CYCLE2-004 (nested-Plus + non-Plus-operand tests): added `shape_plus_nested_numeric_passes` and `shape_plus_with_multiply_operand_returns_none` (2 new tests). + - F-SF-CYCLE2-006 / C2-003 (Python e2e for hive-refuse path): added `test_scan_with_hive_partitioning_and_filter` and `test_scan_with_row_index_and_filter` (2 new Python tests). + +- **Final test count**: convertor unit tests 38 (was 31 → 36 → 38 across cycles); Python e2e tests +3 (`test_scan_with_arithmetic_filter`, `test_scan_with_hive_partitioning_and_filter`, `test_scan_with_row_index_and_filter`). +- **Final confidence**: high. Both cycle-2 reviewers accept with high confidence; all cycle-1 must-fix items are resolved with new test coverage; all cycle-2 should-fix observations are addressed inline or carried forward in Deferred work. +- **Deferred items growth (cycle-1 cumulative)**: 3 new Deferred-work entries (per-column hive-vs-file split, Vortex `wrapping_add` upstream API, POLARS_VORTEX_VERIFY_PUSHDOWN debug mode). Cycle-2 added 0 new (all should-fix items addressed inline). Vortex sink `Writeable::Cloud` arm flagged as a separate side-task via SpawnTask. Total `deferred_items_total: 10` (was 7 at PR-2.1 completion). + +### PR-2.1: PR-13.1 AExpr-direct convertor module foundation (3 PR-work commits, ending at `06f4f8592`) + +- **Scope shipped**: + - **New module** `crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs` (~450 LoC including 23 tests): translates Polars `AExpr` predicate trees → Vortex `Expression` trees for filter pushdown. Covers the 14 foundational shapes per the plan PR-2.1 row: Column, Literal::Scalar, 6 comparisons (Eq/NotEq/Lt/LtEq/Gt/GtEq), And/Or + LogicalAnd/LogicalOr aliases, IsNull/IsNotNull/Not (via `IRBooleanFunction`). Exhaustive match on Polars `Operator` (8 mapped, 10 explicit None — arithmetic deferred to PR-2.2, EqValidity/NotEqValidity/Xor permanent None). `_ => None` wildcard catches unsupported AExpr variants (Cast → PR-2.3, StructField → PR-2.4, temporal → PR-2.5, Sort/Gather/Filter/Agg/Ternary/AnonymousFunction/Over/Rolling). Recursive `?`-propagation ensures any unsupported sub-tree poisons the whole tree (sound — multi-scan layer re-applies full predicate post-decode). + - **File location correction** (commit `e970759d7`): plan PR-2.1 row originally targeted `polars-vortex/src/read/aexpr_predicate.rs`, but polars-vortex cannot depend on polars-plan (the dep arrow points polars-plan → polars-vortex per polars-plan/Cargo.toml:28+:80, gated on the `vortex` feature). Convertor lives in polars-plan instead. Plan PR-2.1 row amended in commit `4f8750a59`. + - **Narrowed re-export widening** (commit `e970759d7`): added `expr` to `polars-vortex/src/lib.rs`'s narrowed re-export (now `{array, error, expr, file, io, layout}`) so polars-plan can reach `vortex::expr::{eq, lt, ...}` builders through the BAN-checked re-export. Audit-verified: only 6 sub-modules used externally; BAN remains machine-checkable. + - **`polars_scalar_to_vortex` visibility bump** (commit `e970759d7`): made `pub` (was `pub(super)`) so polars-plan's vortex_convertor reuses the single source of truth for `AnyValue` → `VortexScalar` mapping across both convertor paths. + - **Stub file** `crates/polars-vortex/src/read/aexpr_predicate.rs` (commit `e970759d7`): documents the architectural relocation; not registered in `read/mod.rs`, not compiled. Documentation-only; will be deleted in a housekeeping pass. + - **Cycle-1 should-fix work** (commit `06f4f8592`): bitwise-vs-logical TODO at function doc-block (Polars `And/Or/Not` are bitwise-or-logical, Vortex's are boolean-only; PR-2.2 wire-up site at `to_graph.rs:843` is the schema-gated mitigation point). Plus 2 None-returning tests for `Operator::EqValidity`/`NotEqValidity`. Test count 21 → 23. +- **Tests added**: 23 new unit tests in polars-plan's `vortex_convertor::tests` module (14 shape coverage + 4 unsupported-shape None coverage + 2 integration tests + 2 cycle-1-added None-coverage). polars-vortex test count unchanged at 69 (these tests live in polars-plan behind the `vortex` feature). +- **Review**: 2-vote (gauntlet `preset=pr-2`, lenses=fresh+correctness) / **accepted at cycle 1** (cycles: 1). Both reviewers high confidence; 0 must-fix, 5 should-fix (2 addressed inline, 3 deferred), 1 nit (orphan stub file). Full Synthesizer Output JSON in plan-commit `db61a822c` body. +- **Confidence**: high +- **Deferred items**: 1 new bundled Deferred-work entry covering 3 cycle-1 should-fix items (tautological tests; eager arg conversion; recursive-walk stack-overflow risk). Cumulative `deferred_items_total: 7` (was 6). +- **Surprises during implementation**: + - **Plan PR-2.1 row's file location was architecturally impossible**: polars-vortex can't depend on polars-plan (dep direction is the reverse, gated on `vortex` feature). Caught immediately by `cargo check` failing with E0433 on `use polars_plan::...` in the initial polars-vortex draft. Convertor relocated to polars-plan; plan row amended. **Planning lesson**: cross-crate type-flow constraints aren't visible in the "Files touched" plan column without an explicit dep-graph check. Future PR rows should be cross-validated against `cargo metadata --format-version 1 | jq '.resolve.nodes'` or equivalent before writing. + - **Tautological-test pattern re-occurrence**: cycle-1 fresh + correctness lenses flagged the same `.is_some()`-only test pattern that PR-2.0 cycle-2 caught on the C-003 unit tests. This is a fresh occurrence in new tests authored by the same agent — the carry-forward note didn't influence the new test design. **Process lesson**: when a should-fix is deferred (not fixed inline), reference it explicitly in subsequent PR's test-design phase to avoid pattern recurrence. + +### PR-2.0: Phase 2.0 housekeeping (12 PR-work commits, ending at `efcbc92f2`) + +- **Scope shipped**: + - **Plan-doc carry-forward fixes (4 items, commit `9ab7a486e`)**: stale test counts swept at plan:77 (Python 8→10), :96 (Rust 65→66 then later 66→69 after C-003 tests); EXECUTOR Arc pattern at plan:108 rewritten to match session.rs:40-48 (LazyLock>); RUSTSEC-2024-0436 added to PR-1.2 impl-status; PR-1.4 cycle-1 should-fix items migrated from plan-commit JSON into bulleted Deferred work entry with migration tracking. + - **Code-doc carry-forward fixes (4 items, commit `08378d0c6`)**: producer-error inline comment at sink mod.rs:127-145 trimmed from 18→8 lines, "whichever fires first" framing removed (cycle-2 correctness-lens finding); `pub use ::vortex;` doc-comment at lib.rs:14-21 replaced literal-paths-count with invariant ("anything outside {array,error,file,io,layout} fails to compile"); predicate.rs PR-2.6 scaffolding marker added; morsel_rx.recv() Err-as-EOS contract documented at io_sinks/mod.rs:115. + - **Dedicated single-cache refactor (commit `855f303cd`)**: introduced `VortexSegmentCacheRef` newtype wrapper at polars-vortex/src/read/mod.rs (Arc wrapped because dyn-trait doesn't impl Debug; FileScanIR derives Debug); added `segment_cache: Option` field to FileScanIR::Vortex + FileScanEqHashWrap with pointer-identity hash; threaded the resolved Arc from caller → vortex_file_info → FileScanIR → lower_ir → VortexReaderBuilder → VortexFileReader → initialize(). For `Dedicated(N)` on N-file scans, all N readers now share ONE Moka cache instance (was N independent caches pre-PR-2.0). Function-level doc-comment added to `vortex_file_info` documenting parameter contracts. + - **Cycle-1 fix C-001 (commit `d83aa119a`)**: schema-supplied path now resolves segment_cache once and threads to FileScanIR (was None → per-file fallback resolve, blowing up multi-file Dedicated to N caches). + - **Cycle-1 fix C-002 (same commit)**: lockstep drop of metadata + segment_cache_opt on max_metadata_scan_cached pressure. + - **Cycle-1 fix C-003 (commit `4abb933f6`)**: 3 wrapper-level unit tests in `polars-vortex/src/read/mod.rs::segment_cache_ref_tests` verifying VortexSegmentCacheRef preserves Arc::as_ptr identity through Clone / From / Deref. End-to-end harness deferred to PR-3.1 entry per documented carry-forward. + - **Cycle-2 fix C2-001 (commit `edbbfc2dd`)**: added `segment_cache: None` to expand_datasets.rs:299 — a second construction site of FileScanIR::Vortex that cycle-1 missed; `cargo check --features vortex,python` now passes. + - **Cycle-2 fix F-002 (commit `2025b9e1d`)**: rewrote Deferred-work entry at plan:1862 from "STILL DEFERRED" to "PARTIALLY RESOLVED" referencing the wrapper tests + carry-forward of end-to-end harness to PR-3.1. + - **Cycle-3 fix (commit `efcbc92f2`)**: cargo fmt on read/mod.rs:122 (the C-003 test assert_eq exceeded rustfmt's max-width). +- **Tests added**: 3 new unit tests for `VortexSegmentCacheRef` Arc identity preservation. Total polars-vortex test count: 66 → 69 (46 unit + 23 integration). +- **Review**: 2-vote (gauntlet `preset=pr-2`, lenses=fresh+correctness) / **accepted at cycle 4** (cycles: 4). Cycle 1: 3 must-fix (C-001 schema-supplied multi-file blowup; C-002 lifecycle leak; C-003 regression test contradiction). Cycle 2: 2 must-fix (C2-001 expand_datasets compile error; F-002 Deferred-text not updated). Cycle 3: 1 must-fix (cargo fmt). Cycle 4: 0 findings, both reviewers high confidence. Full Synthesizer Output JSON in plan-commit bodies for each cycle (`a443d0997`, `2025b9e1d`, `458894b9f`, `e0e2c1d76`). +- **Confidence**: high +- **Deferred items**: 0 new in Deferred-work bullets (carry-forward should-fix items from cycle-1/2/3 are tracked inline in plan-commit JSON bodies + the existing PR-1.4 migration entry covers Dedicated double-resolve resolution status). Cumulative `deferred_items_total: 6` unchanged. +- **Surprises during implementation**: + - **Cycle-1 missed a second construction site**: PR-2.0 commit 3 added the `segment_cache` field to FileScanIR::Vortex but missed the construction at `expand_datasets.rs:299` (post-expansion catch-up path for python-dataset providers). Discovered in cycle-2 gauntlet via `cargo check --features vortex,python` — the verified-clean check in the original commit message used `--features new_streaming` which doesn't compose with `vortex`. Process lesson: when adding a struct field, grep ALL constructions, not just the obvious ones; verify with `--all-features` not just per-crate feature combos. + - **Cycle-2 atomic commit conflated fix + counter updates**: commit `2025b9e1d` bundled the F-002 fix with cycle-2 Synthesizer JSON archive + counter updates. Per Step 2.4 discipline this should have been a fix-commit + separate decrement-commit; deviation noted but no functional impact. + - **Cycle-3 surfaced a rustfmt regression from cycle-1 work**: the C-003 unit tests in 4abb933f6 had one `assert_eq!` slightly over rustfmt's max-width — broke `cargo fmt --check`. Should have run cargo fmt locally before the cycle-1 fix push. Process lesson: include `cargo fmt --check` in the pre-push verification routine. + - **Pre-existing typos CI gap on .big-plans/ content**: `_typos.toml` doesn't exclude `.big-plans/` while `dprint.json` does. Several `mis-` compound words in the plan file (from prior cycles' review prose archives) keep tripping the typos check. Cycle-4 acknowledged this as pre-PR-2.0 and explicitly carry-forward; a one-line `_typos.toml` change (mirror dprint's `.big-plans/` exclude) would close it permanently. Not in PR-2.0 scope. + ## Resolved phase-end must-fix items — Phase 1: Ratify + crates.io transition — cycle 1 | Severity | File:line | Description | Implicated PR | Resolved | @@ -1858,8 +2367,33 @@ Seeded with carry-forward items from the existing plan's §13 that may surface a - **PR-1.2 Rust dispatch tighten + Rust unit test** (`crates/polars-python/src/lazyframe/general.rs:334-359`): pyo3 `new_from_vortex` match silently ignores `cache_dedicated_bytes` for `('global', _)` / `('off', _)` arms. Defensive `('dedicated', None)` and `(other, _)` arms are unreachable from Python's `_resolve_cache_mode` and have no Rust test coverage. Two paths: (a) tighten arms to require `None` for non-dedicated kinds with explicit error, (b) downgrade unreachable arms to `debug_assert!` / `unreachable!`. Plus add a Rust `#[test]` exercising all four arms via direct `PyLazyFrame::new_from_vortex` construction. Modest scope (~30 LoC + test infrastructure); deferred to a follow-up PR because PR-1.2 already grew significantly beyond planned scope absorbing CI-greenup. (Deferred from PR-1.2 gauntlet cycle 1 should-fix #3, 2026-05-15.) - **Vortex sink producer-error inline comment trim + minor polish** (`crates/polars-stream/src/nodes/io_sinks/writers/vortex/mod.rs:127-145`): the inline comment introduced with MF-002 (~370 chars across 11 lines) is on the borderline of the project BAN against >100-char justifications. CORRECTNESS-LENS NOTE (cycle-2 phase-end review): the comment's "whichever fires first" framing is inaccurate — the producer/writer error path is actually DETERMINISTIC under all error scenarios. Tracing the await sequence: `producer.await?` at the join site polls the producer's JoinHandle to completion before `write_handle.await` runs. When the producer returns `Err(e)`, the user-visible error is always the original PolarsError; the VortexError-wrapped form sent through the channel is silently discarded when `?` unwinds and AbortOnDrop fires on `write_handle`. The channel-Err's role is to force the writer to bail before finalizing the footer (defense in depth), NOT to provide an alternate user-facing error path. Polish opportunity: trim the comment to 3-4 lines and correct the framing. Modest doc-only effort; out of PR-1.4 scope. (Deferred from PR-1.3 inner-loop cycle 1 should-fix #5 + corrected by PR-1.4 cycle-2 review, 2026-05-16.) - **Vortex sink producer-error regression test** (Python-level): MF-001 + MF-002 in PR-1.3 alter observable error-path behavior of `sink_vortex` (now produces a non-OK error on producer failure instead of a truncated-but-valid Vortex file). The existing Deferred work entry "Rust-level tests for the polars-stream Vortex source/sink" covers Rust-level harness construction. A NARROWER Python-level test (e.g., `py-polars/tests/unit/io/test_vortex.py` triggering producer error mid-stream via misaligned chunks or unsupported dtype and asserting `pl.exceptions.ComputeError` containing 'vortex sink producer' + invalid/absent on-disk file) would protect the MF-001/MF-002 guarantees without the Rust harness. Modest scope (~30 LoC test); deferred because this worktree has no installed py-polars wheel for iterative test development. (Deferred from PR-1.3 inner-loop cycle 1 should-fix #6, 2026-05-15.) +- **PR-2.1 cycle-1 carry-forward should-fix items (3 items deferred; bundled entry)**: cycle-1 gauntlet accepted PR-2.1 with 5 should-fix items. 2 addressed in commit `06f4f8592` (bitwise-vs-logical TODO + EqValidity None-tests). 3 deferred to follow-up: + - (i) **Tautological tests** (cycle-1 fresh F-001 + correctness #3): the 23 vortex_convertor unit tests assert `.is_some()` only without inspecting the produced Vortex `Expression` structure. A paste-swap bug between Eq/NotEq or Lt/Gt or And/Or or IsNull/IsNotNull would not be caught. Mitigation: assert on `Expression::to_string()` or proto-form structural equality. Same pattern flagged in PR-2.0 cycle-2 on the C-003 wrapper-identity tests; not yet addressed there either. Coupled rewrite (~80 LoC) plausibly deferred to PR-2.6's cleanup or a Phase 2.x housekeeping sub-PR. + - (ii) **Eager recursive arg conversion in Boolean Function arm for unsupported variants** (cycle-1 fresh F-004): the `AExpr::Function { Boolean(_) }` arm recursively converts `arg_node` BEFORE the inner match decides whether to map (IsNull/IsNotNull/Not) or fall through. For unsupported variants (IsBetween/IsIn/IsClose/AllHorizontal/AnyHorizontal/... 16 of 19 IRBooleanFunction variants), the recursive work is wasted. Negligible perf cost (predicates are small); structurally suboptimal. Easy fix is to match the boolean variant first; ships naturally alongside PR-2.2-.5's shape additions. + - (iii) **Recursive-walk stack-overflow risk on pathological inputs** (cycle-1 correctness nit #5): deeply nested predicates (thousands of clauses) would consume one stack frame per AExpr node. Vortex's own `and_collect`/`or_collect` builders use balanced binary trees to avoid this (see `vortex-array/src/expr/exprs.rs:345-356`). Typical predicate depths are small (~5-10 levels); guard via depth-counter or batch-flatten before recursing. Deferred to Phase 4 polish unless benchmarks show issues. + - **Vortex sink Writeable match non-exhaustive under `polars-stream --features vortex` (without `cloud`)** (`crates/polars-stream/src/nodes/io_sinks/writers/vortex/mod.rs:91`): `Writeable::Cloud(_)` arm is `#[cfg(feature = "cloud")]` on polars-stream's own `cloud` feature, but the underlying enum's Cloud variant remains visible because `polars-io` (a non-optional polars-stream dep with `features = ["async", "file_cache"]`) enables `polars-io/cloud` transitively via `file_cache`. `cargo check -p polars-stream --features vortex` fails with E0004; `cargo check -p polars --features vortex,cloud,parquet,dtype-full` is clean because that combo keeps `polars-stream/cloud` on. Predates PR-1.3 (commit `bbe16b34a` introduced the cloud sink). Two fix paths: (a) make polars-stream's `vortex` feature transitively enable `cloud` (mirroring how `parquet` may handle it), or (b) add a fall-through wildcard arm in the Vortex sink match. Out of PR-1.3 scope (a 6-must-fix patch); track for Phase 2 cleanup or a follow-up PR. (Deferred from PR-1.3 inner-loop cycle 2 should-fix #1, 2026-05-15.) -- **PR-1.4 cycle-1 should-fix items migrated from plan-commit JSON body** (commit `24f8f5d4f`): the cycle-1 inner-loop review surfaced 4 should-fix items, captured in the commit body's Synthesizer Output JSON rather than in this Deferred work section. Migration tracking (PR-2.0 cycle-4 re-plan housekeeping): (a) **Dedicated double-resolve perf** — RESOLVED in PR-2.0 commit 3 via the `FileScanIR::Vortex.segment_cache` thread-through refactor (one resolved cache per logical scan instead of two). (b) **segment_cache regression test** — STILL DEFERRED. Add a unit test that `Dedicated(N)` produces ONE Moka cache instance per scan (measurable via `Arc::as_ptr` comparison between the IR-time and streaming-time cache references OR Moka hit-count after IR-time fetch). Modest scope (~20 LoC test); deferred because PR-2.0's scope is already substantive and the refactor itself is the load-bearing change. (c) **long doc-comments** (cycle-1 maint observation; producer-error inline comment + array_bridge SAFETY/PERF comments approaching the >100-char BAN threshold) — partially addressed in PR-2.0 commit 2 (producer-error inline comment trimmed to 3-4 lines, "whichever fires first" framing removed); array_bridge SAFETY comments deferred to Phase 4 polish per cycle-3 arch should-fix #5. (d) **signature shape** (vortex_file_info's many-positional-args shape vs `&VortexScanOptions` parameter object) — partially addressed in PR-2.0 commit 3 via the `VortexSegmentCacheRef` type alias at the signature site; the larger refactor to a parameter object remains deferred to Phase 3 (PR-3.1 will touch the same function for table_statistics population and is the natural moment to consolidate). +- ~~**Virtual-column-partitioned Vortex scans don't benefit from AExpr convertor pushdown**~~ — **RESOLVED in PR-2.8** (commits `9d9469f5a` + `f55e00b3e9` + `c9fd818616`): new helper `aexpr_file_minterms_to_vortex_expression` in `vortex_convertor.rs` walks top-level conjuncts via MintermIter, drops minterms whose leaves include any virtual col, and AND-collects the rest via Vortex's `and_collect`. Wired at the `FileScanIR::Vortex` arm of `lower_ir.rs` with the virtual_cols set built from `hive_parts.schema() + row_index.name + include_file_paths`. Strictly-better than the previous `aexpr_to_vortex_expression` direct call even when no virtual cols present (partial conversion of mixed-shape predicates: e.g., `a == 1 AND unsupported_op(b)` pushes `a == 1` instead of refusing the whole tree). 5 unit tests + 3 e2e tests (hive / row_index / include_file_paths virtual-col scenarios). Accepted at cycle 3 of the 2-vote pr-2 gauntlet (cycle 2 must-fix on test discriminator: `str.contains("a")` vs `str.ends_with("a.vortex")` for `include_file_paths` full-path semantics). (Resolved 2026-05-18.) + +- **Vortex `wrapping_add` (or non-fallible add) public API** (PR-2.2 cycle-1 must-fix M2 — partial resolution): The current `Plus → checked_add` mapping has a semantic divergence with Polars's wrapping `+`: Vortex errors at scan-time on integer overflow while Polars wraps. For typical OLAP queries with small-int data this is rare, but `col + 1 == big_value` on a column near MAX errors out instead of producing wrapped-then-compared results. PR-2.2 documents this in the function doc-comment; the proper fix is for Vortex to expose `wrapping_add` (or similar) in `vortex::expr::*` so polars-vortex can prefer it for Polars Plus semantics. Tracking as an upstream-Vortex coordination item — file when polars-vortex hits a real-world user query that surfaces the divergence, or as part of PR-2.5 (which already coordinates Vortex's `datetime_parts` op). (Deferred from PR-2.2 cycle-1 must-fix M2 / F-MF-002, 2026-05-16.) + +- ~~**Plus convertor cross-PType supertype gate**~~ — **RESOLVED in PR-2.4** (commit `eed93c119` + cycle-2 `8245aaf48`): added pairwise-equal-PType gates to BOTH the Plus arm (commit 1, proactive per the cycle-2 H4 carry-forward) AND the comparison arms (cycle-2 should-fix F-COMPARE-CROSS-PTYPE-001, also surfaced via H4 sibling check). `resolve_inner_dtype` extended to handle Cast/Plus/Function-StructField for the gate's dtype resolution. Tests `shape_plus_cross_ptype_returns_none`, `shape_plus_int_plus_float_returns_none`, `shape_plus_uint_plus_int_returns_none`, `shape_eq_cross_ptype_returns_none`, `shape_lt_cross_ptype_returns_none` lock the refuse paths. + +- ~~**Two additional §5-row shapes not yet in the AExpr-direct convertor**~~ — **RESOLVED in PR-2.7** (commits `065df50278` cycle-1 impl + `b2b9659670` / `85ca567308` / `90984521c3` cycle-1 must-fix gates + `fa89c9b395` / `d457a1300a` / `a9918a201b` / `aebb30250e` / `5e68b4f2d0` cycle-2 polish + `71540df774` / `024341dc21` final): `String::Contains{literal:true}` and `Ternary { predicate, then, otherwise }` ported as part of PR-2.7's 6-shape sweep (alongside is_between, is_in, starts_with, ends_with). Contains{literal:true} maps to `like(col, lit("%" ++ needle ++ "%"))`; Ternary maps to `case_when([(predicate, then)], Some(otherwise))` via Vortex's publicly-exposed `case_when` builder. Schema gates: Utf8 input for StringExpr arm, THEN/ELSE pairwise-dtype for Ternary arm (each caught at PR-2.7 cycle 1 as must-fix). Phase 2 exit criterion (b) satisfied. (Originally deferred from Phase 2 phase-end spec-lens review 2026-05-16; resolved 2026-05-18.) + +- ~~**PR-2.6 cutover-lost pushdown shapes**~~ — **RESOLVED in PR-2.7** (same commit set as the entry above): all 4 cutover-lost shapes (`is_between(lo, hi)`, `is_in([scalars])`, `str.starts_with(prefix)`, `str.ends_with(suffix)`) ported into the AExpr-direct convertor at `crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs`. Each shape has a structural-assertion unit test (`shape_is_between_structural`, `shape_starts_with_structural`, `shape_ends_with_structural`) + a positive e2e Python test. `is_between` adds a pairwise-PType schema gate (mirroring the pattern from PR-2.3 CAST kind + PR-2.4 Plus pairwise-PType + comparison pairwise-PType gates). `is_in` adds defensive `nulls_equal=true + had_nulls=true` refuse (matches legacy path semantics) + reuses `try_extract_is_in_haystack` from polars-plan. `bytes_to_like_literal` helper resurrected verbatim from the deleted legacy path, inlined into `vortex_convertor.rs` (not re-exported through polars-vortex) — refuses `%` / `_` / `\\` to avoid LIKE-semantics widening. Total: 6 new convertor arms (Contains + Ternary from the entry above, plus these 4) + 88 unit tests (was 38 pre-PR-2.7) + 8 new e2e tests. Phase 2 exit criterion (b) satisfied via implementation (not deferral) for these 4 shapes. (Originally deferred from PR-2.6 cycle-1 must-fix MF-001/002/003 2026-05-16; resolved 2026-05-18.) + +- **Temporal-extract predicate pushdown (`col.dt.year() == 2024` etc.)** (PR-2.5 slip, 2026-05-16): Vortex 0.70.0 does not publicly expose `year` / `month` / `day` / `hour` / `minute` / `datetime_parts` / `date_part` builders in `vortex::expr::*` (the `scalar_fn/fns/` enumeration excludes them; the `extension/datetime/` module defines dtypes but no extract functions). Per the PR-2.5 plan row's contingency, the PR is deferred to a follow-up after Vortex exposes the relevant builders. Resolution paths: (a) wait for upstream Vortex to expose `datetime_parts` (file an issue / coordinate with Vortex maintainers), (b) bump the pinned Vortex version once a newer release ships the builders, OR (c) contribute the temporal-extract builders to upstream Vortex (out of scope for this branch per the plan's "no upstream Vortex API redesign" exclusion). Convertor's residual fallback handles `IRFunctionExpr::TemporalExpr(..)` shapes correctly via the existing `_ => None` arm; multi-scan re-applies post-decode so correctness is preserved. Perf-only deferral. (Slipped from PR-2.5, 2026-05-16.) + +- **Float16 (`PType::F16`) support in the convertor** (PR-2.3 cycle-2 perf-miss carry-forward): Vortex has `PType::F16` (`vortex-array/src/dtype/ptype.rs:54`); Polars added `DataType::Float16` (cf. polars-core/datatypes/dtype.rs:102). Neither `polars_dtype_to_vortex_dtype` nor `is_vortex_numeric_dtype` handles Float16, so `cast(f16_col, Float32)` refuses pushdown (conservative SOUND), Plus(f16, f16) also refuses. Perf-miss only — adding a Float16 arm to both helpers + corresponding Plus arm in `is_vortex_numeric_dtype` would enable Float16 pushdown. Modest scope (~6 LoC + 2 tests); defer until Float16 use cases surface in the Vortex integration. (Deferred from PR-2.3 cycle-2, 2026-05-16.) + +- **`POLARS_VORTEX_VERIFY_PUSHDOWN=1` debug-mode pushdown-engagement verification** (PR-2.2 cycle-1 escalation — partial resolution): The plan's PR-2.2 row originally specified `POLARS_VORTEX_VERIFY_PUSHDOWN=1` as a debug env-var that emits divergences between the AExpr-direct and legacy `SpecializedColumnPredicate` paths during the parallel window (PR-2.2-.5). PR-2.2 ships a NARROWER form: the structural-assertion unit test `shape_plus_arithmetic_structural` verifies the convertor produces the expected `checked_add` shape for the load-bearing Plus case. The Python-level e2e test `test_scan_with_arithmetic_filter` does NOT verify pushdown engagement (it would pass even if the convertor silently returned None, because the multi-scan layer reapplies post-decode). The full divergence-debug mode requires either (a) instrumenting `VortexFileReader::begin_read` to emit `POLARS_VERBOSE` lines showing which filter path fired, with a Python test asserting via `capfd`, or (b) exposing an `IOMetrics`-style counter for "AExpr-direct path engaged" vs "legacy path engaged" that the test can read. Either is ~30 LoC. Deferred to PR-2.5 entry (the last AExpr-shape PR before PR-2.6 cutover, where it's the natural debug-coverage anchor for the parallel-path window's final cycles). (Deferred from PR-2.2 cycle-1 C-002 / F-SF-001, 2026-05-16.) + +- **`is_in` arena-level structural unit test** (Phase 2 cycle-2 phase-end spec-lens should-fix, 2026-05-18): the convertor's `is_in` arm at `crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs:392-432` is covered by e2e Python tests but lacks an arena-level unit test with structural assertion (analogous to `shape_is_between_structural` etc. for the other 5 PR-2.7 shapes). The other 5 shapes' unit tests format the produced Vortex `Expression` via `Display::fmt` and assert positive + negative anchors — paste-swap defense. `is_in` requires constructing a `LiteralValue::Scalar(Scalar::new(DataType::List(Box::new(Int32)), AnyValue::List(series)))` haystack node, which needs spinning up a polars-core `Series` + the appropriate `SpecialEq`/`AnyValue::List` shell (~20-30 LoC of test setup vs ~10 LoC for other shapes). A regression where the helper silently returned `None` would not be caught at the unit-test layer; the e2e test would only catch a wiring break (Vortex bailing on missing column) or an incorrect result. Modest scope (~30 LoC + Series literal helper). Deferred — the e2e coverage protects user-visible correctness; this is defense-in-depth at the convertor-arm level. Tracked for Phase 3 housekeeping or a follow-up PR. (Deferred from Phase 2 cycle-2 phase-end spec-lens, 2026-05-18.) + +- **Float NaN semantic divergence in convertor float-comparison arms** (PR-2.7 cycle-2 nit #8, 2026-05-18): Polars's `is_in` (`polars-ops/src/series/ops/is_in.rs:19,26` — `TotalHash + TotalEq + ToTotalOrd`) uses TotalOrd which treats `NaN == NaN` as true (`polars-compute/src/comparisons/simd.rs:184-191 tot_eq_kernel` explicitly ORs `lhs_is_nan & rhs_is_nan` into the equality result). Vortex's `eq` kernel (`vortex-array/src/scalar_fn/fns/binary/compare.rs:174` → arrow `cmp::eq`) uses IEEE 754 where `NaN != NaN`. So `pl.col(float_col).is_in([NaN])` returns True for `col=NaN` under Polars, but the convertor's pushed `or(eq(col, NaN), ...)` returns null (which Polars filter drops as false) for `col=NaN`. This divergence ALSO affects the foundation `eq`/`not_eq`/`lt`/`lt_eq`/`gt`/`gt_eq` arms (pre-existing since PR-2.1) and PR-2.7's new `is_in` / `is_between` arms extend the surface area. The legacy `SpecializedColumnPredicate::EqualOneOf` path had the same divergence. Not introduced by PR-2.7; PR-2.7 just amplifies the surface. **Mitigation paths**: (a) refuse pushdown when ANY float literal in a comparison is NaN — detectable via `f64::is_nan()` / `f32::is_nan()` on the `AnyValue` at the `polars_scalar_to_vortex` boundary, conservative, ~10 LoC; (b) accept the divergence and document as a known limitation (residual reapply preserves correctness, so the user-visible result is correct — just sometimes computed less efficiently). Option (a) is cheap and conservative. Modest scope (~10 LoC + unit tests like `shape_is_in_float_with_nan_returns_none`). Deferred — not blocking PR-2.7 since residual reapply gives correct Polars semantics. Worth tracking so a future read doesn't assume float pushdown is fully isomorphic. (Deferred from PR-2.7 cycle-2 correctness nit #8, 2026-05-18.) + +- **PR-1.4 cycle-1 should-fix items migrated from plan-commit JSON body** (commit `24f8f5d4f`): the cycle-1 inner-loop review surfaced 4 should-fix items, captured in the commit body's Synthesizer Output JSON rather than in this Deferred work section. Migration tracking (PR-2.0 cycle-4 re-plan housekeeping): (a) **Dedicated double-resolve perf** — RESOLVED in PR-2.0 commit 3 via the `FileScanIR::Vortex.segment_cache` thread-through refactor (one resolved cache per logical scan instead of two). (b) **segment_cache regression test** — PARTIALLY RESOLVED in PR-2.0 commit `4abb933f6` (3 wrapper-level unit tests in `crates/polars-vortex/src/read/mod.rs::segment_cache_ref_tests` verifying VortexSegmentCacheRef preserves Arc identity through Clone / From / Deref — the single-cache contract at the wrapper layer). The full end-to-end harness (`pl.scan_vortex(['a.vortex','b.vortex'], cache_mode=Dedicated(N))` confirming ONE Moka cache instance across IR-build + multi-file streaming reads via `Arc::as_ptr` agreement OR Moka hit-count assertion) is **still DEFERRED to PR-3.1's `table_statistics` entry** per the cycle-3 reviewer recommendation: PR-3.1 already touches `vortex_file_info` and is the natural moment to add the end-to-end test harness (gate via debug-only `Arc::as_ptr` inspection or Moka `EntryWeigher` instrumentation). Modest scope (~30 LoC test); the wrapper-level unit tests in PR-2.0 close the cycle-1/cycle-2 must-fix C-003 contradiction at the artifact level — a future regression of the IR-thread-through invariant would manifest as the wrapper-tests passing while the end-to-end test fails, which PR-3.1's harness will catch. (c) **long doc-comments** (cycle-1 maint observation; producer-error inline comment + array_bridge SAFETY/PERF comments approaching the >100-char BAN threshold) — partially addressed in PR-2.0 commit 2 (producer-error inline comment trimmed to 3-4 lines, "whichever fires first" framing removed); array_bridge SAFETY comments deferred to Phase 4 polish per cycle-3 arch should-fix #5. (d) **signature shape** (vortex_file_info's many-positional-args shape vs `&VortexScanOptions` parameter object) — partially addressed in PR-2.0 commit 3 via the `VortexSegmentCacheRef` type alias at the signature site; the larger refactor to a parameter object remains deferred to Phase 3 (PR-3.1 will touch the same function for table_statistics population and is the natural moment to consolidate). ## Accepted tradeoffs / r1 traps @@ -1867,7 +2401,7 @@ Items reviewers may otherwise re-surface but that the user has explicitly accept - **`mem::transmute` in `read/array_bridge.rs` + `write/array_bridge.rs` + `write/df_to_stream.rs`**: accepted. polars-arrow's `ArrowArray` / `ArrowSchema` FFI structs have `pub(super)` fields — `mem::transmute` is the only way an external crate can construct these from upstream FFI structs. Compile-time `size_of` + `align_of` asserts + runtime length check provide safety. The clean alternative (`from_ffi_parts(...)` upstream API) is a separate polars contribution. - **Single Tokio runtime (Polars' global `ASYNC`) for Vortex async work**: accepted; the alternative (Vortex spinning its own runtime) doubles thread-pool overhead. `Handle::new(Arc::downgrade(Arc::new(ASYNC.handle()) as Arc))` is the established pattern. -- **`SpecializedColumnPredicate` fast path preserved during PR-13 transitional phases (Option B trajectory)**: accepted. Subagent 7 recommended this for de-risking; PR-13.6 deletes the old path once the AExpr-direct path proves itself across PR-13.2–.5 e2e tests. +- ~~**`SpecializedColumnPredicate` fast path preserved during PR-13 transitional phases (Option B trajectory)**: accepted.~~ **NO LONGER ACCEPTED** — PR-2.6 (commit `3956f17c5`) deleted the SpecializedColumnPredicate-derived fast path entirely; the AExpr-direct convertor is now the sole filter-pushdown path for Vortex scans. Option B → A cutover complete. (Kept in audit trail per the strikethrough convention used elsewhere — Phase 1 path-dep and visitor-cfg-gating tradeoffs use the same marker.) - **`hashbrown 0.16` (Polars) + `0.17` (Vortex transitive) coexistence**: accepted. BAN against bare `PlHashMap::new()` is the established workaround. - **~~Workspace `vortex = { path = "..." }` path-dep~~**: **NO LONGER ACCEPTED** — PR-1.1 migrates to crates.io `vortex = "0.70.0"`. (Kept in the audit trail as a strikethrough so reviewers cross-referencing the existing 1,145-line plan understand the change.) - **`create_skip_batch_predicate = false` for Vortex** (explicit branch at `polars-mem-engine/src/planner/lp.rs:448-459`): accepted. Vortex's `LayoutReader::pruning_evaluation` already does zone-level pruning; Polars' per-row-group skip-predicate machinery would be redundant. Explicit branch makes this intentional, not implicit. diff --git a/crates/polars-plan/src/plans/aexpr/predicates/mod.rs b/crates/polars-plan/src/plans/aexpr/predicates/mod.rs index 8623d80c32ea..d6a72a802f64 100644 --- a/crates/polars-plan/src/plans/aexpr/predicates/mod.rs +++ b/crates/polars-plan/src/plans/aexpr/predicates/mod.rs @@ -1,5 +1,7 @@ mod column_expr; mod skip_batches; +#[cfg(feature = "vortex")] +pub mod vortex_convertor; use std::borrow::Cow; diff --git a/crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs b/crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs new file mode 100644 index 000000000000..6574908e885f --- /dev/null +++ b/crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs @@ -0,0 +1,2983 @@ +//! AExpr-direct convertor for Polars predicates → Vortex `Expression` (PR-13 path). +//! +//! Translates Polars [`AExpr`] trees into Vortex [`Expression`] trees for filter pushdown, +//! walking the `Arena` directly. As of PR-2.6 (Option B → A cutover) this is the +//! SOLE filter-pushdown path for Vortex scans; the legacy +//! `polars_vortex::read::predicate::polars_to_vortex_predicate` (which consumed +//! pre-extracted [`polars_io::predicates::SpecializedColumnPredicate`] shapes) was +//! deleted in PR-2.6. The AExpr-direct path handles everything the legacy path handled +//! plus shapes the optimizer doesn't pre-extract — multi-column comparisons, arithmetic +//! in predicates, CAST in predicates, struct field access in predicates. Temporal +//! extracts remain residual until upstream Vortex exposes the relevant builders (see +//! plan Deferred work). +//! +//! ## Why this lives in `polars-plan` (not `polars-vortex`) +//! +//! The original PR-2.1 plan row listed `crates/polars-vortex/src/read/aexpr_predicate.rs` +//! as the target location, but `polars-vortex` cannot depend on `polars-plan`: the workspace +//! `polars-plan/Cargo.toml` declares an optional `polars-vortex = { workspace = true, optional +//! = true }` dep gated on the `vortex` feature (mirrored at lines 28 + 80 of that file), so +//! the dependency arrow points polars-plan → polars-vortex. The convertor consumes [`AExpr`], +//! [`LiteralValue`], [`Operator`], and [`IRBooleanFunction`] — all polars-plan types — so it +//! has to live here. (The scalar-conversion helper +//! [`polars_vortex::read::predicate::polars_scalar_to_vortex`] is shared via a public +//! re-export so the AnyValue→VortexScalar mapping has one canonical source of truth across +//! both the SpecializedColumnPredicate path and this AExpr-direct path.) +//! +//! ## What this module covers (PR-2.1 foundation + PR-2.2 + PR-2.3 + PR-2.4 + PR-2.7 extensions) +//! +//! The 22 shapes below — the "kernel" the rest of PR-13 extends. (13 from PR-2.1's +//! foundation + 1 from PR-2.2: `addition (numeric)` + 1 from PR-2.3: `cast (same-kind +//! Primitive/Bool/Utf8 target)` + 1 from PR-2.4: `struct field access` + 6 from PR-2.7's +//! cutover-lost-shapes port: `is_between` / `is_in` / `starts_with` / `ends_with` / +//! `contains{literal:true}` / `Ternary`.) +//! +//! | Shape | AExpr matcher | Vortex builder | +//! |---|---|---| +//! | column reference | `AExpr::Column(name)` | `get_item(name, root())` | +//! | scalar literal | `AExpr::Literal(LiteralValue::Scalar(s))` | `lit(s)` | +//! | equality | `AExpr::BinaryExpr { op: Eq, .. }` | `eq` | +//! | inequality | `AExpr::BinaryExpr { op: NotEq, .. }` | `not_eq` | +//! | `<` | `AExpr::BinaryExpr { op: Lt, .. }` | `lt` | +//! | `<=` | `AExpr::BinaryExpr { op: LtEq, .. }` | `lt_eq` | +//! | `>` | `AExpr::BinaryExpr { op: Gt, .. }` | `gt` | +//! | `>=` | `AExpr::BinaryExpr { op: GtEq, .. }` | `gt_eq` | +//! | logical AND | `AExpr::BinaryExpr { op: And \| LogicalAnd, .. }` | `and` (schema-gated for `And`) | +//! | logical OR | `AExpr::BinaryExpr { op: Or \| LogicalOr, .. }` | `or` (schema-gated for `Or`) | +//! | addition (numeric) | `AExpr::BinaryExpr { op: Plus, .. }` (PR-2.2) | `checked_add` (schema-gated to numeric) | +//! | cast (same-kind) | `AExpr::Cast { dtype, options: Strict, .. }` (PR-2.3) | `cast` (kind-gated: Primitive↔Primitive, Bool↔Bool, Utf8↔Utf8) | +//! | struct field access | `AExpr::Function { StructExpr(FieldByName(n)), .. }` (PR-2.4) | `get_item(n, inner)` (schema-gated to require the field) | +//! | `is_null` | `AExpr::Function { Boolean(IsNull), .. }` | `is_null` | +//! | `is_not_null` | `AExpr::Function { Boolean(IsNotNull), .. }` | `is_not_null` | +//! | `not` | `AExpr::Function { Boolean(Not), .. }` | `not` (schema-gated) | +//! | `is_between(lo, hi, closed)` | `AExpr::Function { Boolean(IsBetween { closed }), .. }` (PR-2.7) | `(col gt[_eq] lo) AND (col lt[_eq] hi)` (schema-gated pairwise-PType) | +//! | `is_in([...])` | `AExpr::Function { Boolean(IsIn { nulls_equal }), .. }` (PR-2.7) | OR of equalities (refuses `nulls_equal=true + had_nulls`; per-scalar conversion failure refuses) | +//! | `str.starts_with(prefix)` | `AExpr::Function { StringExpr(StartsWith), .. }` (PR-2.7) | `like(col, lit("prefix%"))` (schema-gated Utf8 input; `bytes_to_like_literal` refuses LIKE wildcards `%`/`_`/`\`) | +//! | `str.ends_with(suffix)` | `AExpr::Function { StringExpr(EndsWith), .. }` (PR-2.7) | `like(col, lit("%suffix"))` (same gates) | +//! | `str.contains{literal:true}(sub)` | `AExpr::Function { StringExpr(Contains { literal: true, .. }), .. }` (PR-2.7) | `like(col, lit("%sub%"))` (same gates; `literal: false` refuses — Vortex LIKE doesn't do regex) | +//! | `Ternary { predicate, truthy, falsy }` | `AExpr::Ternary { .. }` (PR-2.7) | `case_when(condition, then, else)` (schema-gated THEN/ELSE pairwise-dtype) | +//! +//! ## What this module does NOT cover yet (PR-2.5 follow-ups + Phase 3 work) +//! +//! Remaining arithmetic (`Minus`/`Multiply`/divides/`Modulus`) → still residual; PR-2.2 +//! ships `Plus` only because `checked_add` is the only arithmetic builder publicly exposed +//! in `vortex::expr::*` at the pinned SHA. Cross-kind CAST (Primitive↔Bool/Utf8) and +//! `NonStrict`/`Overflowing` CAST options → still residual (PR-2.3 cycle-1 must-fix gates; +//! Vortex's per-array `CastKernel` is strictly within-kind and fail-on-overflow). Other +//! struct functions (`RenameFields`/`PrefixFields`/etc.) → not in predicate scope. Temporal +//! extracts (`AExpr::Function { IRFunctionExpr::TemporalExpr(..), .. }`) → PR-2.5 (slipped +//! to Deferred work; Vortex `datetime_parts` op unavailable at pinned SHA). Anything else +//! (`Sort`, `Gather`, `Filter`, `Agg`, `AnonymousFunction`, `Over`, `Rolling`, etc.) +//! returns `None` and falls through as residual; the multi-scan layer re-applies the full +//! predicate post-decode so dropping coverage is always SOUND, just suboptimal. +//! +//! ## Wiring +//! +//! Wired inside the `FileScanIR::Vortex` match arm of +//! `crates/polars-stream/src/physical_plan/lower_ir.rs::lower_ir` (search for +//! `FileScanIR::Vortex` — the line range shifts across cleanup PRs), where the [`AExpr`] +//! arena is live alongside the predicate `ExprIR`. The resulting `Expression` is attached +//! to the Vortex `VortexReaderBuilder.aexpr_filter` field via a Vortex-specific side +//! channel (parallel to how `FileScanIR::Vortex::metadata` and the (PR-2.0) `segment_cache` +//! thread). `VortexFileReader::begin_read` uses `aexpr_filter` directly. PR-2.6 deleted +//! the legacy `polars_to_vortex_predicate` (`SpecializedColumnPredicate`-derived) path; +//! the AExpr-direct convertor is now the sole filter-pushdown path. +//! +//! ## SCHEMA-GATE convention (added Phase 2 cycle-2 arch-lens should-fix) +//! +//! Each pattern-match arm in [`aexpr_to_vortex_expression`] that DIRECTLY constructs a +//! Vortex builder (i.e., bypasses the [`AExpr::BinaryExpr`] arm's gates by matching on a +//! more specific `AExpr` shape) carries a single-line `// SCHEMA-GATE: ` marker as +//! the first line of the arm body. Greppable via +//! `rg '// SCHEMA-GATE:' crates/polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs`. +//! +//! The marker codifies a recurring discipline: PR-2.2/.3/.4/.7 each surfaced the same bug +//! class — arms that build Vortex builders directly without revalidating operand types +//! against the always-SAFE-fallback contract emit expressions that bail at scan-time. The +//! markers force a future arm-adder to either (a) replicate the gate or (b) explicitly +//! justify why no gate is needed for the new shape. +//! +//! Current gates: +//! - `AExpr::BinaryExpr` → And/Or boolean-only + Plus pairwise-equal-PType + comparison pairwise-equal-PType +//! - `is_between` → pairwise-PType across col/lo/hi (col must be numeric and match both bounds) +//! - `is_in` → `try_extract_is_in_haystack` enforces haystack dtype == column dtype; refuse on `nulls_equal + had_nulls` +//! - `Function::Boolean(boolean_fn)` (Not/IsNull/IsNotNull) → boolean-only for Not +//! - `StructField` → schema-membership (struct must contain the requested field) +//! - `StringExpr` (starts_with/ends_with/contains{literal:true}) → Utf8 input +//! - `Cast` → source-kind compat + Strict-options only +//! - `Ternary` → THEN/ELSE pairwise-dtype equality + +use polars_core::chunked_array::cast::CastOptions; +#[cfg(feature = "strings")] +use polars_core::prelude::AnyValue; +use polars_core::prelude::DataType; +#[cfg(feature = "is_in")] +use polars_core::scalar::Scalar; +use polars_core::schema::Schema; +#[cfg(feature = "is_between")] +use polars_ops::series::ClosedInterval; +use polars_utils::aliases::PlHashSet; +use polars_utils::arena::{Arena, Node}; +use polars_utils::pl_str::PlSmallStr; +#[cfg(feature = "strings")] +use polars_vortex::vortex::array::scalar::Scalar as VortexScalar; +use polars_vortex::vortex::dtype::{DType, Nullability, PType}; +#[cfg(feature = "strings")] +use polars_vortex::vortex::expr::like; +#[cfg(feature = "is_in")] +use polars_vortex::vortex::expr::or_collect; +use polars_vortex::vortex::expr::{ + Expression, and, and_collect, case_when, cast, checked_add, eq, get_item, gt, gt_eq, + is_not_null, is_null, lit, lt, lt_eq, not, not_eq, or, root, +}; + +use crate::dsl::Operator; +use crate::plans::AExpr; +use crate::plans::aexpr::MintermIter; +#[cfg(feature = "strings")] +use crate::plans::aexpr::function_expr::IRStringFunction; +#[cfg(feature = "dtype-struct")] +use crate::plans::aexpr::function_expr::IRStructFunction; +use crate::plans::aexpr::function_expr::{IRBooleanFunction, IRFunctionExpr}; +use crate::plans::lit::LiteralValue; +use crate::utils::aexpr_to_leaf_names_iter; + +/// Convert a Polars AExpr predicate tree into a Vortex [`Expression`] for pushdown. +/// +/// Returns `Some(expr)` when every node in the tree maps to a Vortex expression; returns +/// `None` if ANY node is unsupported (Vortex can't filter on what it can't represent, so +/// partial pushdown would be incorrect — *narrower* than the user's predicate). The +/// multi-scan layer always re-applies the full predicate post-decode, so `None` is the +/// safe fallback. +/// +/// # Parameters +/// +/// - `root_node` — the AExpr root to convert. The convertor walks the tree rooted here. +/// - `arena` — the [`AExpr`] arena. Borrowed immutably (no nodes added). +/// - `schema` — the resolved [`Schema`] of the columns the AExpr references. Used by +/// three gates: (a) the And/Or/Not bitwise-vs-logical gate (refuses pushdown on +/// integer operands); (b) the Plus arm's numeric gate (refuses pushdown on non-numeric +/// operands); (c) the CAST arm's source-kind gate (refuses cross-kind casts that +/// Vortex's per-array `CastKernel` rejects). When `None`, the convertor conservatively +/// refuses And/Or/Not AND Plus AND CAST pushdown. Production wire-up +/// (`physical_plan::lower_ir`) always supplies `Some`; `None` is exposed only to keep +/// ad-hoc unit-test construction ergonomic. +/// +/// # Returns +/// +/// `Some(Expression)` on full pushdown; `None` if any shape can't be translated. Always +/// SAFE — the caller treats `None` as "not pushable" and lets the residual filter run. +/// +/// # Bitwise-vs-logical operator caveat (addressed via schema gate) +/// +/// Polars' [`Operator::And`] / [`Operator::Or`] and [`IRBooleanFunction::Not`] are +/// **bitwise-OR-logical**: they work on integer columns as bitwise ops AND on bool columns +/// as logical ops. (Operator::And/Or dispatch through `aexpr/schema.rs::get_arithmetic_field` +/// which returns the operand dtype unchanged; for `Not`, see +/// `crates/polars-plan/src/plans/aexpr/function_expr/boolean.rs:49 // Also bitwise negate`.) +/// Vortex's [`and`], [`or`], and [`not`] are **boolean-only**. +/// +/// PR-2.2 mitigates by threading `schema` and gating each of And/Or/Not on +/// `operand_is_bool` (mirroring [`super::column_expr`]'s `dtype.is_bool()` guard at lines +/// 245-247). For `LogicalAnd` / `LogicalOr` we skip the gate because the IR-level "logical" +/// form is by construction boolean-typed. +/// +/// # Arithmetic semantic caveat (PR-2.2 / PR-13.2) +/// +/// Plus is mapped to Vortex's `checked_add` — the only arithmetic builder publicly exposed +/// in `vortex::expr::*` at the pinned SHA. Vortex's `checked_add` is **fallible on +/// overflow** (`vortex-array/src/expr/analysis/fallible.rs:36 +/// checked_add_defaults_to_fallible`): an integer overflow during scan errors out at +/// scan-time rather than silently wrapping (Polars' `+` operator wraps). The convertor +/// only emits `checked_add` when both operands are numeric (the `operand_is_numeric` gate); +/// String/Bool/Date/Datetime/Time/Duration/Struct/List Plus operations refuse pushdown +/// (without the gate, Vortex's `Binary::coerce_args` would `vortex_bail!` at scan-time, +/// violating the always-SAFE-fallback contract). For numeric Plus near boundary values, +/// the user observes +/// a scan-time error instead of Polars' wrapping behavior. This is a known semantic +/// divergence; see Deferred work (`Vortex wrapping_add public API`). +/// +/// # CAST semantic caveat (PR-2.3 / PR-13.3) +/// +/// Polars `AExpr::Cast` is mapped to Vortex's `cast` builder under **two gates** to +/// preserve the always-SAFE-fallback contract: +/// +/// 1. **`CastOptions::Strict` only.** Polars `NonStrict` (overflow→null) and +/// `Overflowing` (wrap) diverge from Vortex's `Primitive::CastKernel` fail-on-overflow +/// semantics (`vortex-array/src/arrays/primitive/compute/cast.rs:85-91` +/// `vortex_bail!`s on values exceeding target range). Pushing non-Strict down would +/// convert Polars' silent-or-null behavior into a hard scan-time `ComputeError`. +/// Refuse for any non-Strict option. +/// +/// 2. **Same-kind source/target only**, via `cast_kind_compatible`. Vortex's per-array +/// `CastKernel` impls are strictly within-kind: `Primitive::cast` returns `Ok(None)` +/// for non-Primitive targets (`primitive/compute/cast.rs:62-64`); `Bool::cast` for +/// non-Bool targets (`bool/compute/cast.rs:41-43`); `VarBinView::cast` for +/// non-Utf8/Binary (`varbinview/compute/cast.rs:60-62`). Cross-kind casts cause +/// `cast/mod.rs:120` to `vortex_bail!("No CastKernel ...")` at scan-time. Refuse +/// when source and target are in different kinds (Primitive ↔ Primitive, Bool ↔ Bool, +/// Utf8 ↔ Utf8 only). +/// +/// Together: cross-kind CAST (Int → String, Bool → Int, etc.) and non-Strict CAST fall +/// through to residual via `?`-propagation. Within-kind Strict CAST near boundary values +/// (e.g., `cast(Int64, Int8)` on overflow) WILL still scan-time-error — consistent with +/// Polars Strict semantics. +pub fn aexpr_to_vortex_expression( + root_node: Node, + arena: &Arena, + schema: Option<&Schema>, +) -> Option { + match arena.get(root_node) { + // --- leaves --- + AExpr::Column(name) => Some(get_item(name.as_str(), root())), + AExpr::Literal(lv) => convert_literal(lv), + + // --- comparisons + boolean combinators (BinaryExpr) --- + AExpr::BinaryExpr { left, op, right } => { + // SCHEMA-GATE: And/Or boolean-only + Plus pairwise-equal-PType + comparison pairwise-equal-PType. + // Bitwise-vs-logical schema gate (cycle-1 should-fix from PR-2.1): Polars + // `And/Or` are bitwise-or-logical (aexpr/schema.rs:127-149 dispatches through + // `get_arithmetic_field` so output dtype follows operand dtype). Vortex's + // `and`/`or` are boolean-only. Refuse pushdown when either operand is not + // boolean. `LogicalAnd`/`LogicalOr` are skipped — the IR-level "logical" form + // is by construction boolean-typed. + if matches!(op, Operator::And | Operator::Or) { + let s = schema?; + if !operand_is_bool(*left, arena, s) || !operand_is_bool(*right, arena, s) { + return None; + } + } + // Plus numeric + pairwise-equal-PType gate (PR-2.2 cycle-1 must-fix + PR-2.4 + // proactive fix for the cycle-2-surfaced sibling bug class): + // + // Vortex's `checked_add` requires `lhs.is_primitive() && lhs.eq_ignore_nullability(rhs)` + // (`vortex-array/src/scalar_fn/fns/binary/mod.rs:115-127` — `return_dtype` + // `vortex_bail!`s with "incompatible types for arithmetic operation" otherwise). + // Polars allows Plus on String (concat), Bool, Date+Duration, etc. — emitting + // `checked_add` on those (or on cross-PType operands like Int8+Int64) would + // bail at scan-time, violating the always-SAFE-fallback contract. + // + // In typical Polars usage, the TYPE_COERCION optimizer rule inserts a Cast + // BEFORE the Plus to align operand dtypes — so the Cast arm fires first and + // the outer Plus sees same-PType operands. When TYPE_COERCION is disabled (or + // an AExpr bypasses the optimizer), we still need to refuse pushdown. Two + // gates: (a) both operands are numeric (per `is_vortex_numeric_dtype`), + // (b) both operands resolve to the SAME numeric DataType. `resolve_inner_dtype` + // handles Column / Literal / Cast / comparisons; unresolvable shapes fall + // through to None → conservative refuse. + if matches!(op, Operator::Plus) { + let s = schema?; + if !operand_is_numeric(*left, arena, s) || !operand_is_numeric(*right, arena, s) { + return None; + } + // Pairwise-equal-PType gate (PR-2.4 proactive fix; addresses PR-2.3 + // cycle-2 H4 self-reinforcement finding). + let lhs_dt = resolve_inner_dtype(*left, arena, s)?; + let rhs_dt = resolve_inner_dtype(*right, arena, s)?; + if lhs_dt != rhs_dt { + return None; + } + } + // Comparison pairwise-equal-PType gate (PR-2.4 cycle-2 should-fix + // F-COMPARE-CROSS-PTYPE-001 — same bug class as the Plus gate above; + // surfaced by the cycle-1 fresh reviewer applying H4 to find the sibling). + // + // Vortex's `Binary::return_dtype` (`vortex-array/src/scalar_fn/fns/binary/mod.rs:130-136`) + // `vortex_bail!`s with "Cannot compare different DTypes" for comparison ops + // when `!lhs.eq_ignore_nullability(rhs) && !lhs.is_extension() && + // !rhs.is_extension()`. Extension types (Date, Datetime, Time, Duration) are + // exempt — Vortex permits Date<->Datetime comparison and similar. For + // non-extension cross-PType operands (Int32 vs Int64, etc.) we refuse + // pushdown to preserve the always-SAFE-fallback contract. TYPE_COERCION + // normally inserts a Cast that aligns dtypes; this gate covers the + // type_coercion-off path. Note: extension-type Polars dtypes aren't in + // `is_vortex_numeric_dtype` and don't currently route through this gate's + // dtype-resolution shapes anyway, so we don't need to special-case extension + // here. + if matches!( + op, + Operator::Eq + | Operator::NotEq + | Operator::Lt + | Operator::LtEq + | Operator::Gt + | Operator::GtEq + ) { + let s = schema?; + let lhs_dt = resolve_inner_dtype(*left, arena, s)?; + let rhs_dt = resolve_inner_dtype(*right, arena, s)?; + if lhs_dt != rhs_dt { + return None; + } + } + let lhs = aexpr_to_vortex_expression(*left, arena, schema)?; + let rhs = aexpr_to_vortex_expression(*right, arena, schema)?; + Some(match op { + Operator::Eq => eq(lhs, rhs), + Operator::NotEq => not_eq(lhs, rhs), + Operator::Lt => lt(lhs, rhs), + Operator::LtEq => lt_eq(lhs, rhs), + Operator::Gt => gt(lhs, rhs), + Operator::GtEq => gt_eq(lhs, rhs), + Operator::And | Operator::LogicalAnd => and(lhs, rhs), + Operator::Or | Operator::LogicalOr => or(lhs, rhs), + // Plus → Vortex `checked_add` (PR-2.2 / PR-13.2). The only Vortex + // arithmetic builder publicly exposed in `vortex::expr::*` is + // `checked_add`; Sub/Mul/Div remain residual until upstream exposes + // the corresponding `checked_sub`/etc. helpers (or until polars-vortex + // adopts the raw `Binary.try_new_expr(Operator::Sub, ...)` form). + Operator::Plus => checked_add(lhs, rhs), + // EqValidity / NotEqValidity — null-aware equality variants Vortex doesn't + // have a direct equivalent for; fall through to residual. + Operator::EqValidity | Operator::NotEqValidity => return None, + // Other arithmetic (Minus/Multiply/RustDivide/TrueDivide/FloorDivide/ + // Modulus) → still residual; PR-2.2 ships Plus only, per the plan's + // PR-13.2 acceptance test (`col + 1 == 5`). + Operator::Minus + | Operator::Multiply + | Operator::RustDivide + | Operator::TrueDivide + | Operator::FloorDivide + | Operator::Modulus => return None, + // Bitwise Xor — no Vortex equivalent in the predicate context; residual. + Operator::Xor => return None, + }) + }, + + // --- is_between (PR-2.7) --- + // `col.is_between(lo, hi, closed)` decomposed to + // `(col >= lo) AND (col <= hi)` (or strict variants per `closed`). + // Manual decomposition avoids the need to import Vortex's `BetweenOptions`, + // which lives in `vortex::scalar_fn::*` (outside polars-vortex's narrowed + // re-export — see `polars-vortex/src/lib.rs:24` `pub mod vortex`). Vortex's + // `Between` kernel lowers internally to the same comparison-pair, so pruning + // effectiveness is preserved. + // + // **Explicit pairwise-PType gate (PR-2.7 cycle-1 must-fix #1).** The arm + // constructs Vortex `gt`/`gt_eq`/`lt`/`lt_eq` builders DIRECTLY rather than + // re-entering the `BinaryExpr` arm, so the BinaryExpr arm's pairwise-PType + // gate would NEVER fire transitively. Cross-PType bounds (e.g., Int32 col + // with Int64 literal bounds when TYPE_COERCION is off) would `vortex_bail!` + // at scan-time in Vortex's `Binary::return_dtype` + // (`vortex-array/src/scalar_fn/fns/binary/mod.rs:130-136` — "Cannot compare + // different DTypes"), violating the always-SAFE-fallback contract. Same bug + // class as PR-2.3 cycle-1 CAST cross-kind and PR-2.4 cycle-2 comparison + // pairwise-PType must-fixes. Refuse pushdown when the col / lo / hi dtypes + // disagree; recursion through `aexpr_to_vortex_expression` is unsafe without + // this explicit guard. + // + // **Match-arm ordering**: must come BEFORE the general + // `Boolean(boolean_fn)` arm below — pattern matching is order-sensitive and + // the catchall would otherwise match first and return None via its inner + // `_ => None`. + #[cfg(feature = "is_between")] + AExpr::Function { + input, + function: IRFunctionExpr::Boolean(IRBooleanFunction::IsBetween { closed }), + .. + } => { + // SCHEMA-GATE: pairwise-PType across col/lo/hi (col must be numeric and match both bound dtypes). + if input.len() != 3 { + return None; + } + let s = schema?; + let col_dt = resolve_inner_dtype(input[0].node(), arena, s)?; + let lo_dt = resolve_inner_dtype(input[1].node(), arena, s)?; + let hi_dt = resolve_inner_dtype(input[2].node(), arena, s)?; + if col_dt != lo_dt || col_dt != hi_dt { + return None; + } + let col = aexpr_to_vortex_expression(input[0].node(), arena, schema)?; + let lo = aexpr_to_vortex_expression(input[1].node(), arena, schema)?; + let hi = aexpr_to_vortex_expression(input[2].node(), arena, schema)?; + let lower = match closed { + ClosedInterval::Both | ClosedInterval::Left => gt_eq(col.clone(), lo), + ClosedInterval::Right | ClosedInterval::None => gt(col.clone(), lo), + }; + let upper = match closed { + ClosedInterval::Both | ClosedInterval::Right => lt_eq(col, hi), + ClosedInterval::Left | ClosedInterval::None => lt(col, hi), + }; + Some(and(lower, upper)) + }, + + // --- is_in (PR-2.7) --- + // `col.is_in([v1, v2, ...])` decomposed to `(col == v1) OR (col == v2) OR ...`. + // Reuses the polars-plan-internal `try_extract_is_in_haystack` helper (per + // `super::column_expr.rs:218`'s same pattern) so the haystack extraction + // logic (constant-eval, list/array dispatch, null-drop) stays consistent + // across the SpecializedColumnPredicate path (deleted in PR-2.6) and this + // AExpr-direct path. Refuses pushdown when `nulls_equal=true` AND the + // haystack contained nulls — pushing requires emitting a `null` scalar into + // the OR, which `polars_scalar_to_vortex` doesn't support (it returns None + // for `AnyValue::Null`); safer to refuse than narrow the predicate. Refuses + // if ANY haystack element fails Polars→Vortex scalar conversion (the pushed + // filter must NOT be narrower than the user's predicate — would drop rows + // incorrectly). **Must come BEFORE the general Boolean(boolean_fn) arm.** + #[cfg(feature = "is_in")] + AExpr::Function { + input, + function: IRFunctionExpr::Boolean(IRBooleanFunction::IsIn { nulls_equal }), + .. + } => { + // SCHEMA-GATE: try_extract_is_in_haystack enforces haystack dtype == column dtype; refuse on nulls_equal + had_nulls. + if input.len() != 2 { + return None; + } + let s = schema?; + let column_dtype = resolve_inner_dtype(input[0].node(), arena, s)?; + let (haystack, had_nulls) = super::try_extract_is_in_haystack( + input[1].node(), + arena, + s, + &column_dtype, + usize::MAX, + )?; + if *nulls_equal && had_nulls { + return None; + } + let col_expr = aexpr_to_vortex_expression(input[0].node(), arena, schema)?; + let n_values = haystack.len(); + let terms: Vec = haystack + .iter() + .filter_map(|av| { + let scalar = Scalar::new(column_dtype.clone(), av.into_static()); + let vs = polars_vortex::read::predicate::polars_scalar_to_vortex(&scalar)?; + Some(eq(col_expr.clone(), lit(vs))) + }) + .collect(); + if terms.len() != n_values { + return None; + } + // `or_collect(vec![])` returns None when the haystack was empty + // (`is_in([])` matches no rows). Polars's post-decode residual + // reapply handles the all-false semantic correctly, so the refuse + // here is a missed optimization (we could pre-empt with + // `lit(VortexScalar::bool(false, NonNullable))`) — not a bug. + // PR-2.7 cycle-2 nit #7. + or_collect(terms) + }, + + // --- unary boolean functions (IsNull / IsNotNull / Not) --- + AExpr::Function { + input, + function: IRFunctionExpr::Boolean(boolean_fn), + .. + } => { + // SCHEMA-GATE: boolean-only for Not (IsNull/IsNotNull accept any dtype). + // All three of IsNull / IsNotNull / Not are unary — one Node input. + // `input` is `Vec`; we take the first and unwrap its node. + let arg_node = input.first().map(|expr_ir| expr_ir.node())?; + // Schema gate for `Not` — Polars `IRBooleanFunction::Not` is bitwise-or- + // logical (boolean.rs:49 `// Also bitwise negate`); Vortex `not` is + // boolean-only. Polars's own `column_expr.rs:245-247` performs the same + // `dtype.is_bool()` guard. IsNull/IsNotNull accept any dtype and produce + // boolean output, so no gate. + if matches!(boolean_fn, IRBooleanFunction::Not) { + let s = schema?; + if !operand_is_bool(arg_node, arena, s) { + return None; + } + } + let arg = aexpr_to_vortex_expression(arg_node, arena, schema)?; + match boolean_fn { + IRBooleanFunction::IsNull => Some(is_null(arg)), + IRBooleanFunction::IsNotNull => Some(is_not_null(arg)), + IRBooleanFunction::Not => Some(not(arg)), + // Other IRBooleanFunction variants (IsIn, IsBetween, AllHorizontal, etc.) + // are not in the PR-2.1 foundation scope; PR-2.2..PR-2.5 may add some. + _ => None, + } + }, + + // --- Struct field access (PR-2.4 / PR-13.4) --- + // `col.struct.field("inner") == "x"` against a struct column pushes down as + // `eq(get_item("inner", get_item("col", root())), lit("x"))`. The Polars AExpr + // shape is `Function { StructExpr(FieldByName(name)), input: [struct_col_expr] }`; + // mirrors vortex-duckdb's `TableFilterClass::StructExtract` precedent at + // `vortex-duckdb/src/convert/table_filter.rs:71-73`. + // + // Schema gate (cycle-2 process lesson from PR-2.3): Vortex's `GetItem.return_dtype` + // (`vortex-array/src/scalar_fn/fns/get_item.rs:94-96`) `vortex_err!`s at scan-time + // if the requested field name isn't in the struct's fields — same hostile-input + // class as the cycle-1 CAST cross-kind bug. Refuse pushdown when the schema is + // unavailable OR when the resolved inner dtype isn't a `Struct(fields)` containing + // the requested field. + #[cfg(feature = "dtype-struct")] + AExpr::Function { + input, + function: IRFunctionExpr::StructExpr(IRStructFunction::FieldByName(name)), + .. + } => { + // SCHEMA-GATE: schema-membership (struct must contain the requested field). + let arg_node = input.first().map(|expr_ir| expr_ir.node())?; + let s = schema?; + let inner_dtype = resolve_inner_dtype(arg_node, arena, s)?; + if !struct_field_exists(&inner_dtype, name) { + return None; + } + let inner = aexpr_to_vortex_expression(arg_node, arena, schema)?; + Some(get_item(name.as_str(), inner)) + }, + + // --- string LIKE-based predicates (PR-2.7): starts_with / ends_with / contains --- + // `col.str.starts_with(p)` → `like(col, lit("p%"))`. + // `col.str.ends_with(s)` → `like(col, lit("%s"))`. + // `col.str.contains{literal:true, ..}(sub)` → `like(col, lit("%sub%"))`. + // + // `contains` with `literal: false` is a regex pattern; Vortex's `like` is + // SQL-LIKE-style (only `%` / `_` wildcards), not regex, so we refuse that + // shape. Other StringExpr variants (Lowercase/Uppercase/Slice/Strptime/...) + // either don't return Boolean or have no Vortex equivalent — refuse via the + // inner `_ => return None`. + // + // The needle is escaped via `bytes_to_like_literal` which refuses if the + // bytes contain SQL-LIKE special characters (`%`, `_`, `\`) since those + // would change the LIKE semantics — *widening* the predicate. Widening is + // technically SOUND under the multi-scan PARTIAL_FILTER post-decode reapply + // (Vortex returns a superset, then Polars trims), but it defeats the + // pushdown's perf win. Same reasoning as the deleted legacy convertor at + // `polars-vortex/src/read/predicate.rs` (pre-PR-2.6 `bytes_to_like_literal`). + // + // **Utf8 input gate (PR-2.7 cycle-1 must-fix #3).** Vortex's `Like` kernel + // (`vortex-array/src/scalar_fn/fns/like/mod.rs:124-132 return_dtype`) + // `vortex_bail!`s with "Cannot apply 'like' to non-Utf8 input" at scan-time + // if `input[0]` isn't Utf8 (Vortex's String dtype). Non-String column inputs + // (e.g., a struct field access producing Int32, or a Cast whose source + // isn't String) would silently push a `like(, )` + // that fails at scan-time, violating the always-SAFE-fallback contract. + // Same bug class as PR-2.3 cycle-1 CAST cross-kind. Refuse pushdown when + // the resolved column dtype isn't `DataType::String`. + #[cfg(feature = "strings")] + AExpr::Function { + input, + function: IRFunctionExpr::StringExpr(string_fn), + .. + } => { + // SCHEMA-GATE: Utf8 input (refuse non-String column before needle extraction; see must-fix #3). + if input.len() != 2 { + return None; + } + // Utf8 input gate (must-fix #3) — refuse non-String column inputs + // BEFORE doing any needle extraction or pattern construction work. + let s = schema?; + let col_dt = resolve_inner_dtype(input[0].node(), arena, s)?; + if col_dt != DataType::String { + return None; + } + // Extract the literal needle string. Only direct Scalar literals are + // handled here; folded expressions (e.g., `lit("a") + lit("b")`) would + // need `constant_evaluate` — out of scope for the foundation port. + let needle_str: String = match arena.get(input[1].node()) { + AExpr::Literal(LiteralValue::Scalar(scalar)) => match scalar.value() { + AnyValue::String(s) => s.to_string(), + AnyValue::StringOwned(s) => s.to_string(), + _ => return None, + }, + _ => return None, + }; + let escaped = bytes_to_like_literal(needle_str.as_bytes())?; + let pattern = match string_fn { + IRStringFunction::StartsWith => format!("{escaped}%"), + IRStringFunction::EndsWith => format!("%{escaped}"), + #[cfg(feature = "regex")] + IRStringFunction::Contains { + literal: true, + strict: _, + } => format!("%{escaped}%"), + // Contains { literal: false, .. } → regex, not LIKE; refuse. + // Other StringExpr variants (Lowercase, Slice, Strptime, etc.) → + // not in this predicate-pushdown scope; refuse. + _ => return None, + }; + let col_expr = aexpr_to_vortex_expression(input[0].node(), arena, schema)?; + Some(like( + col_expr, + lit(VortexScalar::utf8(pattern, Nullability::NonNullable)), + )) + }, + + // --- CAST (PR-2.3 / PR-13.3) --- + // `col.cast(Int64) > 100` against an Int32 column pushes down as + // `gt(cast(get_item("col", root()), DType::Primitive(I64, Nullable)), lit(100i64))`. + // + // Two gates protect the convertor's always-SAFE-fallback contract (PR-2.3 + // cycle-1 must-fix from gauntlet; same bug class as PR-2.2 cycle-1 M2 Plus): + // + // 1. `CastOptions::Strict` only. Polars `NonStrict` (overflow→null) and + // `Overflowing` (wrap) silently diverge from Vortex's fail-on-overflow + // `checked_add`-style cast semantics — pushing those down would convert + // Polars's silent-or-null behavior into a scan-time `ComputeError`. + // The user's query semantics must dominate; refuse pushdown so the + // legacy path / post-decode reapply handles non-Strict. + // 2. Source-dtype-kind compatibility check via `cast_kind_compatible`. + // Vortex's per-array `CastKernel` impls (verified in + // `vortex-array/src/arrays/{primitive,bool,varbinview}/compute/cast.rs`) + // are **strictly within-kind**: Primitive↔Primitive only, Bool↔Bool + // only, Utf8↔Utf8 only (also Binary↔Binary). Cross-kind casts return + // `Ok(None)` from the kernel, which `cast/mod.rs:120` then + // `vortex_bail!`s on with "No CastKernel". The convertor refuses + // cross-kind so the legacy path / post-decode reapply handles them. + // + // The `?`-propagation on both gates yields None for any unsupported + // shape — always SAFE. + AExpr::Cast { + expr: inner, + dtype: target_pl, + options, + } => { + // SCHEMA-GATE: source-kind compat + Strict-options only (refuses cross-kind + non-Strict casts). + if !options.is_strict() { + return None; + } + let target = polars_dtype_to_vortex_dtype(target_pl)?; + // Source-dtype-kind gate. Requires schema; without schema we cannot + // resolve the inner expression's dtype, so conservatively refuse. + let s = schema?; + let source_pl = resolve_inner_dtype(*inner, arena, s)?; + if !cast_kind_compatible(&source_pl, target_pl) { + return None; + } + let child = aexpr_to_vortex_expression(*inner, arena, schema)?; + Some(cast(child, target)) + }, + + // --- Ternary (PR-2.7) --- + // `when(predicate).then(truthy).otherwise(falsy)` maps to Vortex's + // `case_when(condition, then_value, else_value)`. All three children recurse + // through `aexpr_to_vortex_expression`; any unsupported subtree fails + // closed via `?`-propagation. + // + // **Explicit THEN/ELSE pairwise-dtype gate (PR-2.7 cycle-1 must-fix #2).** + // Vortex's `case_when` builder + // (`vortex-array/src/expr/expression.rs:45-62 try_new`) validates only + // arity at builder time; the dtype unification check fires later in + // `CaseWhen::return_dtype` + // (`vortex-array/src/scalar_fn/fns/case_when.rs:185-191`) which + // `vortex_bail!`s with "CaseWhen THEN and ELSE dtypes must match" at + // scan/pruning time. The recursion's CAST/Plus pairwise-PType gates + // protect their own subtrees but NOT the Ternary truthy-vs-falsy + // mismatch. Cross-dtype truthy/falsy (e.g., nested Ternary with + // col_int32 vs col_int64 when TYPE_COERCION is off) bypass any inner + // gate. Same bug class as PR-2.2 cycle-1 Plus must-fix. Refuse pushdown + // when truthy and falsy dtypes disagree. + // + // **Reachability note**: a Ternary at the filter root must be Boolean-typed + // (Polars validates), so the cross-dtype hazard is gated to nested Ternary + // inside a complex predicate. The gate is held to must-fix per the existing + // precedent (Plus / comparison gates) because the always-SAFE-fallback + // contract is binary; reduced reach is mitigation, not category change. + AExpr::Ternary { + predicate, + truthy, + falsy, + } => { + // SCHEMA-GATE: THEN/ELSE pairwise-dtype equality (mirror of Plus/comparison pattern; PR-2.7 cycle-1 must-fix #2). + let s = schema?; + let then_dt = resolve_inner_dtype(*truthy, arena, s)?; + let else_dt = resolve_inner_dtype(*falsy, arena, s)?; + if then_dt != else_dt { + return None; + } + let condition = aexpr_to_vortex_expression(*predicate, arena, schema)?; + let then_value = aexpr_to_vortex_expression(*truthy, arena, schema)?; + let else_value = aexpr_to_vortex_expression(*falsy, arena, schema)?; + Some(case_when(condition, then_value, else_value)) + }, + + // --- unsupported shapes (residual) --- + // Other Function variants (temporal etc.) → PR-2.5 (slipped to Deferred). + // Sort / Gather / Filter / Agg / AnonymousFunction / Over / Rolling etc. all + // fall through to residual unconditionally. + _ => None, + } +} + +/// Per-minterm file-vs-virtual split for the Vortex pushdown convertor (PR-2.8). +/// +/// Walks the top-level conjunction of `root_node` via [`MintermIter`] and converts +/// only the minterms whose leaf column references are all **file** columns — +/// i.e., not present in `virtual_cols`. Virtual-column-touching minterms +/// (hive partition columns, row_index, include_file_paths) are left for the +/// multi-scan layer's `PARTIAL_FILTER` reapply. +/// +/// Returns `Some(Expression)` with the AND-collected file-only minterms, or +/// `None` if either (a) no top-level conjunct is file-only, OR (b) every +/// file-only conjunct individually failed [`aexpr_to_vortex_expression`]. +/// The contract is always-SAFE: the caller treats `None` as "no convertor +/// pushdown for this predicate" and lets the post-decode residual reapply +/// the full predicate. +/// +/// **When `virtual_cols` is empty**, this is equivalent to converting each +/// top-level conjunct independently and AND-collecting — which is STRICTLY +/// BETTER than calling [`aexpr_to_vortex_expression`] on the whole tree +/// because partial conversion is now possible (a predicate like +/// `a == 1 AND unsupported_op(b)` pushes the `a == 1` part instead of +/// refusing entirely). Callers can use this unconditionally as the entrypoint. +/// +/// **Wiring**: replaces the all-or-nothing virtual-column guard at +/// `polars-stream::physical_plan::lower_ir.rs::FileScanIR::Vortex` (PR-2.2 +/// cycle-1 must-fix M1's conservative refuse). The PR-2.8 amend turns that +/// guard into a per-column split so hive-partitioned / row-indexed Vortex +/// scans still get Vortex zone pruning on the file-column part of the +/// predicate. +/// +/// **Why MintermIter**: it walks top-level `And`/`LogicalAnd` conjuncts +/// (`crate::plans::aexpr::minterm_iter`'s docstring) — the natural unit for +/// CNF-style partial pushdown. A top-level OR is returned as a single minterm +/// (correct: if `(file_col == 1) OR (virtual_col == 2)`, the whole OR must +/// fall to residual because Vortex can't see the virtual_col side of the OR). +pub fn aexpr_file_minterms_to_vortex_expression( + root_node: Node, + arena: &Arena, + schema: Option<&Schema>, + virtual_cols: &PlHashSet, +) -> Option { + let exprs: Vec = MintermIter::new(root_node, arena) + .filter(|node| { + // File-only iff NO leaf-column-name appears in virtual_cols. + !aexpr_to_leaf_names_iter(*node, arena).any(|name| virtual_cols.contains(name)) + }) + .filter_map(|node| aexpr_to_vortex_expression(node, arena, schema)) + .collect(); + and_collect(exprs) +} + +/// Convert a Polars [`DataType`] to a Vortex [`DType`] for the CAST arm. Returns `None` +/// for dtypes Vortex doesn't natively represent as a `Primitive`/`Bool`/`Utf8` (Decimal, +/// Object, Categorical/Enum, temporal Extension types). Nullability defaults to +/// `Nullable` because Polars's runtime allows null in any column unless statically proven +/// otherwise; the runtime nullable Vortex dtype is a strict superset and CAST to a +/// nullable type is always safe. +/// +/// **Scope**: PR-2.3 covers only primitive numeric + Bool + Utf8 CAST targets. Decimal +/// is deliberately refused (Vortex `DType::Decimal` requires `DecimalDType(precision, +/// scale)` and polars-vortex hasn't validated CAST-via-`vortex::expr::cast` interactions +/// at the Vortex layer). Other targets are PR-2.4/.5 scope or permanent residuals. +fn polars_dtype_to_vortex_dtype(dt: &DataType) -> Option { + use DataType::*; + let nullable = Nullability::Nullable; + Some(match dt { + Boolean => DType::Bool(nullable), + Int8 => DType::Primitive(PType::I8, nullable), + Int16 => DType::Primitive(PType::I16, nullable), + Int32 => DType::Primitive(PType::I32, nullable), + Int64 => DType::Primitive(PType::I64, nullable), + UInt8 => DType::Primitive(PType::U8, nullable), + UInt16 => DType::Primitive(PType::U16, nullable), + UInt32 => DType::Primitive(PType::U32, nullable), + UInt64 => DType::Primitive(PType::U64, nullable), + Float32 => DType::Primitive(PType::F32, nullable), + Float64 => DType::Primitive(PType::F64, nullable), + String => DType::Utf8(nullable), + // All other dtypes fall through to None via the catch-all. Notably: + // - Decimal: deliberately NOT given an explicit arm even with the `dtype-decimal` + // feature enabled. Vortex `DType::Decimal(DecimalDType(precision, scale), nullable)` + // requires usize ↔ u8 narrow + validation per the project BAN against `as` casts + // on Vortex Decimal precision/scale; deferred until polars-vortex validates + // `vortex::expr::cast` interactions for Decimal scale/precision. + // - Int128/UInt128: NOT in the numeric set because Vortex's `PType` ceiling is + // I64/U64/F64 and `polars_scalar_to_vortex` has no Int128/UInt128 literal arms + // (mirrors `is_vortex_numeric_dtype`'s exclusion for the same reason). + // - Object / Categorical / Enum / Date / Datetime / Time / Duration / Binary / + // List / Struct / Array / Null / Unknown: not in PR-2.3 scope; PR-2.4/.5 may + // add some (e.g., struct field access in PR-2.4). + _ => return None, + }) +} + +/// Resolve the Polars [`DataType`] of `node` for CAST source-dtype gating. +/// +/// Returns `None` for AExpr shapes we cannot trivially type-check (nested CAST +/// chains, expressions producing dtype-dependent output, AnonymousFunction, etc.) — +/// the CAST arm's `?`-propagation then drops the cast to residual, which is always +/// SAFE. +/// +/// Supports the shapes the convertor's own CAST arm cares about: `Column`, +/// `Literal(Scalar)`, and `Cast` (recursive — the cast's output dtype is its target). +/// Comparisons (Eq/Lt/etc.) and Boolean-function outputs (IsNull/IsNotNull/Not) +/// produce Boolean output. Plus produces output matching its operands' numeric type +/// (delegate to the recursive operand_is_numeric machinery). +fn resolve_inner_dtype(node: Node, arena: &Arena, schema: &Schema) -> Option { + match arena.get(node) { + AExpr::Column(name) => schema.get(name).cloned(), + AExpr::Literal(LiteralValue::Scalar(s)) => Some(s.dtype().clone()), + AExpr::Cast { + dtype: target_pl, .. + } => Some(target_pl.clone()), + AExpr::BinaryExpr { left, op, right } => match op { + Operator::Eq + | Operator::NotEq + | Operator::Lt + | Operator::LtEq + | Operator::Gt + | Operator::GtEq + | Operator::EqValidity + | Operator::NotEqValidity + | Operator::LogicalAnd + | Operator::LogicalOr => Some(DataType::Boolean), + // Plus output dtype matches the operand dtypes. Self-contained verification + // (PR-2.4 cycle-2 should-fix F-RESOLVE-PLUS-LHS-DELEGATION-001): don't rely + // on the convertor's Plus arm gate having fired — verify lhs == rhs here so + // any future caller of `resolve_inner_dtype` gets a trustworthy answer. + // Recursive: nested Plus chains `(a + b) + c` resolve correctly. + Operator::Plus => { + let l = resolve_inner_dtype(*left, arena, schema)?; + let r = resolve_inner_dtype(*right, arena, schema)?; + if l == r { Some(l) } else { None } + }, + _ => None, + }, + AExpr::Function { + function: + IRFunctionExpr::Boolean( + IRBooleanFunction::IsNull + | IRBooleanFunction::IsNotNull + | IRBooleanFunction::Not, + ), + .. + } => Some(DataType::Boolean), + // Struct field access — resolve to the inner struct's field dtype. + // Used by the CAST source-kind gate when a Cast wraps a struct field access, + // and by the StructExpr arm's recursive gate to chain through nested structs. + #[cfg(feature = "dtype-struct")] + AExpr::Function { + input, + function: IRFunctionExpr::StructExpr(IRStructFunction::FieldByName(name)), + .. + } => { + let arg_node = input.first().map(|expr_ir| expr_ir.node())?; + let inner = resolve_inner_dtype(arg_node, arena, schema)?; + if let DataType::Struct(fields) = &inner { + fields + .iter() + .find(|f| f.name() == name) + .map(|f| f.dtype().clone()) + } else { + None + } + }, + // Ternary self-contained verification (PR-2.7 cycle-2 should-fix #1). + // Without this arm, the Ternary convertor's pairwise-dtype gate + // (`resolve_inner_dtype(*truthy)?`) would unconditionally return None + // for any NESTED Ternary truthy/falsy and `?`-propagate → the outer + // Ternary refuses pushdown even when the inner Ternary's truthy/falsy + // dtypes match. Fail-closed-safe but a coverage regression vs. the + // pre-gate baseline. The recursive resolution mirrors the Plus arm's + // self-contained verification at L726-730: walk truthy + falsy, + // require equality (Vortex's `case_when` builder requires THEN and + // ELSE share a return_dtype), unresolvable shapes fall through to + // None → conservative refuse. Note: this arm makes `resolve_inner_dtype` + // walk a Ternary tree even when the OUTER predicate isn't a Ternary + // (e.g., `eq(when().then().otherwise(), 5)` resolves the lhs by + // recursing through the Ternary). Acceptable: typical predicate depth + // is shallow. + AExpr::Ternary { truthy, falsy, .. } => { + let t = resolve_inner_dtype(*truthy, arena, schema)?; + let e = resolve_inner_dtype(*falsy, arena, schema)?; + if t == e { Some(t) } else { None } + }, + _ => None, + } +} + +/// Does `dtype` contain a struct field named `name`? Used by the PR-2.4 StructField gate. +/// +/// Returns `false` for non-Struct dtypes. Conservatively returns `false` if the dtype +/// isn't a Struct so the caller's None-fallback drops the StructField arm to residual. +#[cfg(feature = "dtype-struct")] +fn struct_field_exists(dtype: &DataType, name: &polars_utils::pl_str::PlSmallStr) -> bool { + if let DataType::Struct(fields) = dtype { + fields.iter().any(|f| f.name() == name) + } else { + false + } +} + +/// Is the (source, target) CAST pair representable by Vortex's per-array +/// `CastKernel` impls? Verified against vortex-array 0.70.0: +/// +/// - `Primitive::cast` returns `Ok(None)` for non-Primitive targets +/// (`arrays/primitive/compute/cast.rs:62-64`) +/// - `Bool::cast` returns `Ok(None)` for non-Bool targets +/// (`arrays/bool/compute/cast.rs:41-43`) +/// - `VarBinView::cast` returns `Ok(None)` unless source AND target are both Utf8 +/// (or both Binary) (`arrays/varbinview/compute/cast.rs:60-62`) +/// +/// Cross-kind casts (Primitive→Bool, Bool→Utf8, Utf8→Primitive, etc.) cause +/// `cast/mod.rs:120` to `vortex_bail!("No CastKernel to cast canonical array {} from +/// {} to {}")` at scan-time, which propagates as a hard `ComputeError`. The +/// convertor refuses cross-kind so the residual / legacy path handles them. +/// +/// Same-kind narrowing (Int32→Int8, Int64→Int32) is allowed; Vortex's +/// `values_fit_in` check at `cast.rs:85-91` produces a scan-time `vortex_bail!` on +/// out-of-range values. This is acceptable under `CastOptions::Strict` semantics +/// (which is the only mode we push down — the cycle-1 gate above). +fn cast_kind_compatible(source: &DataType, target: &DataType) -> bool { + use DataType::*; + matches!((source, target), (Boolean, Boolean) | (String, String)) + || (is_vortex_numeric_dtype(source) && is_vortex_numeric_dtype(target)) +} + +/// Schema-aware operand type check: determines whether `node`'s resolved dtype is +/// `Boolean`. Used by the And/Or/Not schema gate to refuse bitwise-on-integer pushdown. +/// +/// Conservatively returns `false` when the dtype can't be resolved (unknown column, +/// nested expression we can't trivially type-check). The caller's None-fallback then +/// drops the And/Or/Not arm to residual — always SOUND. +fn operand_is_bool(node: Node, arena: &Arena, schema: &Schema) -> bool { + match arena.get(node) { + AExpr::Column(name) => matches!(schema.get(name), Some(DataType::Boolean)), + AExpr::Literal(LiteralValue::Scalar(s)) => matches!(s.dtype(), DataType::Boolean), + AExpr::BinaryExpr { left, op, right } => match op { + // Comparisons unconditionally produce Boolean. + Operator::Eq + | Operator::NotEq + | Operator::Lt + | Operator::LtEq + | Operator::Gt + | Operator::GtEq + | Operator::EqValidity + | Operator::NotEqValidity + // LogicalAnd/LogicalOr are the IR-level "logical" form; by construction + // their inputs are boolean-typed. + | Operator::LogicalAnd + | Operator::LogicalOr => true, + // And/Or follow operand dtype: bool-input → bool-output (logical), int-input + // → int-output (bitwise). Recurse on inputs to determine. + Operator::And | Operator::Or => { + operand_is_bool(*left, arena, schema) && operand_is_bool(*right, arena, schema) + }, + // Arithmetic and bitwise Xor produce non-bool output. + _ => false, + }, + // Enumerate the IRBooleanFunction variants we know produce Boolean array output. + // `Not` is bitwise-or-logical (output dtype = input dtype), so we recurse on its + // arg. Other variants (Any/All/IsEmpty produce scalar bool, not array bool; the + // outer convertor returns None for those anyway via the `_ => None` arm in the + // Function match) are conservatively treated as non-bool here — the convertor's + // own arm-level None fallback is the second line of defense. + AExpr::Function { + input, + function: IRFunctionExpr::Boolean(bf), + .. + } => match bf { + IRBooleanFunction::IsNull | IRBooleanFunction::IsNotNull => true, + IRBooleanFunction::Not => input + .first() + .map(|expr_ir| operand_is_bool(expr_ir.node(), arena, schema)) + .unwrap_or(false), + _ => false, + }, + _ => false, + } +} + +/// Schema-aware operand type check: determines whether `node`'s resolved dtype is a +/// numeric primitive compatible with Vortex's `checked_add`. Used by the Plus gate to +/// refuse pushdown on String / Bool / Date / Struct / List operands that Vortex's +/// `Binary::coerce_args` would `vortex_bail!` on at scan-time. +/// +/// Conservatively returns `false` when the dtype can't be resolved (unknown column, +/// nested expression). The caller's None-fallback drops the Plus arm to residual. +fn operand_is_numeric(node: Node, arena: &Arena, schema: &Schema) -> bool { + match arena.get(node) { + AExpr::Column(name) => schema.get(name).is_some_and(is_vortex_numeric_dtype), + AExpr::Literal(LiteralValue::Scalar(s)) => is_vortex_numeric_dtype(s.dtype()), + // Recursive `Plus` produces numeric output if both operands are numeric; this + // lets `(col + 1) + 1` push down (operands at each level are numeric). + AExpr::BinaryExpr { + left, + op: Operator::Plus, + right, + } => operand_is_numeric(*left, arena, schema) && operand_is_numeric(*right, arena, schema), + // CAST to a numeric target produces numeric output. This lets + // `Cast(int32_col, Int64) + lit_i64` push down (the inner Cast aligns the dtype + // for Vortex's same-PType `checked_add` requirement). + AExpr::Cast { dtype, .. } => is_vortex_numeric_dtype(dtype), + _ => false, + } +} + +/// Is `dt` one of the Vortex-primitive dtypes that `checked_add` accepts? +/// Mirrors Vortex's `is_primitive() && eq_ignore_nullability` precondition (numeric +/// integer + float). Decimals are deliberately NOT numeric here — Vortex's `Decimal` is +/// primitive but `checked_add` on Decimals has scale/precision interactions polars-vortex +/// hasn't validated; refuse for safety. +fn is_vortex_numeric_dtype(dt: &DataType) -> bool { + use DataType::*; + // List mirrors Vortex's `PType` ceiling (I8/I16/I32/I64/F16/F32/F64 + unsigned) + // AND `polars_vortex::read::predicate::polars_scalar_to_vortex`'s supported literal + // arms. `Int128` and `UInt128` are intentionally excluded: both exist in polars-core + // (`DataType::Int128`/`UInt128`) but neither is a Vortex primitive AND + // `polars_scalar_to_vortex` has no Int128/UInt128 arms, so a literal-of-that-type + // would fall through `?`-propagation anyway. Listing them here would be misleading + // (suggesting support that isn't wired). + matches!( + dt, + Int8 | Int16 | Int32 | Int64 | UInt8 | UInt16 | UInt32 | UInt64 | Float32 | Float64 + ) +} + +/// Convert a Polars [`LiteralValue`] to a Vortex literal [`Expression`]. +/// +/// Only [`LiteralValue::Scalar`] is handled in the PR-2.1 foundation; [`LiteralValue::Dyn`], +/// [`LiteralValue::Series`], and [`LiteralValue::Range`] fall through to `None`. `Dyn` +/// requires type-inference context (target dtype) which the convertor doesn't have access +/// to without a schema; `Series` and `Range` aren't sensible predicates anyway. +fn convert_literal(lv: &LiteralValue) -> Option { + match lv { + LiteralValue::Scalar(scalar) => Some(lit( + polars_vortex::read::predicate::polars_scalar_to_vortex(scalar)?, + )), + // Dyn requires materialization against a target dtype; not in foundation scope. + // Series / Range aren't valid predicate literals. + _ => None, + } +} + +/// Validate that `bytes` is valid UTF-8 and free of SQL-LIKE special characters +/// (`%`, `_`, `\`). Returns the borrowed `&str` so the caller can build a pattern. +/// Returning `None` falls back to the residual filter, which is always correct. +/// +/// Refuses pushdown when bytes contain `%` or `_` because Vortex's LIKE would +/// interpret those as wildcards — *widening* the predicate. Widening is sound +/// under the multi-scan `PARTIAL_FILTER` post-decode reapply (Vortex returns a +/// superset, then Polars trims), but defeats the perf win of pushing down. +/// Backslash is LIKE's escape character; same reasoning. +/// +/// Resurrected verbatim from the legacy +/// `polars-vortex/src/read/predicate.rs::bytes_to_like_literal` (deleted in +/// PR-2.6's Option B → A cutover); the PR-2.7 amend re-introduces it inline in +/// the convertor so polars-plan owns its own LIKE-pattern escaping. Gated on +/// `strings` since the sole caller is the StringExpr arm (PR-2.7 cycle-2 +/// should-fix #2 — avoid unused-fn warning under `--no-default-features +/// --features vortex`). +#[cfg(feature = "strings")] +fn bytes_to_like_literal(bytes: &[u8]) -> Option<&str> { + let s = std::str::from_utf8(bytes).ok()?; + if s.contains('%') || s.contains('_') || s.contains('\\') { + return None; + } + Some(s) +} + +#[cfg(test)] +mod tests { + //! Unit tests for the PR-2.1 / PR-13.1 foundation shapes + PR-2.2 / PR-13.2 Plus + + //! the schema-gate refusal paths (And/Or/Not bitwise-on-int + Plus on non-numeric). + //! + //! Each test builds a small AExpr tree directly in an `Arena` (no DSL involved), + //! calls [`aexpr_to_vortex_expression`], and asserts the conversion returns `Some`. For + //! most shapes the Vortex expression's structure is opaque to the test — we trust the + //! builder helpers and assert `.is_some()` / `.is_none()` only. The cycle-1 must-fix + //! escalation around tautological tests is addressed for the load-bearing Plus shape + //! via `shape_plus_arithmetic_structural`, which inspects `Display::fmt`'s SQL-form + //! output — a paste-swap bug (`Plus → checked_mul`) would be caught there. + //! Unsupported shapes additionally assert `None`. + use polars_core::chunked_array::cast::CastOptions; + use polars_core::prelude::{AnyValue, DataType}; + use polars_core::scalar::Scalar; + use polars_utils::arena::Arena; + use polars_utils::pl_str::PlSmallStr; + + use super::*; + use crate::plans::lit::DynLiteralValue; + use crate::plans::{ExprIR, OutputName}; + use crate::prelude::FunctionOptions; + + /// Helper: add an `AExpr::Column(name)` to the arena and return its node id. + fn col(arena: &mut Arena, name: &str) -> Node { + arena.add(AExpr::Column(PlSmallStr::from(name))) + } + + /// Helper: add an `AExpr::Literal(Scalar())` literal. + fn lit_i32(arena: &mut Arena, value: i32) -> Node { + let scalar = Scalar::new(DataType::Int32, AnyValue::Int32(value)); + arena.add(AExpr::Literal(LiteralValue::Scalar(scalar))) + } + + /// Helper: build a `BinaryExpr` with the given operator. + fn binop(arena: &mut Arena, left: Node, op: Operator, right: Node) -> Node { + arena.add(AExpr::BinaryExpr { left, op, right }) + } + + /// Helper: build an `AExpr::Function` with a Boolean function variant. + fn boolean_fn(arena: &mut Arena, bf: IRBooleanFunction, arg: Node) -> Node { + // ExprIR::new takes a node + OutputName; for predicate-arena tests we use a dummy + // empty alias (the OutputName isn't consulted by the convertor). + let expr_ir = ExprIR::new(arg, OutputName::Alias(PlSmallStr::EMPTY)); + arena.add(AExpr::Function { + input: vec![expr_ir], + function: IRFunctionExpr::Boolean(bf), + options: FunctionOptions::default(), + }) + } + + // === Shape coverage tests (15 shapes: 13 foundation + Plus + Cast) === + + #[test] + fn shape_column() { + let mut arena = Arena::new(); + let n = col(&mut arena, "a"); + assert!(aexpr_to_vortex_expression(n, &arena, None).is_some()); + } + + #[test] + fn shape_literal_scalar() { + let mut arena = Arena::new(); + let n = lit_i32(&mut arena, 42); + assert!(aexpr_to_vortex_expression(n, &arena, None).is_some()); + } + + // The comparison tests below pass `Some(&schema_a_b_int32())` because the + // PR-2.4 cycle-2 should-fix F-COMPARE-CROSS-PTYPE-001 added a comparison + // pairwise-equal-PType gate: comparisons require schema to verify operand + // dtypes match (mirroring the Plus gate's discipline). Without schema, the + // gate conservatively refuses — see `shape_eq_without_schema_returns_none`. + + #[test] + fn shape_eq() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let l = lit_i32(&mut arena, 42); + let n = binop(&mut arena, c, Operator::Eq, l); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + #[test] + fn shape_not_eq() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let l = lit_i32(&mut arena, 42); + let n = binop(&mut arena, c, Operator::NotEq, l); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + #[test] + fn shape_lt() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let l = lit_i32(&mut arena, 42); + let n = binop(&mut arena, c, Operator::Lt, l); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + #[test] + fn shape_lt_eq() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let l = lit_i32(&mut arena, 42); + let n = binop(&mut arena, c, Operator::LtEq, l); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + #[test] + fn shape_gt() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let l = lit_i32(&mut arena, 42); + let n = binop(&mut arena, c, Operator::Gt, l); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + #[test] + fn shape_gt_eq() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let l = lit_i32(&mut arena, 42); + let n = binop(&mut arena, c, Operator::GtEq, l); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + /// Comparison without schema → conservative refuse (PR-2.4 cycle-2 comparison + /// pairwise-equal-PType gate from F-COMPARE-CROSS-PTYPE-001). + #[test] + fn shape_eq_without_schema_returns_none() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let l = lit_i32(&mut arena, 42); + let n = binop(&mut arena, c, Operator::Eq, l); + assert!(aexpr_to_vortex_expression(n, &arena, None).is_none()); + } + + /// Comparison on cross-PType operands (`Int32 == Int64`) — refused by the + /// pairwise-equal-PType gate. Without the gate, Vortex's `Binary::return_dtype` + /// would `vortex_bail!` at scan-time. PR-2.4 cycle-2 F-COMPARE-CROSS-PTYPE-001. + #[test] + fn shape_eq_cross_ptype_returns_none() { + use polars_core::prelude::AnyValue; + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); // Int32 + let l = arena.add(AExpr::Literal(LiteralValue::Scalar(Scalar::new( + DataType::Int64, + AnyValue::Int64(42), + )))); + let n = binop(&mut arena, c, Operator::Eq, l); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// Comparison on cross-PType (`Int32 < Float64`) — refused by the gate. + #[test] + fn shape_lt_cross_ptype_returns_none() { + use polars_core::prelude::AnyValue; + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); // Int32 + let l = arena.add(AExpr::Literal(LiteralValue::Scalar(Scalar::new( + DataType::Float64, + AnyValue::Float64(1.0), + )))); + let n = binop(&mut arena, c, Operator::Lt, l); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// Helper: build a small schema with Int32 columns `a` and `b` for And/Or tests. + /// The And/Or schema gate refuses pushdown when an operand is non-bool, but + /// comparison-shape operands (eq/lt/etc.) produce Boolean output and pass the gate. + fn schema_a_b_int32() -> Schema { + let mut s = Schema::default(); + s.with_column(PlSmallStr::from("a"), DataType::Int32); + s.with_column(PlSmallStr::from("b"), DataType::Int32); + s + } + + /// Helper: schema with `a: Boolean` for the Not test. + fn schema_a_bool() -> Schema { + let mut s = Schema::default(); + s.with_column(PlSmallStr::from("a"), DataType::Boolean); + s + } + + #[test] + fn shape_and() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let l = lit_i32(&mut arena, 42); + let eq_node = binop(&mut arena, c, Operator::Eq, l); + let c2 = col(&mut arena, "b"); + let l2 = lit_i32(&mut arena, 7); + let lt_node = binop(&mut arena, c2, Operator::Lt, l2); + let n = binop(&mut arena, eq_node, Operator::And, lt_node); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + #[test] + fn shape_or() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let l = lit_i32(&mut arena, 42); + let eq_node = binop(&mut arena, c, Operator::Eq, l); + let c2 = col(&mut arena, "b"); + let l2 = lit_i32(&mut arena, 7); + let lt_node = binop(&mut arena, c2, Operator::Lt, l2); + let n = binop(&mut arena, eq_node, Operator::Or, lt_node); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + /// And without a schema → conservative refuse (schema gate's None branch). + #[test] + fn shape_and_without_schema_returns_none() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let l = lit_i32(&mut arena, 42); + let eq_node = binop(&mut arena, c, Operator::Eq, l); + let c2 = col(&mut arena, "b"); + let l2 = lit_i32(&mut arena, 7); + let lt_node = binop(&mut arena, c2, Operator::Lt, l2); + let n = binop(&mut arena, eq_node, Operator::And, lt_node); + assert!(aexpr_to_vortex_expression(n, &arena, None).is_none()); + } + + /// And on integer-bitwise operands → schema gate refuses (PR-2.1 cycle-1 should-fix). + #[test] + fn shape_and_bitwise_int_returns_none() { + // `col_a & col_b` where both are Int32 — Polars `Operator::And` is bitwise here, + // not logical. The gate must refuse to avoid emitting a Vortex `and(int, int)`. + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); + let b = col(&mut arena, "b"); + let n = binop(&mut arena, a, Operator::And, b); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + #[test] + fn shape_logical_and() { + // LogicalAnd is the short-circuit form Polars uses internally for boolean + // simplification; it should map to vortex::expr::and same as Operator::And. + // Inner Eq/Lt comparisons fire the PR-2.4 cycle-2 comparison gate, so schema + // is required. + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let l = lit_i32(&mut arena, 42); + let left = binop(&mut arena, c, Operator::Eq, l); + let c2 = col(&mut arena, "b"); + let l2 = lit_i32(&mut arena, 7); + let right = binop(&mut arena, c2, Operator::Lt, l2); + let n = binop(&mut arena, left, Operator::LogicalAnd, right); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + #[test] + fn shape_logical_or() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let l = lit_i32(&mut arena, 42); + let left = binop(&mut arena, c, Operator::Eq, l); + let c2 = col(&mut arena, "b"); + let l2 = lit_i32(&mut arena, 7); + let right = binop(&mut arena, c2, Operator::Lt, l2); + let n = binop(&mut arena, left, Operator::LogicalOr, right); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + #[test] + fn shape_is_null() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let n = boolean_fn(&mut arena, IRBooleanFunction::IsNull, c); + assert!(aexpr_to_vortex_expression(n, &arena, None).is_some()); + } + + #[test] + fn shape_is_not_null() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let n = boolean_fn(&mut arena, IRBooleanFunction::IsNotNull, c); + assert!(aexpr_to_vortex_expression(n, &arena, None).is_some()); + } + + #[test] + fn shape_not() { + // Wraps `eq(col_a, 42)` which is a Boolean comparison, so the schema gate passes + // (the comparison output is Boolean). No schema needed because the inner Eq + // doesn't fire the gate (only And/Or do). + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let l = lit_i32(&mut arena, 42); + let eq_node = binop(&mut arena, c, Operator::Eq, l); + let n = boolean_fn(&mut arena, IRBooleanFunction::Not, eq_node); + // Not needs a schema to gate; passing schema with no-op type info works because + // `operand_is_bool` sees the inner BinaryExpr is a comparison → Boolean output. + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + /// Not without schema → conservative refuse. + #[test] + fn shape_not_without_schema_returns_none() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let l = lit_i32(&mut arena, 42); + let eq_node = binop(&mut arena, c, Operator::Eq, l); + let n = boolean_fn(&mut arena, IRBooleanFunction::Not, eq_node); + assert!(aexpr_to_vortex_expression(n, &arena, None).is_none()); + } + + /// Not on an integer column → schema gate refuses (Polars `Not` is bitwise on ints). + #[test] + fn shape_not_bitwise_int_returns_none() { + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); + let n = boolean_fn(&mut arena, IRBooleanFunction::Not, a); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// Not on a Boolean column → schema gate passes. + #[test] + fn shape_not_bool_column_passes() { + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); + let n = boolean_fn(&mut arena, IRBooleanFunction::Not, a); + let schema = schema_a_bool(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + /// Plus arithmetic — PR-2.2 ships this via `vortex::expr::checked_add`. Numeric + /// operands required (the cycle-1 must-fix gate). + #[test] + fn shape_plus_arithmetic() { + // `(col_a + 1) == 5` — typical PR-13.2 acceptance shape per the plan. + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); + let one = lit_i32(&mut arena, 1); + let a_plus_1 = binop(&mut arena, a, Operator::Plus, one); + let five = lit_i32(&mut arena, 5); + let n = binop(&mut arena, a_plus_1, Operator::Eq, five); + // Plus numeric gate fires; supply Int32 schema for `a`. + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + // === Unsupported-shapes-return-None coverage === + + /// PR-2.2 ships Plus → checked_add when operands are numeric and schema is provided. + /// Pre-PR-2.2 this test asserted None; now it must assert Some. + #[test] + fn shape_plus_ships_in_pr_2_2() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let l = lit_i32(&mut arena, 1); + let n = binop(&mut arena, c, Operator::Plus, l); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + /// Plus without schema → conservative refuse (Plus gate's None branch). + #[test] + fn shape_plus_without_schema_returns_none() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let l = lit_i32(&mut arena, 1); + let n = binop(&mut arena, c, Operator::Plus, l); + assert!(aexpr_to_vortex_expression(n, &arena, None).is_none()); + } + + /// Plus on a Boolean column → numeric gate refuses (Vortex `checked_add` would + /// `vortex_bail!` at scan-time). + #[test] + fn shape_plus_bool_column_returns_none() { + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); + let one = lit_i32(&mut arena, 1); + let n = binop(&mut arena, a, Operator::Plus, one); + let schema = schema_a_bool(); // `a` is Boolean here + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// Plus on a String column → numeric gate refuses (Polars allows Plus-as-concat; + /// Vortex `checked_add` does not). + #[test] + fn shape_plus_string_column_returns_none() { + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); + let b = col(&mut arena, "b"); + let n = binop(&mut arena, a, Operator::Plus, b); + let mut schema = Schema::default(); + schema.with_column(PlSmallStr::from("a"), DataType::String); + schema.with_column(PlSmallStr::from("b"), DataType::String); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// Plus on Float64 columns → numeric gate passes (Float is in the numeric set). + #[test] + fn shape_plus_float_column_passes() { + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); + let b = col(&mut arena, "b"); + let n = binop(&mut arena, a, Operator::Plus, b); + let mut schema = Schema::default(); + schema.with_column(PlSmallStr::from("a"), DataType::Float64); + schema.with_column(PlSmallStr::from("b"), DataType::Float64); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + /// Nested Plus `(a + b) + c` — `operand_is_numeric` recurses through the inner Plus + /// (the `Operator::Plus` arm of `operand_is_numeric`) so all three Int32 columns pass + /// the gate. + #[test] + fn shape_plus_nested_numeric_passes() { + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); + let b = col(&mut arena, "b"); + let a_plus_b = binop(&mut arena, a, Operator::Plus, b); + let c = col(&mut arena, "c"); + let n = binop(&mut arena, a_plus_b, Operator::Plus, c); + let mut schema = Schema::default(); + schema.with_column(PlSmallStr::from("a"), DataType::Int32); + schema.with_column(PlSmallStr::from("b"), DataType::Int32); + schema.with_column(PlSmallStr::from("c"), DataType::Int32); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + /// Plus with a non-Plus BinaryExpr operand `(a * b) + c` — the inner Multiply + /// is not yet supported by the convertor (returns None at the outer level via the + /// Plus arm's `?`-propagation on `lhs`), but the gate ALSO refuses because + /// `operand_is_numeric` only recurses on the inner `Plus` arm; any other BinaryExpr + /// op returns false. Both layers refuse: the gate is the first line of defense, the + /// unsupported Multiply arm is the second. + #[test] + fn shape_plus_with_multiply_operand_returns_none() { + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); + let b = col(&mut arena, "b"); + let a_times_b = binop(&mut arena, a, Operator::Multiply, b); + let c = col(&mut arena, "c"); + let n = binop(&mut arena, a_times_b, Operator::Plus, c); + let mut schema = Schema::default(); + schema.with_column(PlSmallStr::from("a"), DataType::Int32); + schema.with_column(PlSmallStr::from("b"), DataType::Int32); + schema.with_column(PlSmallStr::from("c"), DataType::Int32); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// Plus on cross-PType operands `Int32 + Int64` — refused by the PR-2.4 proactive + /// pairwise-equal-PType gate (addresses PR-2.3 cycle-2 H4 self-reinforcement + /// finding). Without the gate, Vortex's `Binary::return_dtype` would `vortex_bail!` + /// at scan-time because `Int32.eq_ignore_nullability(Int64) == false`. + #[test] + fn shape_plus_cross_ptype_returns_none() { + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); // Int32 per schema_a_b_int32 + // Build an Int64 literal (PType differs from the Int32 column). + use polars_core::prelude::AnyValue; + let one_i64 = arena.add(AExpr::Literal(LiteralValue::Scalar(Scalar::new( + DataType::Int64, + AnyValue::Int64(1), + )))); + let n = binop(&mut arena, a, Operator::Plus, one_i64); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// Plus on Int + Float — both numeric, different PType. Refused by pairwise gate. + /// (PR-2.4 cycle-2 should-fix F-PLUS-CROSS-FLOAT-INT-TEST-001.) + #[test] + fn shape_plus_int_plus_float_returns_none() { + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); // Int32 + use polars_core::prelude::AnyValue; + let one_f64 = arena.add(AExpr::Literal(LiteralValue::Scalar(Scalar::new( + DataType::Float64, + AnyValue::Float64(1.0), + )))); + let n = binop(&mut arena, a, Operator::Plus, one_f64); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// Plus on UInt + Int — both numeric, different signedness. Refused by pairwise gate. + /// (PR-2.4 cycle-2 nit N2.) + #[test] + fn shape_plus_uint_plus_int_returns_none() { + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); // UInt32 per this test's schema + use polars_core::prelude::AnyValue; + let one_i32 = lit_i32(&mut arena, 1); + let n = binop(&mut arena, a, Operator::Plus, one_i32); + let mut schema = Schema::default(); + schema.with_column(PlSmallStr::from("a"), DataType::UInt32); + let _ = AnyValue::UInt32(1); // explicit construction of the literal type checked elsewhere + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// Plus on same-PType operands wrapped through a CAST `Cast(col_int32, Int64) + lit_i64` + /// — passes the pairwise gate (resolve_inner_dtype follows the Cast). Verifies the + /// gate doesn't over-refuse when TYPE_COERCION did its job. + #[test] + fn shape_plus_cast_then_same_ptype_passes() { + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); // Int32 per schema_a_b_int32 + let cast_a = arena.add(AExpr::Cast { + expr: a, + dtype: DataType::Int64, + options: CastOptions::Strict, + }); + use polars_core::prelude::AnyValue; + let one_i64 = arena.add(AExpr::Literal(LiteralValue::Scalar(Scalar::new( + DataType::Int64, + AnyValue::Int64(1), + )))); + let n = binop(&mut arena, cast_a, Operator::Plus, one_i64); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + /// Structural assertion (cycle-1 must-fix from gauntlet — addresses the + /// tautological-test concern carried forward from PR-2.1 cycle-1, for this PR's most + /// load-bearing new shape). Verifies the Plus → `checked_add` mapping actually + /// produces the expected Vortex `Expression` shape, not just `.is_some()`. A + /// paste-swap bug (e.g., `Operator::Plus => checked_mul(...)`) would be caught here. + #[test] + fn shape_plus_arithmetic_structural() { + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); + let one = lit_i32(&mut arena, 1); + let a_plus_1 = binop(&mut arena, a, Operator::Plus, one); + let five = lit_i32(&mut arena, 5); + let n = binop(&mut arena, a_plus_1, Operator::Eq, five); + let schema = schema_a_b_int32(); + let expr = aexpr_to_vortex_expression(n, &arena, Some(&schema)).expect("Some"); + // Vortex's SQL-form Display produces a stable string. The exact format may + // evolve across Vortex releases; assert only the recognizable structural + // anchors (operator names + literal values + the column reference) rather + // than the full string. + let s = format!("{}", expr); + assert!( + s.contains("checked_add") || s.contains("+"), + "expected checked_add or '+' in {s}" + ); + assert!(s.contains("a"), "expected column 'a' in {s}"); + assert!(s.contains("1"), "expected literal 1 in {s}"); + assert!(s.contains("5"), "expected literal 5 in {s}"); + // Sanity: the outer-Eq structure should be visible. + assert!( + s.contains("=") || s.contains("eq"), + "expected eq operator in {s}" + ); + } + + /// Minus / Multiply / Divide remain residual until upstream Vortex exposes + /// `checked_sub`/`checked_mul`/`checked_div` as public builders. (As of vortex + /// 0.70.0 only `checked_add` is exposed.) + #[test] + fn unsupported_minus_returns_none() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let l = lit_i32(&mut arena, 1); + let n = binop(&mut arena, c, Operator::Minus, l); + assert!(aexpr_to_vortex_expression(n, &arena, None).is_none()); + } + + #[test] + fn unsupported_multiply_returns_none() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let l = lit_i32(&mut arena, 2); + let n = binop(&mut arena, c, Operator::Multiply, l); + assert!(aexpr_to_vortex_expression(n, &arena, None).is_none()); + } + + #[test] + fn unsupported_xor_returns_none() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let l = lit_i32(&mut arena, 1); + let n = binop(&mut arena, c, Operator::Xor, l); + assert!(aexpr_to_vortex_expression(n, &arena, None).is_none()); + } + + #[test] + fn unsupported_eq_validity_returns_none() { + // EqValidity is the null-aware equality variant Vortex doesn't have a direct + // equivalent for; the convertor explicitly returns None at the match arm. + // (cycle-1 fresh-lens F-002.) + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let l = lit_i32(&mut arena, 1); + let n = binop(&mut arena, c, Operator::EqValidity, l); + assert!(aexpr_to_vortex_expression(n, &arena, None).is_none()); + } + + #[test] + fn unsupported_not_eq_validity_returns_none() { + // NotEqValidity — paired with EqValidity. (cycle-1 fresh-lens F-002.) + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let l = lit_i32(&mut arena, 1); + let n = binop(&mut arena, c, Operator::NotEqValidity, l); + assert!(aexpr_to_vortex_expression(n, &arena, None).is_none()); + } + + // === PR-2.3 / PR-13.3: CAST in predicates === + + /// Helper: schema with String columns `a` and `b` for cross-kind CAST tests. + fn schema_a_b_string() -> Schema { + let mut s = Schema::default(); + s.with_column(PlSmallStr::from("a"), DataType::String); + s.with_column(PlSmallStr::from("b"), DataType::String); + s + } + + /// CAST Int32 → Int64 (same kind: Primitive → Primitive) ships in PR-2.3 — + /// was pre-PR-2.3 None, now Some with schema. + #[test] + fn shape_cast_to_int64_ships_in_pr_2_3() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let n = arena.add(AExpr::Cast { + expr: c, + dtype: DataType::Int64, + options: CastOptions::Strict, + }); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + /// CAST Int32 → Float64 (same kind: Primitive → Primitive) — supported. + #[test] + fn shape_cast_to_float64() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let n = arena.add(AExpr::Cast { + expr: c, + dtype: DataType::Float64, + options: CastOptions::Strict, + }); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + /// CAST Boolean → Boolean (degenerate same-kind: validity widening) — supported. + #[test] + fn shape_cast_bool_to_bool() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let n = arena.add(AExpr::Cast { + expr: c, + dtype: DataType::Boolean, + options: CastOptions::Strict, + }); + let schema = schema_a_bool(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + /// CAST String → String (degenerate same-kind: validity widening) — supported. + #[test] + fn shape_cast_string_to_string() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let n = arena.add(AExpr::Cast { + expr: c, + dtype: DataType::String, + options: CastOptions::Strict, + }); + let schema = schema_a_b_string(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + /// CAST Int32 → Boolean (cross-kind: Primitive → Bool) — refused (PR-2.3 + /// cycle-1 must-fix). Vortex's `Primitive::CastKernel` returns `Ok(None)` for + /// non-Primitive targets and `cast/mod.rs:120` then `vortex_bail!`s. + #[test] + fn shape_cast_int_to_bool_returns_none() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let n = arena.add(AExpr::Cast { + expr: c, + dtype: DataType::Boolean, + options: CastOptions::Strict, + }); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// CAST Int32 → String (cross-kind: Primitive → Utf8) — refused. + #[test] + fn shape_cast_int_to_string_returns_none() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let n = arena.add(AExpr::Cast { + expr: c, + dtype: DataType::String, + options: CastOptions::Strict, + }); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// CAST Bool → Int (cross-kind) — refused. + #[test] + fn shape_cast_bool_to_int_returns_none() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let n = arena.add(AExpr::Cast { + expr: c, + dtype: DataType::Int64, + options: CastOptions::Strict, + }); + let schema = schema_a_bool(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// CAST String → Int (cross-kind) — refused. + #[test] + fn shape_cast_string_to_int_returns_none() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let n = arena.add(AExpr::Cast { + expr: c, + dtype: DataType::Int64, + options: CastOptions::Strict, + }); + let schema = schema_a_b_string(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// CAST Bool → String (cross-kind) — refused (PR-2.3 cycle-2 C2-CAST-001). + #[test] + fn shape_cast_bool_to_string_returns_none() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let n = arena.add(AExpr::Cast { + expr: c, + dtype: DataType::String, + options: CastOptions::Strict, + }); + let schema = schema_a_bool(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// CAST String → Bool (cross-kind) — refused (PR-2.3 cycle-2 C2-CAST-001). + #[test] + fn shape_cast_string_to_bool_returns_none() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let n = arena.add(AExpr::Cast { + expr: c, + dtype: DataType::Boolean, + options: CastOptions::Strict, + }); + let schema = schema_a_b_string(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// CAST without schema → conservative refuse (cycle-1 must-fix: the + /// source-dtype gate cannot resolve without schema). + #[test] + fn shape_cast_without_schema_returns_none() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let n = arena.add(AExpr::Cast { + expr: c, + dtype: DataType::Int64, + options: CastOptions::Strict, + }); + assert!(aexpr_to_vortex_expression(n, &arena, None).is_none()); + } + + /// CAST with `CastOptions::NonStrict` — refused (PR-2.3 cycle-1 must-fix). + /// Polars NonStrict overflow → null differs from Vortex's fail-on-overflow; + /// pushing down would convert Polars's null-on-overflow into a scan error. + #[test] + fn shape_cast_non_strict_returns_none() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let n = arena.add(AExpr::Cast { + expr: c, + dtype: DataType::Int64, + options: CastOptions::NonStrict, + }); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// CAST with `CastOptions::Overflowing` — refused (PR-2.3 cycle-1 must-fix). + /// Polars Overflowing wraps on overflow; Vortex errors. Same divergence as + /// the NonStrict case. + #[test] + fn shape_cast_overflowing_returns_none() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let n = arena.add(AExpr::Cast { + expr: c, + dtype: DataType::Int64, + options: CastOptions::Overflowing, + }); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// CAST to Decimal — refused at the dtype-mapper level (Vortex Decimal + /// scale/precision interactions are NOT validated at the polars-vortex layer; + /// tracked in the function doc as deferred). + #[cfg(feature = "dtype-decimal")] + #[test] + fn shape_cast_to_decimal_returns_none() { + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let n = arena.add(AExpr::Cast { + expr: c, + dtype: DataType::Decimal(10, 2), + options: CastOptions::Strict, + }); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// CAST nested in a comparison — `col.cast(Int64) > 100` per the plan's PR-13.3 + /// acceptance test. The literal is `lit_i64` (not `lit_i32`) to mirror the + /// post-TYPE_COERCION shape that production AExpr produces; the PR-2.4 cycle-2 + /// comparison pairwise gate would refuse if the operand dtypes differed. + #[test] + fn shape_cast_then_compare() { + use polars_core::prelude::AnyValue; + let mut arena = Arena::new(); + let c = col(&mut arena, "a"); + let cast_node = arena.add(AExpr::Cast { + expr: c, + dtype: DataType::Int64, + options: CastOptions::Strict, + }); + let l = arena.add(AExpr::Literal(LiteralValue::Scalar(Scalar::new( + DataType::Int64, + AnyValue::Int64(100), + )))); + let n = binop(&mut arena, cast_node, Operator::Gt, l); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + // === PR-2.4 / PR-13.4: Struct field access in predicates === + + /// Helper: build `AExpr::Function` for `IRStructFunction::FieldByName(name)`. + #[cfg(feature = "dtype-struct")] + fn struct_field(arena: &mut Arena, name: &str, arg: Node) -> Node { + let expr_ir = ExprIR::new(arg, OutputName::Alias(PlSmallStr::EMPTY)); + arena.add(AExpr::Function { + input: vec![expr_ir], + function: IRFunctionExpr::StructExpr(IRStructFunction::FieldByName(PlSmallStr::from( + name, + ))), + options: Default::default(), + }) + } + + /// Helper: schema where `s` is `Struct { inner: String, count: Int32 }`. + #[cfg(feature = "dtype-struct")] + fn schema_struct() -> Schema { + use polars_core::prelude::Field; + let mut s = Schema::default(); + let struct_dtype = DataType::Struct(vec![ + Field::new(PlSmallStr::from("inner"), DataType::String), + Field::new(PlSmallStr::from("count"), DataType::Int32), + ]); + s.with_column(PlSmallStr::from("s"), struct_dtype); + s + } + + /// PR-13.4 acceptance: `col.struct.field("inner") == "x"` against a + /// `Struct { inner: String, .. }` column pushes down. + #[cfg(feature = "dtype-struct")] + #[test] + fn shape_struct_field_then_compare() { + let mut arena = Arena::new(); + let s = col(&mut arena, "s"); + let field_node = struct_field(&mut arena, "inner", s); + // Literal "x" — use String scalar. + use polars_core::prelude::AnyValue; + let lit_node = arena.add(AExpr::Literal(LiteralValue::Scalar(Scalar::new( + DataType::String, + AnyValue::StringOwned(PlSmallStr::from("x")), + )))); + let n = binop(&mut arena, field_node, Operator::Eq, lit_node); + let schema = schema_struct(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + /// Struct field access without schema → conservative refuse (the gate cannot + /// verify the field exists in the struct's dtype). + #[cfg(feature = "dtype-struct")] + #[test] + fn shape_struct_field_without_schema_returns_none() { + let mut arena = Arena::new(); + let s = col(&mut arena, "s"); + let n = struct_field(&mut arena, "inner", s); + assert!(aexpr_to_vortex_expression(n, &arena, None).is_none()); + } + + /// Struct field access referencing a non-existent field → refuse (Vortex + /// `GetItem.return_dtype` would `vortex_err!` at scan-time). + #[cfg(feature = "dtype-struct")] + #[test] + fn shape_struct_field_unknown_field_returns_none() { + let mut arena = Arena::new(); + let s = col(&mut arena, "s"); + let n = struct_field(&mut arena, "nonexistent", s); + let schema = schema_struct(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// Struct field access on a non-struct column → refuse (struct_field_exists + /// returns false for non-Struct dtype). + #[cfg(feature = "dtype-struct")] + #[test] + fn shape_struct_field_on_int_column_returns_none() { + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); + let n = struct_field(&mut arena, "inner", a); + let schema = schema_a_b_int32(); // `a` is Int32, not Struct + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// Nested struct access `s.field("outer").field("inner")` — the convertor + /// recurses through resolve_inner_dtype's StructExpr arm. + #[cfg(feature = "dtype-struct")] + #[test] + fn shape_struct_field_nested() { + use polars_core::prelude::Field; + let mut arena = Arena::new(); + let s = col(&mut arena, "s"); + let outer = struct_field(&mut arena, "outer", s); + let inner = struct_field(&mut arena, "inner", outer); + // Schema: `s: Struct { outer: Struct { inner: String } }`. + let inner_struct = DataType::Struct(vec![Field::new( + PlSmallStr::from("inner"), + DataType::String, + )]); + let outer_struct = + DataType::Struct(vec![Field::new(PlSmallStr::from("outer"), inner_struct)]); + let mut schema = Schema::default(); + schema.with_column(PlSmallStr::from("s"), outer_struct); + assert!(aexpr_to_vortex_expression(inner, &arena, Some(&schema)).is_some()); + } + + #[test] + fn unsupported_literal_dyn_returns_none() { + // Dyn literals need a target dtype for materialization; foundation falls through. + let mut arena = Arena::new(); + let n = arena.add(AExpr::Literal(LiteralValue::Dyn(DynLiteralValue::Int(42)))); + assert!(aexpr_to_vortex_expression(n, &arena, None).is_none()); + } + + /// Sanity: a deeply nested predicate `(a == 1) AND ((b > 2) OR is_null(c))` + /// passes through end-to-end. + #[test] + fn integration_nested_predicate() { + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); + let one = lit_i32(&mut arena, 1); + let a_eq_1 = binop(&mut arena, a, Operator::Eq, one); + + let b = col(&mut arena, "b"); + let two = lit_i32(&mut arena, 2); + let b_gt_2 = binop(&mut arena, b, Operator::Gt, two); + + let c = col(&mut arena, "c"); + let c_is_null = boolean_fn(&mut arena, IRBooleanFunction::IsNull, c); + + let inner_or = binop(&mut arena, b_gt_2, Operator::Or, c_is_null); + let root_node = binop(&mut arena, a_eq_1, Operator::And, inner_or); + + // Both Or and And fire the schema gate; provide schema with a/b/c Int32. + let mut schema = Schema::default(); + schema.with_column(PlSmallStr::from("a"), DataType::Int32); + schema.with_column(PlSmallStr::from("b"), DataType::Int32); + schema.with_column(PlSmallStr::from("c"), DataType::Int32); + assert!(aexpr_to_vortex_expression(root_node, &arena, Some(&schema)).is_some()); + } + + /// Sanity: a single still-unsupported sub-shape poisons the whole tree. + /// (Updated for PR-2.2: Plus is now supported, so use Minus to exercise the + /// poison path.) + #[test] + fn integration_unsupported_subexpr_returns_none() { + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); + let one = lit_i32(&mut arena, 1); + let a_minus_1 = binop(&mut arena, a, Operator::Minus, one); + let five = lit_i32(&mut arena, 5); + let n = binop(&mut arena, a_minus_1, Operator::Eq, five); + assert!(aexpr_to_vortex_expression(n, &arena, None).is_none()); + } + + // === PR-2.7 amend tests: cutover-lost pushdown shapes === + // + // Coverage: is_between (positive + Left/Right/None variants + cross-PType + // gate negative + structural shape_is_between_structural), is_in (positive + // arena tests deferred — arena-level Series-literal construction requires + // spinning up a polars-core Series + AnyValue::List + DataType::List shell; + // coverage stays at the e2e Python layer via test_scan_with_is_in_filter + // and the Phase 2 phase-end spec-lens Deferred entry "is_in arena-level + // structural test"), starts_with/ends_with/contains{literal:true|false} + // (positive + negative + structural variants), Ternary (positive + + // cross-dtype gate negative + nested-matching-dtypes via PR-2.7 cycle-2 + // resolve_inner_dtype Ternary arm + structural). + + #[cfg(feature = "is_between")] + fn schema_x_int32() -> Schema { + let mut s = Schema::default(); + s.with_column(PlSmallStr::from("x"), DataType::Int32); + s + } + + /// is_between with `ClosedInterval::Both` — `(col >= lo) AND (col <= hi)`. + #[cfg(feature = "is_between")] + #[test] + fn shape_is_between_both_inclusive() { + use polars_ops::series::ClosedInterval; + let mut arena = Arena::new(); + let c = col(&mut arena, "x"); + let lo = lit_i32(&mut arena, 10); + let hi = lit_i32(&mut arena, 20); + let c_ir = ExprIR::new(c, OutputName::Alias(PlSmallStr::EMPTY)); + let lo_ir = ExprIR::new(lo, OutputName::Alias(PlSmallStr::EMPTY)); + let hi_ir = ExprIR::new(hi, OutputName::Alias(PlSmallStr::EMPTY)); + let n = arena.add(AExpr::Function { + input: vec![c_ir, lo_ir, hi_ir], + function: IRFunctionExpr::Boolean(IRBooleanFunction::IsBetween { + closed: ClosedInterval::Both, + }), + options: FunctionOptions::default(), + }); + let schema = schema_x_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + /// is_between with `ClosedInterval::Left` — `(col >= lo) AND (col < hi)`. + #[cfg(feature = "is_between")] + #[test] + fn shape_is_between_left_inclusive() { + use polars_ops::series::ClosedInterval; + let mut arena = Arena::new(); + let c = col(&mut arena, "x"); + let lo = lit_i32(&mut arena, 10); + let hi = lit_i32(&mut arena, 20); + let c_ir = ExprIR::new(c, OutputName::Alias(PlSmallStr::EMPTY)); + let lo_ir = ExprIR::new(lo, OutputName::Alias(PlSmallStr::EMPTY)); + let hi_ir = ExprIR::new(hi, OutputName::Alias(PlSmallStr::EMPTY)); + let n = arena.add(AExpr::Function { + input: vec![c_ir, lo_ir, hi_ir], + function: IRFunctionExpr::Boolean(IRBooleanFunction::IsBetween { + closed: ClosedInterval::Left, + }), + options: FunctionOptions::default(), + }); + let schema = schema_x_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + /// is_between with `ClosedInterval::Right` — `(col > lo) AND (col <= hi)`. + /// Closes ClosedInterval coverage gap (PR-2.7 cycle-1 should-fix #6d). + #[cfg(feature = "is_between")] + #[test] + fn shape_is_between_right_inclusive() { + use polars_ops::series::ClosedInterval; + let mut arena = Arena::new(); + let c = col(&mut arena, "x"); + let lo = lit_i32(&mut arena, 10); + let hi = lit_i32(&mut arena, 20); + let c_ir = ExprIR::new(c, OutputName::Alias(PlSmallStr::EMPTY)); + let lo_ir = ExprIR::new(lo, OutputName::Alias(PlSmallStr::EMPTY)); + let hi_ir = ExprIR::new(hi, OutputName::Alias(PlSmallStr::EMPTY)); + let n = arena.add(AExpr::Function { + input: vec![c_ir, lo_ir, hi_ir], + function: IRFunctionExpr::Boolean(IRBooleanFunction::IsBetween { + closed: ClosedInterval::Right, + }), + options: FunctionOptions::default(), + }); + let schema = schema_x_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + /// is_between with `ClosedInterval::None` — `(col > lo) AND (col < hi)`. + /// Closes ClosedInterval coverage gap (PR-2.7 cycle-1 should-fix #6d). + #[cfg(feature = "is_between")] + #[test] + fn shape_is_between_none_exclusive() { + use polars_ops::series::ClosedInterval; + let mut arena = Arena::new(); + let c = col(&mut arena, "x"); + let lo = lit_i32(&mut arena, 10); + let hi = lit_i32(&mut arena, 20); + let c_ir = ExprIR::new(c, OutputName::Alias(PlSmallStr::EMPTY)); + let lo_ir = ExprIR::new(lo, OutputName::Alias(PlSmallStr::EMPTY)); + let hi_ir = ExprIR::new(hi, OutputName::Alias(PlSmallStr::EMPTY)); + let n = arena.add(AExpr::Function { + input: vec![c_ir, lo_ir, hi_ir], + function: IRFunctionExpr::Boolean(IRBooleanFunction::IsBetween { + closed: ClosedInterval::None, + }), + options: FunctionOptions::default(), + }); + let schema = schema_x_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + /// is_between on cross-PType bounds (Int32 col, Int64 bounds) refuses pushdown + /// per the explicit pairwise-PType gate (PR-2.7 cycle-1 must-fix #1). Without + /// the gate, Vortex's `Binary::return_dtype` would `vortex_bail!` at scan-time. + /// Mirrors the existing `shape_eq_cross_ptype_returns_none` discipline. + #[cfg(feature = "is_between")] + #[test] + fn shape_is_between_cross_ptype_returns_none() { + use polars_ops::series::ClosedInterval; + let mut arena = Arena::new(); + let c = col(&mut arena, "x"); // Int32 per schema_x_int32 + let lo = arena.add(AExpr::Literal(LiteralValue::Scalar(Scalar::new( + DataType::Int64, + AnyValue::Int64(10), + )))); + let hi = arena.add(AExpr::Literal(LiteralValue::Scalar(Scalar::new( + DataType::Int64, + AnyValue::Int64(20), + )))); + let c_ir = ExprIR::new(c, OutputName::Alias(PlSmallStr::EMPTY)); + let lo_ir = ExprIR::new(lo, OutputName::Alias(PlSmallStr::EMPTY)); + let hi_ir = ExprIR::new(hi, OutputName::Alias(PlSmallStr::EMPTY)); + let n = arena.add(AExpr::Function { + input: vec![c_ir, lo_ir, hi_ir], + function: IRFunctionExpr::Boolean(IRBooleanFunction::IsBetween { + closed: ClosedInterval::Both, + }), + options: FunctionOptions::default(), + }); + let schema = schema_x_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// is_between without schema → conservative refuse (the explicit gate at the + /// arm's top consults `schema?`). + #[cfg(feature = "is_between")] + #[test] + fn shape_is_between_without_schema_returns_none() { + use polars_ops::series::ClosedInterval; + let mut arena = Arena::new(); + let c = col(&mut arena, "x"); + let lo = lit_i32(&mut arena, 10); + let hi = lit_i32(&mut arena, 20); + let c_ir = ExprIR::new(c, OutputName::Alias(PlSmallStr::EMPTY)); + let lo_ir = ExprIR::new(lo, OutputName::Alias(PlSmallStr::EMPTY)); + let hi_ir = ExprIR::new(hi, OutputName::Alias(PlSmallStr::EMPTY)); + let n = arena.add(AExpr::Function { + input: vec![c_ir, lo_ir, hi_ir], + function: IRFunctionExpr::Boolean(IRBooleanFunction::IsBetween { + closed: ClosedInterval::Both, + }), + options: FunctionOptions::default(), + }); + assert!(aexpr_to_vortex_expression(n, &arena, None).is_none()); + } + + #[cfg(feature = "strings")] + fn lit_str(arena: &mut Arena, value: &str) -> Node { + let scalar = Scalar::new( + DataType::String, + AnyValue::StringOwned(PlSmallStr::from(value)), + ); + arena.add(AExpr::Literal(LiteralValue::Scalar(scalar))) + } + + /// String column schema for the StringExpr tests (PR-2.7 cycle-1 must-fix #3 + /// adds a Utf8 input gate that requires schema to resolve `col`'s dtype). + #[cfg(feature = "strings")] + fn schema_s_string() -> Schema { + let mut s = Schema::default(); + s.with_column(PlSmallStr::from("s"), DataType::String); + s + } + + /// Schema with both a String column `s` and an Int32 column `i`, for the + /// StringExpr non-String-column refusal test. + #[cfg(feature = "strings")] + fn schema_s_string_i_int32() -> Schema { + let mut s = Schema::default(); + s.with_column(PlSmallStr::from("s"), DataType::String); + s.with_column(PlSmallStr::from("i"), DataType::Int32); + s + } + + #[cfg(feature = "strings")] + fn string_function( + arena: &mut Arena, + str_fn: IRStringFunction, + col_node: Node, + needle: Node, + ) -> Node { + let col_ir = ExprIR::new(col_node, OutputName::Alias(PlSmallStr::EMPTY)); + let needle_ir = ExprIR::new(needle, OutputName::Alias(PlSmallStr::EMPTY)); + arena.add(AExpr::Function { + input: vec![col_ir, needle_ir], + function: IRFunctionExpr::StringExpr(str_fn), + options: FunctionOptions::default(), + }) + } + + /// starts_with positive — `col.str.starts_with("prefix")` → `like(col, "prefix%")`. + #[cfg(feature = "strings")] + #[test] + fn shape_starts_with_positive() { + let mut arena = Arena::new(); + let c = col(&mut arena, "s"); + let needle = lit_str(&mut arena, "prefix"); + let n = string_function(&mut arena, IRStringFunction::StartsWith, c, needle); + let schema = schema_s_string(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + /// starts_with with a `%` in the needle — refuse (would change LIKE semantics). + /// Same refuse logic for `_` and `\` (covered by `bytes_to_like_literal`). + #[cfg(feature = "strings")] + #[test] + fn shape_starts_with_wildcard_in_needle_returns_none() { + let mut arena = Arena::new(); + let c = col(&mut arena, "s"); + let needle = lit_str(&mut arena, "100%"); + let n = string_function(&mut arena, IRStringFunction::StartsWith, c, needle); + let schema = schema_s_string(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// starts_with with `_` in the needle — refuse (same `bytes_to_like_literal` + /// guard). Closes the cycle-1 should-fix #6e and nit #11 coverage gap. + #[cfg(feature = "strings")] + #[test] + fn shape_starts_with_underscore_in_needle_returns_none() { + let mut arena = Arena::new(); + let c = col(&mut arena, "s"); + let needle = lit_str(&mut arena, "foo_bar"); + let n = string_function(&mut arena, IRStringFunction::StartsWith, c, needle); + let schema = schema_s_string(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// starts_with with `\` (backslash) in the needle — refuse (LIKE's escape + /// character; same `bytes_to_like_literal` guard). + #[cfg(feature = "strings")] + #[test] + fn shape_starts_with_backslash_in_needle_returns_none() { + let mut arena = Arena::new(); + let c = col(&mut arena, "s"); + let needle = lit_str(&mut arena, "foo\\bar"); + let n = string_function(&mut arena, IRStringFunction::StartsWith, c, needle); + let schema = schema_s_string(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// starts_with on a non-String column (Int32) — refuses pushdown per the + /// explicit Utf8 input gate (PR-2.7 cycle-1 must-fix #3). Without the gate, + /// Vortex's `Like::return_dtype` would `vortex_bail!` at scan-time. + #[cfg(feature = "strings")] + #[test] + fn shape_starts_with_on_int_column_returns_none() { + let mut arena = Arena::new(); + let c = col(&mut arena, "i"); // Int32 per schema_s_string_i_int32 + let needle = lit_str(&mut arena, "prefix"); + let n = string_function(&mut arena, IRStringFunction::StartsWith, c, needle); + let schema = schema_s_string_i_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// StringExpr without schema → conservative refuse (the Utf8 gate consults + /// `schema?`). + #[cfg(feature = "strings")] + #[test] + fn shape_starts_with_without_schema_returns_none() { + let mut arena = Arena::new(); + let c = col(&mut arena, "s"); + let needle = lit_str(&mut arena, "prefix"); + let n = string_function(&mut arena, IRStringFunction::StartsWith, c, needle); + assert!(aexpr_to_vortex_expression(n, &arena, None).is_none()); + } + + /// ends_with positive — `col.str.ends_with("suffix")` → `like(col, "%suffix")`. + #[cfg(feature = "strings")] + #[test] + fn shape_ends_with_positive() { + let mut arena = Arena::new(); + let c = col(&mut arena, "s"); + let needle = lit_str(&mut arena, "suffix"); + let n = string_function(&mut arena, IRStringFunction::EndsWith, c, needle); + let schema = schema_s_string(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + /// contains{literal:true} positive — `col.str.contains("sub", literal=True)` + /// → `like(col, "%sub%")`. + #[cfg(all(feature = "strings", feature = "regex"))] + #[test] + fn shape_contains_literal_true_positive() { + let mut arena = Arena::new(); + let c = col(&mut arena, "s"); + let needle = lit_str(&mut arena, "substr"); + let n = string_function( + &mut arena, + IRStringFunction::Contains { + literal: true, + strict: false, + }, + c, + needle, + ); + let schema = schema_s_string(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + /// contains{literal:false} — regex pattern; Vortex LIKE doesn't support regex; + /// refuse. Preserves the always-SAFE-fallback contract: the residual filter + /// applies the regex post-decode. + #[cfg(all(feature = "strings", feature = "regex"))] + #[test] + fn shape_contains_literal_false_returns_none() { + let mut arena = Arena::new(); + let c = col(&mut arena, "s"); + let needle = lit_str(&mut arena, ".*"); + let n = string_function( + &mut arena, + IRStringFunction::Contains { + literal: false, + strict: false, + }, + c, + needle, + ); + let schema = schema_s_string(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// Ternary positive — `when(a == 1).then(a).otherwise(b)` → + /// `case_when(eq(col_a, lit(1)), col_a, col_b)`. All three subtrees push down, + /// so the Ternary as a whole pushes down. + #[test] + fn shape_ternary_positive() { + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); + let one = lit_i32(&mut arena, 1); + let predicate = binop(&mut arena, a, Operator::Eq, one); + let a2 = col(&mut arena, "a"); + let b = col(&mut arena, "b"); + let n = arena.add(AExpr::Ternary { + predicate, + truthy: a2, + falsy: b, + }); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + /// Ternary with an unsupported subtree (Minus arithmetic) — fail-closed. + #[test] + fn shape_ternary_unsupported_subtree_returns_none() { + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); + let one = lit_i32(&mut arena, 1); + let a_minus_1 = binop(&mut arena, a, Operator::Minus, one); + let predicate = binop(&mut arena, a, Operator::Eq, one); + let b = col(&mut arena, "b"); + let n = arena.add(AExpr::Ternary { + predicate, + truthy: a_minus_1, + falsy: b, + }); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// Ternary with cross-dtype truthy/falsy (truthy=Int32 col `a`, falsy=Int64 + /// literal) — refuses pushdown per the explicit THEN/ELSE pairwise-dtype gate + /// (PR-2.7 cycle-1 must-fix #2). Without the gate, Vortex's + /// `CaseWhen::return_dtype` would `vortex_bail!` at scan-time. Mirrors the + /// `shape_eq_cross_ptype_returns_none` discipline. + #[test] + fn shape_ternary_cross_dtype_returns_none() { + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); // Int32 per schema_a_b_int32 + let one_i32 = lit_i32(&mut arena, 1); + let predicate = binop(&mut arena, a, Operator::Eq, one_i32); + let a2 = col(&mut arena, "a"); // Int32 + let int64_lit = arena.add(AExpr::Literal(LiteralValue::Scalar(Scalar::new( + DataType::Int64, + AnyValue::Int64(7), + )))); + let n = arena.add(AExpr::Ternary { + predicate, + truthy: a2, + falsy: int64_lit, + }); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_none()); + } + + /// Ternary without schema → conservative refuse (the explicit gate at the + /// arm's top consults `schema?`). + #[test] + fn shape_ternary_without_schema_returns_none() { + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); + let one = lit_i32(&mut arena, 1); + let predicate = binop(&mut arena, a, Operator::Eq, one); + let a2 = col(&mut arena, "a"); + let b = col(&mut arena, "b"); + let n = arena.add(AExpr::Ternary { + predicate, + truthy: a2, + falsy: b, + }); + assert!(aexpr_to_vortex_expression(n, &arena, None).is_none()); + } + + // === PR-2.8: per-column file-vs-virtual minterm split helper === + + /// `aexpr_file_minterms_to_vortex_expression` positive — predicate with all + /// file-only minterms (no virtual_cols in any leaf) AND-collects all + /// conjuncts. Structural assertion (cycle-1 should-fix): asserts BOTH + /// literals appear in the produced expression, defending against a + /// regression where one minterm is dropped spuriously. Literals 42 / 99 + /// chosen as unique discriminators (no overlap with each other or with + /// connector keywords). + #[test] + fn minterms_all_file_only_collects_all() { + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); + let fortytwo = lit_i32(&mut arena, 42); + let eq_a = binop(&mut arena, a, Operator::Eq, fortytwo); + let b = col(&mut arena, "b"); + let ninetynine = lit_i32(&mut arena, 99); + let eq_b = binop(&mut arena, b, Operator::Eq, ninetynine); + let and_node = binop(&mut arena, eq_a, Operator::And, eq_b); + let schema = schema_a_b_int32(); + let virtual_cols: PlHashSet = PlHashSet::default(); + let expr = aexpr_file_minterms_to_vortex_expression( + and_node, + &arena, + Some(&schema), + &virtual_cols, + ) + .expect("expected file-only AND to push down"); + let s = format!("{}", expr); + // Both literal-bearing conjuncts must survive into the AND-collected + // result. A regression dropping one minterm would fail one of these. + assert!( + s.contains("42"), + "expected literal 42 (from a == 42) in {s}" + ); + assert!( + s.contains("99"), + "expected literal 99 (from b == 99) in {s}" + ); + } + + /// Predicate references only virtual cols → all minterms filtered out → None. + #[test] + fn minterms_all_virtual_returns_none() { + let mut arena = Arena::new(); + let year = col(&mut arena, "year"); + let twentyfour = lit_i32(&mut arena, 2024); + let n = binop(&mut arena, year, Operator::Eq, twentyfour); + // schema must contain `year` for the comparison gate (gate's + // resolve_inner_dtype call); without it the minterm wouldn't even + // convert. The point of this test: even WITH a schema that resolves + // the minterm, the minterm filter drops it as virtual. + let mut schema = Schema::default(); + schema.with_column(PlSmallStr::from("year"), DataType::Int32); + let mut virtual_cols: PlHashSet = PlHashSet::default(); + virtual_cols.insert(PlSmallStr::from("year")); + let expr = + aexpr_file_minterms_to_vortex_expression(n, &arena, Some(&schema), &virtual_cols); + assert!(expr.is_none(), "expected virtual-only predicate to refuse"); + } + + /// Mixed predicate: `a == 42 AND year == 9999` with `year` virtual → + /// pushes only the `a == 42` minterm; virtual conjunct dropped. + /// Structural assertion (cycle-1 should-fix): positive anchor on the + /// kept-side literal `42` + negative anchor on the dropped-side literal + /// `9999`. Defends against (a) regressions where the helper ignores + /// `virtual_cols` entirely (pushes both → `9999` would appear), (b) + /// filter-polarity inversion (keeps virtual, drops file → `42` would be + /// absent and `9999` present). + #[test] + fn minterms_partial_pushes_file_part_only() { + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); + let fortytwo = lit_i32(&mut arena, 42); + let eq_a = binop(&mut arena, a, Operator::Eq, fortytwo); + let year = col(&mut arena, "year"); + let ninethousand = lit_i32(&mut arena, 9999); + let eq_year = binop(&mut arena, year, Operator::Eq, ninethousand); + let and_node = binop(&mut arena, eq_a, Operator::And, eq_year); + let mut schema = Schema::default(); + schema.with_column(PlSmallStr::from("a"), DataType::Int32); + schema.with_column(PlSmallStr::from("year"), DataType::Int32); + let mut virtual_cols: PlHashSet = PlHashSet::default(); + virtual_cols.insert(PlSmallStr::from("year")); + let expr = aexpr_file_minterms_to_vortex_expression( + and_node, + &arena, + Some(&schema), + &virtual_cols, + ) + .expect("expected file-column conjunct (a == 42) to push despite virtual conjunct"); + let s = format!("{}", expr); + assert!( + s.contains("42"), + "expected kept-side literal 42 (from a == 42) in {s}" + ); + assert!( + !s.contains("9999"), + "unexpected dropped-side literal 9999 (from year == 9999) in {s} — \ + virtual conjunct leaked into pushdown?" + ); + } + + /// Top-level OR with one virtual operand → single minterm referencing both + /// → refuse pushdown (CAN'T split an OR partially without changing + /// semantics: pushing only one side would yield a NARROWER predicate + /// vs the OR — Vortex would drop rows that satisfy the other side). + #[test] + fn minterms_top_level_or_with_virtual_refuses() { + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); + let one = lit_i32(&mut arena, 1); + let eq_a = binop(&mut arena, a, Operator::Eq, one); + let year = col(&mut arena, "year"); + let twentyfour = lit_i32(&mut arena, 2024); + let eq_year = binop(&mut arena, year, Operator::Eq, twentyfour); + let or_node = binop(&mut arena, eq_a, Operator::Or, eq_year); + let mut schema = Schema::default(); + schema.with_column(PlSmallStr::from("a"), DataType::Int32); + schema.with_column(PlSmallStr::from("year"), DataType::Int32); + let mut virtual_cols: PlHashSet = PlHashSet::default(); + virtual_cols.insert(PlSmallStr::from("year")); + let expr = + aexpr_file_minterms_to_vortex_expression(or_node, &arena, Some(&schema), &virtual_cols); + assert!( + expr.is_none(), + "expected top-level OR with virtual operand to refuse (single minterm references virtual)" + ); + } + + /// Unsupported subtree (Minus arithmetic) in one conjunct + supported in + /// the other → pushes only the supported conjunct. Demonstrates the + /// PARTIAL-conversion win even without virtual cols (an improvement over + /// the prior all-or-nothing `aexpr_to_vortex_expression` direct call). + /// Structural assertion (cycle-1 should-fix): positive anchor on the + /// kept-side literal `42` + negative anchors on the dropped-subtree + /// literals (`7` from `b - 7`, `888` from the eq RHS). Defends against + /// regressions where the helper somehow pushed the Minus-bearing minterm + /// (producing an invalid Vortex expression that would crash at execute + /// time — a unit-test crash is preferable to a runtime panic). + #[test] + fn minterms_unsupported_subtree_dropped_in_partial_push() { + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); + let fortytwo = lit_i32(&mut arena, 42); + let eq_a = binop(&mut arena, a, Operator::Eq, fortytwo); + let b = col(&mut arena, "b"); + let seven = lit_i32(&mut arena, 7); + let b_minus_7 = binop(&mut arena, b, Operator::Minus, seven); + let eighteighteight = lit_i32(&mut arena, 888); + let minus_eq = binop(&mut arena, b_minus_7, Operator::Eq, eighteighteight); + let and_node = binop(&mut arena, eq_a, Operator::And, minus_eq); + let schema = schema_a_b_int32(); + let virtual_cols: PlHashSet = PlHashSet::default(); + let expr = aexpr_file_minterms_to_vortex_expression( + and_node, + &arena, + Some(&schema), + &virtual_cols, + ) + .expect("expected supported conjunct (a == 42) to push down even with unsupported sibling"); + let s = format!("{}", expr); + assert!( + s.contains("42"), + "expected kept-side literal 42 (from a == 42) in {s}" + ); + assert!( + !s.contains("888"), + "unexpected dropped-side literal 888 (from (b - 7) == 888) in {s} — \ + unsupported Minus subtree leaked into pushdown?" + ); + } + + /// Cross-PR test (Phase 2 cycle-2 correctness lens): PR-2.7 shape + /// (`is_between`) inside a PR-2.8 minterm-split with a virtual col. Walks + /// the structurally-safe path: `aexpr_to_leaf_names_iter` descends through + /// `Function::Boolean(IsBetween)` (per traverse.rs's children_rev), finds + /// only `x` (file col, not in virtual_cols), so the is_between minterm is + /// kept. The `year == 9999` minterm references `year` (in virtual_cols) + /// and is dropped. Helper AND-collects the kept is_between minterm. + /// + /// Defends against a regression where leaf-walking through Function args + /// silently drops or mis-classifies leaves — would manifest as either + /// pushing too much (year leaks) or pushing nothing (is_between mis-dropped). + #[cfg(feature = "is_between")] + #[test] + fn minterms_is_between_with_virtual_col_pushes_is_between_only() { + use polars_ops::series::ClosedInterval; + let mut arena = Arena::new(); + // is_between(x, 10, 20, Both) — file-only minterm + let x = col(&mut arena, "x"); + let lo = lit_i32(&mut arena, 10); + let hi = lit_i32(&mut arena, 20); + let x_ir = ExprIR::new(x, OutputName::Alias(PlSmallStr::EMPTY)); + let lo_ir = ExprIR::new(lo, OutputName::Alias(PlSmallStr::EMPTY)); + let hi_ir = ExprIR::new(hi, OutputName::Alias(PlSmallStr::EMPTY)); + let is_between_node = arena.add(AExpr::Function { + input: vec![x_ir, lo_ir, hi_ir], + function: IRFunctionExpr::Boolean(IRBooleanFunction::IsBetween { + closed: ClosedInterval::Both, + }), + options: FunctionOptions::default(), + }); + // year == 9999 — virtual-only minterm (year is hive) + let year = col(&mut arena, "year"); + let ninethousand = lit_i32(&mut arena, 9999); + let eq_year = binop(&mut arena, year, Operator::Eq, ninethousand); + // Combine via top-level And + let and_node = binop(&mut arena, is_between_node, Operator::And, eq_year); + let mut schema = Schema::default(); + schema.with_column(PlSmallStr::from("x"), DataType::Int32); + schema.with_column(PlSmallStr::from("year"), DataType::Int32); + let mut virtual_cols: PlHashSet = PlHashSet::default(); + virtual_cols.insert(PlSmallStr::from("year")); + let expr = aexpr_file_minterms_to_vortex_expression( + and_node, + &arena, + Some(&schema), + &virtual_cols, + ) + .expect("expected is_between minterm to push despite virtual year sibling"); + let s = format!("{}", expr); + // is_between(x, 10, 20, Both) → `and(gt_eq(x, 10), lt_eq(x, 20))`. + // Positive anchors: both bounds present (10 + 20) → confirms is_between + // structure kept. Negative anchor: dropped-side literal 9999 absent. + assert!( + s.contains("10"), + "expected is_between lower bound 10 in {s}" + ); + assert!( + s.contains("20"), + "expected is_between upper bound 20 in {s}" + ); + assert!( + !s.contains("9999"), + "unexpected dropped-side literal 9999 (from year == 9999) in {s} — \ + virtual conjunct leaked into pushdown? (leaf-walk through is_between's \ + Function args must descend correctly via aexpr_to_leaf_names_iter)" + ); + } + + /// Nested Ternary with matching inner dtypes — pushes down via the + /// `resolve_inner_dtype` Ternary arm (PR-2.7 cycle-2 should-fix #1). Before + /// the arm was added, the outer Ternary's THEN/ELSE gate consulted + /// `resolve_inner_dtype(inner_ternary)?` which fell to the catchall `None`, + /// `?`-propagating to refuse pushdown unconditionally. With the recursive + /// arm, nested Ternary chains resolve correctly. Tree: + /// `when(a == 1).then(when(a > 0).then(a).otherwise(a)).otherwise(b)` + /// — outer truthy is `case_when(...)`, outer falsy is col `b` (Int32); + /// inner truthy/falsy both Int32 → inner resolves to Int32 → outer t==e → pushes. + #[test] + fn shape_ternary_nested_matching_dtypes_pushes() { + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); // Int32 + let zero = lit_i32(&mut arena, 0); + let one = lit_i32(&mut arena, 1); + let inner_pred = binop(&mut arena, a, Operator::Gt, zero); + let a2 = col(&mut arena, "a"); // Int32 + let a3 = col(&mut arena, "a"); // Int32 + let inner_ternary = arena.add(AExpr::Ternary { + predicate: inner_pred, + truthy: a2, + falsy: a3, + }); + let outer_pred = binop(&mut arena, a, Operator::Eq, one); + let b = col(&mut arena, "b"); // Int32 per schema_a_b_int32 + let n = arena.add(AExpr::Ternary { + predicate: outer_pred, + truthy: inner_ternary, + falsy: b, + }); + let schema = schema_a_b_int32(); + assert!(aexpr_to_vortex_expression(n, &arena, Some(&schema)).is_some()); + } + + // === PR-2.7 cycle-1 should-fix #4: engagement-assertion structural tests === + // + // Each test asserts the produced Vortex Expression's Display output contains + // recognizable structural anchors for the shape's expected builder. Catches + // paste-swap bugs (e.g., is_between → eq, starts_with → ends_with). Mirrors + // the `shape_plus_arithmetic_structural` discipline established for the Plus + // arm in PR-2.2 cycle-1 (the cycle-1 fresh reviewer's escalation that + // tautological `.is_some()` assertions miss paste-swap bugs). Closes + // acceptance criterion (c): "Each new shape engaged via display_tree() OR + // POLARS_VERBOSE assertion in at least one test." + + /// is_between structural — decomposed to `and(gt_eq, lt_eq)` for Both. A + /// paste-swap to `eq` or `or` would change the display string. + #[cfg(feature = "is_between")] + #[test] + fn shape_is_between_structural() { + use polars_ops::series::ClosedInterval; + let mut arena = Arena::new(); + let c = col(&mut arena, "x"); + let lo = lit_i32(&mut arena, 10); + let hi = lit_i32(&mut arena, 20); + let c_ir = ExprIR::new(c, OutputName::Alias(PlSmallStr::EMPTY)); + let lo_ir = ExprIR::new(lo, OutputName::Alias(PlSmallStr::EMPTY)); + let hi_ir = ExprIR::new(hi, OutputName::Alias(PlSmallStr::EMPTY)); + let n = arena.add(AExpr::Function { + input: vec![c_ir, lo_ir, hi_ir], + function: IRFunctionExpr::Boolean(IRBooleanFunction::IsBetween { + closed: ClosedInterval::Both, + }), + options: FunctionOptions::default(), + }); + let schema = schema_x_int32(); + let expr = aexpr_to_vortex_expression(n, &arena, Some(&schema)).expect("Some"); + let s = format!("{}", expr); + // Structural anchors: AND + lower-bound + upper-bound + column + bounds. + assert!( + s.contains("and") || s.contains("AND") || s.contains("&&"), + "expected and in {s}" + ); + assert!( + s.contains(">=") || s.contains("gt_eq"), + "expected >= or gt_eq in {s}" + ); + assert!( + s.contains("<=") || s.contains("lt_eq"), + "expected <= or lt_eq in {s}" + ); + assert!(s.contains('x'), "expected column 'x' in {s}"); + assert!(s.contains("10"), "expected literal 10 in {s}"); + assert!(s.contains("20"), "expected literal 20 in {s}"); + // Paste-swap anchor (PR-2.7 cycle-2 should-fix #3): an `eq` paste-swap + // would produce `(x = 10) AND (x = 20)` — no `>` / `<` chars. The + // correct form contains both. The cycle-1 negative anchor checked + // `!s.contains("eq(") && !s.contains("==")` which was vacuous: Vortex + // displays binary ops as single-char `=`, never `eq(` or `==`. + assert!( + s.contains('>'), + "expected '>' (in '>=') in {s} (paste-swap to eq?)" + ); + assert!( + s.contains('<'), + "expected '<' (in '<=') in {s} (paste-swap to eq?)" + ); + } + + /// starts_with structural — `like(col, lit("prefix%"))`. + /// + /// PR-2.7 cycle-2 should-fix #4 strengthens this test: assert the JOINT + /// substring `prefix%` (in order) AND the negative anchor `!"%prefix"`. The + /// cycle-1 version only checked `s.contains("prefix")` + `s.contains('%')` + /// independently, which a paste-swap to the ends_with branch (producing + /// `like(col, "%prefix")`) would silently pass. + #[cfg(feature = "strings")] + #[test] + fn shape_starts_with_structural() { + let mut arena = Arena::new(); + let c = col(&mut arena, "s"); + let needle = lit_str(&mut arena, "prefix"); + let n = string_function(&mut arena, IRStringFunction::StartsWith, c, needle); + let schema = schema_s_string(); + let expr = aexpr_to_vortex_expression(n, &arena, Some(&schema)).expect("Some"); + let s = format!("{}", expr); + assert!( + s.contains("like") || s.contains("LIKE"), + "expected like in {s}" + ); + assert!(s.contains('s'), "expected column 's' in {s}"); + // Joint-substring positive anchor (PR-2.7 cycle-2 should-fix #4): + // correct starts_with emits `prefix%` (% AFTER needle). + assert!( + s.contains("prefix%"), + "expected joint 'prefix%' (% after needle) in {s} \ + (paste-swap to ends_with would emit '%prefix' instead)" + ); + // Negative anchor: ends_with's `%prefix` pattern must be absent. + assert!( + !s.contains("%prefix"), + "unexpected '%prefix' (% before needle) in {s} \ + (paste-swap to ends_with branch?)" + ); + } + + /// ends_with structural — `like(col, lit("%suffix"))`. Same joint+negative + /// discipline as `shape_starts_with_structural` (PR-2.7 cycle-2 should-fix #4). + #[cfg(feature = "strings")] + #[test] + fn shape_ends_with_structural() { + let mut arena = Arena::new(); + let c = col(&mut arena, "s"); + let needle = lit_str(&mut arena, "suffix"); + let n = string_function(&mut arena, IRStringFunction::EndsWith, c, needle); + let schema = schema_s_string(); + let expr = aexpr_to_vortex_expression(n, &arena, Some(&schema)).expect("Some"); + let s = format!("{}", expr); + assert!( + s.contains("like") || s.contains("LIKE"), + "expected like in {s}" + ); + // Joint-substring positive anchor: correct ends_with emits `%suffix`. + assert!( + s.contains("%suffix"), + "expected joint '%suffix' (% before needle) in {s} \ + (paste-swap to starts_with would emit 'suffix%' instead)" + ); + // Negative anchor: starts_with's `suffix%` pattern must be absent. + assert!( + !s.contains("suffix%"), + "unexpected 'suffix%' (% after needle) in {s} \ + (paste-swap to starts_with branch?)" + ); + } + + /// contains{literal:true} structural — `like(col, lit("%sub%"))`. + #[cfg(all(feature = "strings", feature = "regex"))] + #[test] + fn shape_contains_literal_true_structural() { + let mut arena = Arena::new(); + let c = col(&mut arena, "s"); + let needle = lit_str(&mut arena, "sub"); + let n = string_function( + &mut arena, + IRStringFunction::Contains { + literal: true, + strict: false, + }, + c, + needle, + ); + let schema = schema_s_string(); + let expr = aexpr_to_vortex_expression(n, &arena, Some(&schema)).expect("Some"); + let s = format!("{}", expr); + assert!( + s.contains("like") || s.contains("LIKE"), + "expected like in {s}" + ); + assert!(s.contains("sub"), "expected 'sub' in {s}"); + // Both '%' wildcards present (prefix + suffix). + assert!( + s.matches('%').count() >= 2, + "expected two '%' wildcards in {s}" + ); + } + + /// Ternary structural — `case_when(condition, then, else)`. A paste-swap to + /// `if` or wrong builder would change the display anchor. + /// + /// PR-2.7 cycle-2 should-fix #5 strengthens this test: use distinct integer + /// literals (777 truthy, 888 falsy) so positional ordering becomes + /// asserting. The cycle-1 version used col 'a' (truthy) + col 'b' (falsy) + /// which a paste-swap to `case_when(cond, b, a)` would silently pass (both + /// columns remain present in the display). + #[test] + fn shape_ternary_structural() { + let mut arena = Arena::new(); + let a = col(&mut arena, "a"); + let one = lit_i32(&mut arena, 1); + let predicate = binop(&mut arena, a, Operator::Eq, one); + // Distinct integer literals discriminate paste-swap order. + let truthy = lit_i32(&mut arena, 777); + let falsy = lit_i32(&mut arena, 888); + let n = arena.add(AExpr::Ternary { + predicate, + truthy, + falsy, + }); + let schema = schema_a_b_int32(); + let expr = aexpr_to_vortex_expression(n, &arena, Some(&schema)).expect("Some"); + let s = format!("{}", expr); + // Vortex's case_when builder produces a "case" or "case_when" anchor. + assert!( + s.contains("case") || s.contains("CASE") || s.contains("when"), + "expected case/case_when in {s}" + ); + assert!(s.contains("777"), "expected truthy literal 777 in {s}"); + assert!(s.contains("888"), "expected falsy literal 888 in {s}"); + // Positional-ordering anchor: in a correct case_when(cond, 777, 888), + // the truthy 777 appears BEFORE the falsy 888 in the display string. + let t_pos = s.find("777").expect("777 absent"); + let f_pos = s.find("888").expect("888 absent"); + assert!( + t_pos < f_pos, + "expected truthy 777 BEFORE falsy 888 in {s} \ + (paste-swap to case_when(cond, falsy, truthy)?)" + ); + } +} diff --git a/crates/polars-stream/src/nodes/io_sources/vortex/builder.rs b/crates/polars-stream/src/nodes/io_sources/vortex/builder.rs index f3f1d1665ac4..1777a6df845f 100644 --- a/crates/polars-stream/src/nodes/io_sources/vortex/builder.rs +++ b/crates/polars-stream/src/nodes/io_sources/vortex/builder.rs @@ -6,6 +6,7 @@ use polars_io::cloud::CloudOptions; use polars_io::metrics::IOMetrics; use polars_plan::dsl::ScanSource; use polars_vortex::read::VortexSegmentCacheRef; +use polars_vortex::vortex::expr::Expression as VortexExpression; use polars_vortex::{VortexScanOptions, vortex}; use super::VortexFileReader; @@ -23,6 +24,17 @@ pub struct VortexReaderBuilder { /// streaming source falls back to `options.segment_cache.resolve()` (e.g., user-supplied /// schema path where no postscript read happened at IR-build). pub segment_cache: Option, + /// AExpr-direct convertor result (PR-13.2, sole pushdown path as of PR-2.6): when + /// the predicate translated cleanly via + /// `polars_plan::plans::predicates::vortex_convertor::aexpr_to_vortex_expression` + /// (the `aexpr` module is `pub(crate)`; the externally-resolvable path goes through + /// the `pub use aexpr::*` re-export at `plans/mod.rs`), + /// the Vortex `Expression` is captured here at IR-build time (where we still have + /// `expr_arena` access). `VortexFileReader::begin_read` uses it directly. The + /// multi-scan layer reapplies the full predicate post-decode regardless (we + /// advertise `PARTIAL_FILTER`), so it is safe to push only a subset; shapes the + /// convertor returns `None` for fall through to no-pushdown + post-decode reapply. + pub aexpr_filter: Option, pub io_metrics: std::sync::OnceLock>, } @@ -42,9 +54,12 @@ impl FileReaderBuilder for VortexReaderBuilder { fn reader_capabilities(&self) -> ReaderCapabilities { use ReaderCapabilities as RC; // The multi-scan layer reapplies the full predicate post-decode, so PARTIAL_FILTER - // is always safe; FULL_FILTER would require the convertor to consume every shape - // it sees (out-of-scope while we lean on `SpecializedColumnPredicate`). - // EXTERNAL_FILTER_MASK would need Vortex `Selection` bitmap plumbing. + // is always safe; FULL_FILTER would require the AExpr-direct convertor to be a + // strict superset of every AExpr predicate Polars constructs (still out of scope: + // the convertor returns None for unhandled shapes — Sort/Gather/Filter/Agg/Ternary/ + // AnonymousFunction/Over/Rolling/temporal extracts/etc. — and the multi-scan + // reapply handles them). EXTERNAL_FILTER_MASK would need Vortex `Selection` bitmap + // plumbing. RC::ROW_INDEX | RC::PRE_SLICE | RC::NEGATIVE_PRE_SLICE @@ -78,6 +93,11 @@ impl FileReaderBuilder for VortexReaderBuilder { // Threaded resolved cache for the data read; `None` triggers fallback resolve() // inside `VortexFileReader::initialize`. Same pattern as `footer` above. segment_cache: self.segment_cache.clone(), + // AExpr-direct convertor result (sole pushdown path as of PR-2.6). Shared + // across all sources in a multi-source scan — the convertor result is + // purely a function of the (predicate, schema) pair, both of which are + // constant across the scan's sources. + aexpr_filter: self.aexpr_filter.clone(), io_metrics: OptIOMetrics(self.io_metrics.get().cloned()), init_data: None, }) as _ diff --git a/crates/polars-stream/src/nodes/io_sources/vortex/mod.rs b/crates/polars-stream/src/nodes/io_sources/vortex/mod.rs index 92e62b70b5a9..58b6dfe3e8d1 100644 --- a/crates/polars-stream/src/nodes/io_sources/vortex/mod.rs +++ b/crates/polars-stream/src/nodes/io_sources/vortex/mod.rs @@ -20,7 +20,6 @@ use polars_utils::slice_enum::Slice; use polars_vortex::read::array_bridge::{ ArrowUpstreamSchema, arrow_dtypes_from_schema, record_batch_to_dataframe, }; -use polars_vortex::read::predicate::polars_to_vortex_predicate; use polars_vortex::read::read_at::local_file_read_at; use polars_vortex::read::schema::vortex_dtype_to_schema; use polars_vortex::session::session; @@ -51,6 +50,14 @@ pub struct VortexFileReader { /// read. When `None` (user-supplied-schema path; no IR-build postscript read happened), /// `initialize()` falls back to `options.segment_cache.resolve()` for a fresh cache. pub segment_cache: Option, + /// AExpr-direct convertor result threaded from IR-build (`FileScanIR::Vortex` → + /// `VortexReaderBuilder::aexpr_filter` → here). When `Some`, `begin_read` uses this + /// Vortex `Expression` directly. PR-2.6 (Option B → A cutover) deleted the legacy + /// `polars_to_vortex_predicate` fallback — the AExpr-direct convertor is now the + /// sole filter-pushdown path. When `None` (unhandled AExpr shape, or + /// `push_predicate=false`), no filter pushes down; the multi-scan layer reapplies + /// the predicate post-decode (we advertise `PARTIAL_FILTER`). + pub aexpr_filter: Option, pub io_metrics: OptIOMetrics, /// Set by `initialize()`. @@ -248,12 +255,18 @@ impl FileReader for VortexFileReader { } }); - // Translate the pushable bits of args.predicate into a Vortex `Expression`. We - // advertise `PARTIAL_FILTER` capability, so the multi-scan layer keeps the - // original predicate around to apply post-decode — pushing only what we can - // convert is safe (over-conservative pushdown would drop rows incorrectly). + // Use the AExpr-direct convertor result computed at IR-build time + // (`physical_plan::lower_ir`). We advertise `PARTIAL_FILTER` capability so the + // multi-scan layer keeps the original predicate around to apply post-decode — + // pushing only what we can convert is safe. + // + // PR-2.6 (Option B → A cutover) deleted the legacy `polars_to_vortex_predicate` + // (`SpecializedColumnPredicate`-derived) fallback path; the AExpr-direct + // convertor is now the sole filter-pushdown path. When `aexpr_filter` is `None` + // (unhandled AExpr shape, or `push_predicate=false`), no filter pushes down and + // the multi-scan reapply handles correctness. let filter_expr = if self.options.push_predicate { - args.predicate.as_ref().and_then(polars_to_vortex_predicate) + self.aexpr_filter.clone() } else { None }; diff --git a/crates/polars-stream/src/physical_plan/lower_ir.rs b/crates/polars-stream/src/physical_plan/lower_ir.rs index ef5b77404666..82f75c344660 100644 --- a/crates/polars-stream/src/physical_plan/lower_ir.rs +++ b/crates/polars-stream/src/physical_plan/lower_ir.rs @@ -20,6 +20,8 @@ use polars_plan::dsl::default_values::DefaultFieldValues; use polars_plan::dsl::deletion::DeletionFilesList; use polars_plan::dsl::{CallbackSinkType, ExtraColumnsPolicy, FileScanIR, SinkTypeIR}; use polars_plan::plans::expr_ir::{ExprIR, OutputName}; +#[cfg(feature = "vortex")] +use polars_plan::plans::predicates::vortex_convertor; use polars_plan::plans::{AExpr, FunctionIR, IR, IRAggExpr, LiteralValue, write_ir_non_recursive}; use polars_plan::prelude::*; use polars_utils::arena::{Arena, Node}; @@ -767,17 +769,76 @@ pub fn lower_ir( options, metadata: first_metadata, segment_cache, - } => Arc::new( - crate::nodes::io_sources::vortex::builder::VortexReaderBuilder { - options: Arc::new(options.clone()), - first_metadata: first_metadata.clone(), - // Threaded from IR-build (scans.rs::vortex_file_info caller); when - // None (schema-supplied path), the streaming source resolves a - // fresh cache from `options.segment_cache`. - segment_cache: segment_cache.clone(), - io_metrics: std::sync::OnceLock::new(), - }, - ) as _, + } => { + // PR-13.2 AExpr-direct pushdown (sole pushdown path as of PR-2.6): + // when a predicate exists and `push_predicate` is on, try the + // convertor. The result rides on the builder; `VortexFileReader:: + // begin_read` uses it directly. We are still at IR-build time + // here, so we have `expr_arena` access — which we lose by the + // time `begin_read` runs (where only `Arc` + // survives). When the convertor returns `None` (unhandled AExpr + // shape OR all minterms touched virtual columns), `aexpr_filter` + // stays `None` and no filter pushes down; the multi-scan layer + // reapplies the predicate post-decode via the `PARTIAL_FILTER` + // capability so correctness is preserved. + // + // Per-column file-vs-virtual split (PR-2.8, replaces PR-2.2 + // cycle-1 must-fix M1's all-or-nothing guard): `file_info.schema` + // "Does not include logical columns like `include_file_path` and + // row index" but DOES include hive columns (plans/schema.rs:42-43). + // The convertor's `AExpr::Column` arm emits a bare `get_item(name, + // root())` with no schema-membership check, so a predicate + // referencing a hive column / row_index / include_file_paths would + // emit a Vortex reference to a column not in the file's data — + // Vortex would bail at `into_array_stream`. + // + // Build a `virtual_cols` set from hive partition column names + // + row_index name + include_file_paths name. Pass it to + // `vortex_convertor::aexpr_file_minterms_to_vortex_expression`, + // which walks top-level conjuncts via `MintermIter` and converts + // only the file-only minterms; virtual-column-touching minterms + // are left for the multi-scan reapply. Mirrors + // `polars-mem-engine/scan_predicate/functions::create_scan_predicate`'s + // `hive_predicate` extraction. Resolves Deferred work entry + // "Virtual-column-partitioned Vortex scans don't benefit from + // AExpr convertor pushdown". + let aexpr_filter = if options.push_predicate { + predicate.as_ref().and_then(|p| { + let mut virtual_cols: PlHashSet = PlHashSet::default(); + if let Some(hp) = hive_parts.as_ref() { + for name in hp.schema().iter_names() { + virtual_cols.insert(name.clone()); + } + } + if let Some(ri) = unified_scan_args.row_index.as_ref() { + virtual_cols.insert(ri.name.clone()); + } + if let Some(ifp) = unified_scan_args.include_file_paths.as_ref() { + virtual_cols.insert(ifp.clone()); + } + vortex_convertor::aexpr_file_minterms_to_vortex_expression( + p.node(), + expr_arena, + Some(file_info.schema.as_ref()), + &virtual_cols, + ) + }) + } else { + None + }; + Arc::new( + crate::nodes::io_sources::vortex::builder::VortexReaderBuilder { + options: Arc::new(options.clone()), + first_metadata: first_metadata.clone(), + // Threaded from IR-build (scans.rs::vortex_file_info caller); when + // None (schema-supplied path), the streaming source resolves a + // fresh cache from `options.segment_cache`. + segment_cache: segment_cache.clone(), + aexpr_filter, + io_metrics: std::sync::OnceLock::new(), + }, + ) as _ + }, #[cfg(feature = "csv")] FileScanIR::Csv { options } => { diff --git a/crates/polars-vortex/README.md b/crates/polars-vortex/README.md index 36edfffd8d30..69ded9f5812a 100644 --- a/crates/polars-vortex/README.md +++ b/crates/polars-vortex/README.md @@ -43,10 +43,17 @@ uses _all_ of these properties — not just "Vortex as another Parquet". What this looks like in practice: -- **Filter pushdown**: predicates that the optimizer extracts as `SpecializedColumnPredicate` (Equal - / Between / EqualOneOf / StartsWith / EndsWith) are translated to Vortex `Expression`s and handed - to `ScanBuilder::with_filter`. Inside Vortex, `LayoutReader::pruning_evaluation` consults per-zone +- **Filter pushdown**: an AExpr-direct convertor walks the predicate's `Arena` and emits + Vortex `Expression`s for the shapes Vortex can represent — column references, scalar literals, the + six comparison operators (`Eq`/`NotEq`/`Lt`/`LtEq`/`Gt`/`GtEq`), boolean combinators + (`And`/`Or`/`Not`), null checks (`IsNull`/`IsNotNull`), numeric addition (`Plus → checked_add`), + same-kind `CAST` (`Primitive↔Primitive` / `Bool↔Bool` / `Utf8↔Utf8` under `Strict` options), and + struct field access (`col.struct.field("inner")`). The Expression is handed to + `ScanBuilder::with_filter`; inside Vortex, `LayoutReader::pruning_evaluation` consults per-zone statistics and skips chunks that can't satisfy the predicate — _without decompressing them_. + Multiple type-safety gates (bitwise-vs-logical, numeric-only Plus, kind-compatible CAST, + same-PType comparison) refuse pushdown for shapes Vortex would scan-time-error on; the multi-scan + layer always re-applies the full predicate post-decode so partial pushdown is always _safe_. - **Negative slice pushdown**: `lf.tail(N)` becomes `ScanBuilder::with_row_range(...)`, not "decode everything and take the last N". The file's row count comes from the footer (free). - **Segment cache reuse**: second and subsequent scans of the same file skip decompression for any @@ -73,7 +80,10 @@ LazyFrame::scan_vortex(path, args) │ with IOMetrics + with_concurrency_budget) └─ begin_read(args): projection → vortex::expr::pack(get_item(...)) - predicate → polars_to_vortex_predicate (ColumnPredicates) + predicate → builder.aexpr_filter (AExpr-direct convertor result + computed at IR-build time in lower_ir.rs; PR-2.6 + cutover deleted the legacy + SpecializedColumnPredicate path) pre_slice → ScanBuilder::with_row_range .into_array_stream() → Stream for each chunk: @@ -141,28 +151,44 @@ impl) → morsels are converted to Vortex `ArrayRef`s by the reverse C-ABI bridg so all `CloudOptions` semantics (auth, retry, region overrides, credential providers) are honored — the same way Parquet's cloud reads work. No buffer copy. -4. **Filter pushdown via `SpecializedColumnPredicate`** ([`read/predicate.rs`]) — the Polars - optimizer already extracts single-column predicates into structured form: - `ColumnPredicates::predicates: PlHashMap)>`. - We pattern-match on the specialized variant and emit Vortex `Expression`s: - - | Polars `SpecializedColumnPredicate` | Vortex `Expression` | - | ----------------------------------- | -------------------------------------------------- | - | `Equal(scalar)` | `eq(get_item(col, root()), lit(scalar))` | - | `Between(lo, hi)` | `and(gt_eq(col, lo), lt_eq(col, hi))` | - | `EqualOneOf(scalars)` | `or_collect` of `eq(col, lit(s))` (all-or-nothing) | - | `StartsWith(bytes)` | `like(col, lit("prefix%"))` (wildcard-safe) | - | `EndsWith(bytes)` | `like(col, lit("%suffix"))` (wildcard-safe) | - | `RegexMatch(...)` | not pushed (Vortex `like` doesn't do regex) | - - `EqualOneOf` requires _every_ scalar in the IN-list to convert successfully — otherwise we leave - the whole predicate as residual. This prevents accidentally pushing a narrower predicate than the - user wrote. - - Anything we can't push (multi-column predicates, arithmetic, regex, struct field access, etc.) - stays as a residual filter. The reader advertises `ReaderCapabilities::PARTIAL_FILTER`, so - Polars' multi-scan layer always re-applies the original full predicate post-decode. Result: - pushdown is always _safe_, just sometimes _partial_. +4. **AExpr-direct filter pushdown** ( + [`polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs`](../polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs)) + — walks the predicate's `Arena` at IR-build time in + [`polars-stream/src/physical_plan/lower_ir.rs`](../polars-stream/src/physical_plan/lower_ir.rs) + (the `FileScanIR::Vortex` arm). The resulting Vortex `Expression` (or `None`) is attached to + `VortexReaderBuilder.aexpr_filter` and consumed by `VortexFileReader::begin_read` → + `ScanBuilder::with_filter`. Coverage: + + | Polars `AExpr` shape | Vortex `Expression` | Gates | + | ------------------------------------------------- | --------------------------------------- | ------------------------------------ | + | `Column(name)` | `get_item(name, root())` | virtual-col drop in helper (PR-2.8) | + | `Literal(Scalar)` | `lit(polars_scalar_to_vortex(...))` | none | + | comparisons (Eq/NotEq/Lt/LtEq/Gt/GtEq) | `eq`/`not_eq`/`lt`/`lt_eq`/`gt`/`gt_eq` | pairwise-equal-PType + schema | + | logical AND/OR (And/Or/LogicalAnd/LogicalOr) | `and`/`or` | bitwise-vs-logical schema gate | + | numeric addition (Plus) | `checked_add` | numeric + pairwise-equal-PType | + | same-kind CAST (Strict only) | `cast(child, vortex_dtype)` | source-kind + Strict-options gates | + | struct field access (`col.struct.field("inner")`) | `get_item(field_name, struct_expr)` | schema-membership gate | + | `IsNull` / `IsNotNull` | `is_null` / `is_not_null` | none | + | `Not` | `not` | boolean-only schema gate | + | `is_between(lo, hi, closed)` (PR-2.7) | `and(gt/gt_eq, lt/lt_eq)` | pairwise-PType (col vs lo/hi) | + | `is_in([scalars])` (PR-2.7) | `or_collect(eq(col, s_i) for s_i)` | refuse on nulls_equal + had_nulls | + | `str.starts_with(prefix)` (PR-2.7) | `like(col, lit("prefix%"))` | Utf8 input + `bytes_to_like_literal` | + | `str.ends_with(suffix)` (PR-2.7) | `like(col, lit("%suffix"))` | Utf8 input + `bytes_to_like_literal` | + | `str.contains(needle, literal=True)` (PR-2.7) | `like(col, lit("%needle%"))` | Utf8 input + `bytes_to_like_literal` | + | `Ternary(p, t, f)` (PR-2.7) | `case_when([(p, t)], Some(f))` | THEN/ELSE pairwise-dtype + schema | + + Pushdown is refused (returns `None` → residual) for unhandled shapes (Sort, Gather, Filter, Agg, + AnonymousFunction, Over, Rolling, temporal extracts, `str.contains(literal=False)` regex, etc.), + for non-Strict `CastOptions`, for cross-kind CAST, and for cross-PType arithmetic/comparison. For + predicates referencing hive partition columns or virtual columns (row_index, include_file_paths), + PR-2.8's per-minterm split (`aexpr_file_minterms_to_vortex_expression`) drops minterms whose + leaves include any virtual col and AND-collects the file-only minterms — strictly better than the + prior all-or-nothing refuse (e.g., `(file_col > 5) & (year == 2024)` with `year` hive now pushes + the `file_col > 5` part). The reader advertises `ReaderCapabilities::PARTIAL_FILTER`, so Polars' + multi-scan layer always re-applies the original full predicate post-decode. Result: pushdown is + always _safe_, just sometimes _partial_. Historical note: PR-2.6 deleted the previous + `SpecializedColumnPredicate`-derived path (a parallel fast path during PR-13.1–.5); PR-2.7 + + PR-2.8 then ported the remaining shapes back into the AExpr-direct convertor. ## Cargo features @@ -209,24 +235,32 @@ Standard Polars `storage_options=` — credentials, retry config, endpoint overr ## Pushdown coverage at a glance -| Pushdown | Status | Path | -| ------------------------------------------------------ | ------------------------------- | -------------------------------------------------------------------------------------- | -| Projection (column subset) | ✅ | `polars Projection` → `vortex::expr::pack(get_item(...))` | -| Slice (positive) | ✅ | `Slice::Positive` → `ScanBuilder::with_row_range` | -| Slice (negative, e.g. `.tail(N)`) | ✅ | `restrict_to_bounds(row_count)` (footer-cached row count) | -| Filter — `==`, `Between`, `is_in` | ✅ | `SpecializedColumnPredicate` → Vortex `Expression` | -| Filter — `starts_with`, `ends_with` | ✅ | LIKE pattern (wildcard-safe) | -| Filter — temporal (`Date`, `Datetime`, `Time`) scalars | ✅ (when `dtype-*` features on) | Vortex Date/Time/Timestamp extension scalars | -| Filter — `Decimal` scalars | ✅ (when `dtype-decimal` on) | Vortex `DecimalValue::I128` | -| Filter — `Duration` scalars | ❌ residual | Vortex has no Duration extension dtype yet | -| Filter — regex | ❌ residual | Vortex `like` doesn't do regex | -| Filter — arithmetic (`col + 1 > 5`) | ❌ residual | AExpr traversal not yet wired (PR-13) | -| Filter — `CAST(col, ...)` | ❌ residual | AExpr traversal (PR-13) | -| Filter — struct field access | ❌ residual | AExpr traversal (PR-13) | -| Zone-level pruning | ✅ | Vortex's `LayoutReader::pruning_evaluation` when given any filter | -| Hive partitioning | ✅ (free) | `UnifiedScanArgs::hive_options` | -| Schema evolution | ✅ (free) | `UnifiedScanArgs::{cast_columns_policy, missing_columns_policy, extra_columns_policy}` | -| Row index | ✅ (free) | `UnifiedScanArgs::row_index` (attached post-decode) | +| Pushdown | Status | Path | +| --------------------------------------------------------------- | ------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| Projection (column subset) | ✅ | `polars Projection` → `vortex::expr::pack(get_item(...))` | +| Slice (positive) | ✅ | `Slice::Positive` → `ScanBuilder::with_row_range` | +| Slice (negative, e.g. `.tail(N)`) | ✅ | `restrict_to_bounds(row_count)` (footer-cached row count) | +| Filter — `==` / `!=` / `<` / `<=` / `>` / `>=` | ✅ | AExpr-direct convertor → `eq`/`not_eq`/`lt`/`lt_eq`/`gt`/`gt_eq` (pairwise-PType gate) | +| Filter — `and` / `or` / `not` | ✅ | AExpr-direct convertor → `and`/`or`/`not` (bitwise-vs-logical schema gate) | +| Filter — `is_null` / `is_not_null` | ✅ | AExpr-direct convertor → `is_null`/`is_not_null` | +| Filter — arithmetic (`col + 1 > 5`) | ✅ | AExpr-direct convertor → `checked_add` (numeric + pairwise-PType gate) | +| Filter — `CAST(col, target)` (Strict, same-kind) | ✅ | AExpr-direct convertor → `cast` (Primitive↔Primitive, Bool↔Bool, Utf8↔Utf8) | +| Filter — struct field access | ✅ | AExpr-direct convertor → `get_item(field, struct_expr)` (schema-membership gate) | +| Filter — `is_between(lo, hi)` | ✅ | AExpr-direct convertor (PR-2.7) → `(col gt[_eq] lo) AND (col lt[_eq] hi)` per closed (schema-gated pairwise-PType) | +| Filter — `is_in([...])` | ✅ | AExpr-direct convertor (PR-2.7) → OR of equalities; refuses if `nulls_equal=true` + nulls present OR any haystack scalar fails Polars→Vortex conversion | +| Filter — `starts_with` / `ends_with` / `contains(literal=True)` | ✅ | AExpr-direct convertor (PR-2.7) → `like(col, "prefix%" / "%suffix" / "%sub%")` (schema-gated Utf8 input); refuses if needle contains LIKE wildcards (% _ \\) | +| Filter — `when(...).then(...).otherwise(...)` (Ternary) | ✅ | AExpr-direct convertor (PR-2.7) → `case_when(cond, then, else)` (schema-gated THEN/ELSE pairwise-dtype) | +| Filter — temporal extracts (`col.dt.year()`) | ❌ residual | Deferred (PR-2.5 slip) — Vortex `datetime_parts` op unavailable at pinned SHA | +| Filter — non-Strict CAST (`NonStrict` / `Overflowing`) | ❌ residual | Polars overflow→null/wrap diverges from Vortex fail-on-overflow | +| Filter — cross-kind CAST (Primitive↔Utf8 etc.) | ❌ residual | Vortex per-array `CastKernel` is strictly within-kind | +| Filter — temporal scalar literals (Date/Datetime/Time) | ✅ (when `dtype-*` features on) | Vortex Date/Time/Timestamp extension scalars | +| Filter — `Decimal` scalar literals | ✅ (when `dtype-decimal` on) | Vortex `DecimalValue::I128` | +| Filter — `Duration` scalars | ❌ residual | Vortex has no Duration extension dtype yet | +| Filter — regex | ❌ residual | Vortex `like` doesn't do regex | +| Zone-level pruning | ✅ | Vortex's `LayoutReader::pruning_evaluation` when given any filter | +| Hive partitioning | ✅ (free) | `UnifiedScanArgs::hive_options` | +| Schema evolution | ✅ (free) | `UnifiedScanArgs::{cast_columns_policy, missing_columns_policy, extra_columns_policy}` | +| Row index | ✅ (free) | `UnifiedScanArgs::row_index` (attached post-decode) | Residual predicates are always re-applied by Polars' multi-scan layer post-decode — partial pushdown is correct, never _less correct_ than no pushdown. @@ -245,7 +279,7 @@ crates/polars-vortex/ │ ├── options.rs # VortexScanOptions, VortexCacheMode (with resolve()) │ ├── schema.rs # Vortex DType → polars-arrow ArrowSchema walker │ ├── read_at.rs # PolarsInstrumentedVortexReadAt decorator - │ ├── predicate.rs # ColumnPredicates → Vortex Expression + │ ├── predicate.rs # polars_scalar_to_vortex (Scalar → VortexScalar) │ └── array_bridge.rs # upstream RecordBatch → DataFrame via C-ABI └── write/ ├── mod.rs @@ -356,9 +390,12 @@ pl.set_vortex_cache_bytes(byte_budget: int) # 0 = disable; default 512 MiB - ✅ `pl.scan_vortex(...).tail(N).collect()` — negative slice pushed via the footer's row count - ✅ `df.write_vortex(path)` and `lf.sink_vortex(path)` (local AND `s3://`/`gs://`/`az://`) - ✅ `lf.sink_vortex(pl.PartitionBy("base/", by=[...]))` — partitioned writes -- ✅ Filter pushdown for Equal / Between / EqualOneOf / StartsWith / EndsWith, and (when built with - `dtype-date` / `dtype-datetime` / `dtype-time` / `dtype-decimal`) for Date / Datetime / Time / - Decimal scalar literals. +- ✅ Filter pushdown for the comparison operators (`==`, `!=`, `<`, `<=`, `>`, `>=`), boolean + combinators (`and`/`or`/`not`), null checks (`is_null`/`is_not_null`), Plus arithmetic + (`col + 1`), same-kind `CAST` (Strict only), and struct field access — all via the AExpr-direct + convertor (`polars-plan/src/plans/aexpr/predicates/vortex_convertor.rs`). Multi-column predicates + compose freely. (When built with `dtype-date` / `dtype-datetime` / `dtype-time` / `dtype-decimal`, + Date / Datetime / Time / Decimal scalar literals work as comparison operands.) - ✅ Multiple Vortex files in one scan (multi-file glob) via `UnifiedScanArgs` - ✅ Hive partitioning on read via `UnifiedScanArgs::hive_options` - ✅ Process-global decompressed-segment cache with `pl.set_vortex_cache_bytes(N)` @@ -367,9 +404,14 @@ pl.set_vortex_cache_bytes(byte_budget: int) # 0 = disable; default 512 MiB ## Known limits / pending follow-ups -- **Aggressive predicate pushdown** (arithmetic, CAST, struct field access, temporal extracts) is - not yet wired — these stay as residual filters. Implementing them requires walking AExpr at - IR-build time instead of relying on `SpecializedColumnPredicate`. +- **Temporal extracts** (`col.dt.year()`, `col.dt.month()`, etc.) — Vortex 0.70.0 doesn't expose + `datetime_parts` / `year` / `month` / `day` builders in its public expr API. PR-2.5 slipped to + Deferred work; reasonable to revisit when upstream Vortex exposes them or when polars-vortex bumps + its pinned version. +- **`Duration` scalar literals** — Vortex has no Duration extension dtype. +- **Non-Strict CAST** (`CastOptions::NonStrict` / `Overflowing`) and **cross-kind CAST** + (Primitive↔Bool/Utf8) — Vortex's per-array `CastKernel` semantics don't match Polars's + silent-or-null overflow behavior; convertor refuses to preserve always-safe-fallback contract. - **File-level optimizer stats** (whole-file pruning at IR time) are not wired. Vortex's zone-level pruning already runs inside the scan, so this is a modest optimization — useful mainly for very large multi-file scans where opening every file at IR time is acceptable. diff --git a/crates/polars-vortex/src/lib.rs b/crates/polars-vortex/src/lib.rs index e84879389e12..76a9cd875cbd 100644 --- a/crates/polars-vortex/src/lib.rs +++ b/crates/polars-vortex/src/lib.rs @@ -13,12 +13,16 @@ pub mod write; /// Re-exports of upstream Vortex types/macros used across polars-stream's Vortex source/sink, /// polars-plan's IR conversion, and polars-vortex tests. **Invariant**: anything outside -/// `vortex::{array, error, file, io, layout}` fails to compile — narrowed from `pub use -/// ::vortex;` in PR-1.4 so the BAN against new `vortex`-internal symbol use outside the bridge -/// files is machine-checkable. Re-run the actual-use audit (grep for `polars_vortex::vortex::`) -/// before adding a new sub-module to this list. +/// `vortex::{array, dtype, error, expr, file, io, layout}` fails to compile — narrowed from +/// `pub use ::vortex;` in PR-1.4 so the BAN against new `vortex`-internal symbol use outside +/// the bridge files is machine-checkable. Re-run the actual-use audit (grep for +/// `polars_vortex::vortex::`) before adding a new sub-module to this list. +/// +/// `dtype` added in PR-2.3 so the AExpr-direct convertor's CAST arm can build target +/// `vortex::dtype::DType` values without going through the polars-arrow→upstream-arrow +/// schema bridge (overkill for a single dtype). See `vortex_convertor::polars_dtype_to_vortex_dtype`. pub mod vortex { - pub use ::vortex::{array, error, file, io, layout}; + pub use ::vortex::{array, dtype, error, expr, file, io, layout}; } pub use read::options::{VortexCacheMode, VortexScanOptions}; pub use session::session; diff --git a/crates/polars-vortex/src/read/predicate.rs b/crates/polars-vortex/src/read/predicate.rs index 78c2683fdcf0..608847076585 100644 --- a/crates/polars-vortex/src/read/predicate.rs +++ b/crates/polars-vortex/src/read/predicate.rs @@ -1,148 +1,68 @@ -//! Polars predicate → Vortex `Expression` convertor (filter pushdown). +//! Polars `Scalar` → Vortex `VortexScalar` conversion (filter pushdown literal helper). //! -//! **TODO(PR-2.6)**: this entire fast path is SCAFFOLDING for the PR-13 Option B → A -//! trajectory. The AExpr-direct convertor at `read/aexpr_predicate.rs` (introduced in PR-2.1 -//! through PR-2.5) supersedes this `SpecializedColumnPredicate`-based fast path; PR-2.6 deletes -//! this file (or reduces it to scalar+LIKE helpers absorbed into `aexpr_predicate.rs`) and -//! switches the call site at `crates/polars-stream/src/nodes/io_sources/vortex/mod.rs:242` to -//! the AExpr-direct path. Tracked: `.big-plans/vortex-integration.md` PR-2.6 row + Accepted -//! tradeoffs entry on the SpecializedColumnPredicate fast path. +//! **PR-2.6 Option B → A cutover**: this module USED to host the legacy +//! `SpecializedColumnPredicate`-derived filter-pushdown path +//! (`polars_to_vortex_predicate`, `convert_specialized`, `bytes_to_like_literal` for +//! LIKE prefix/suffix). PR-2.6 deletes that path entirely — the AExpr-direct convertor +//! at `polars_plan::plans::predicates::vortex_convertor::aexpr_to_vortex_expression` +//! (the `aexpr` module is `pub(crate)`; the externally-resolvable path goes through +//! the `pub use aexpr::*` re-export at `plans/mod.rs`; introduced in PR-2.1, wired at +//! `polars-stream/src/physical_plan/lower_ir.rs` in PR-2.2) is the sole filter-pushdown +//! path going forward. //! -//! We translate the structured pieces of [`polars_io::predicates::ScanIOPredicate`] into a -//! Vortex `Expression` to hand to `ScanBuilder::with_filter`. What we can't translate stays -//! as a residual filter, which the multi-scan layer applies post-decode (the streaming -//! reader advertises `PARTIAL_FILTER` capability so the multi-scan layer knows to keep the -//! original predicate around). +//! ## Coverage parity with the deleted legacy path //! -//! ## What we translate +//! As of PR-2.7 + PR-2.8 (Phase 2 amend), the AExpr-direct convertor handles: +//! - Scalar comparisons (Eq / NotEq / Lt / LtEq / Gt / GtEq) — direct mapping +//! - Boolean combinators (And / Or / Not / IsNull / IsNotNull) — direct mapping +//! - Plus arithmetic (numeric, same-PType only) +//! - CAST (same-kind: Primitive↔Primitive, Bool↔Bool, Utf8↔Utf8; Strict only) +//! - Struct field access (`col.struct.field("inner")`) +//! - `is_between(lo, hi)` (`AExpr::Function::Boolean(IsBetween)`) — PR-2.7 +//! - `is_in([...])` (`AExpr::Function::Boolean(IsIn)`) — PR-2.7 +//! - `str.starts_with(prefix)` / `str.ends_with(suffix)` (`AExpr::Function::StringExpr(...)`) — PR-2.7 +//! - `str.contains(needle, literal=True)` (maps to `like(col, "%needle%")`) — PR-2.7 +//! - `Ternary { predicate, then, otherwise }` (maps to `case_when([(p, t)], Some(o))`) — PR-2.7 +//! - Per-column file-vs-virtual minterm split for hive / row_index / include_file_paths +//! (helper `aexpr_file_minterms_to_vortex_expression`) — PR-2.8 +//! - Multi-column predicates (everything composable via the above) //! -//! The high-leverage path is [`ColumnPredicates`]: the Polars optimizer already extracts -//! single-column predicates into [`SpecializedColumnPredicate`] variants (Equal / Between / -//! EqualOneOf / StartsWith / EndsWith / RegexMatch). Each maps to a Vortex expression node -//! cleanly. We collect all per-column predicates and `and`-collect them into a single -//! filter. +//! Remaining residual shapes (perf-only — correctness preserved via `PARTIAL_FILTER`): +//! - Non-Strict CAST + cross-kind CAST (refused at the CAST arm's kind gate) +//! - `str.contains(needle, literal=False)` (regex; refused at the StringExpr arm) +//! - Temporal extracts (`col.dt.year()` etc.) — Vortex 0.70.0 doesn't expose the builders; +//! tracked at `.big-plans/vortex-integration.md` Deferred entry "Temporal-extract +//! predicate pushdown" +//! - Float16 — neither `polars_dtype_to_vortex_dtype` nor `is_vortex_numeric_dtype` +//! handles `F16` yet; tracked as Deferred //! -//! Multi-column predicates, arithmetic, struct field access, and `RegexMatch` stay as -//! residual for now (tracked under PR-13 — aggressive AExpr-based pushdown). They're -//! correct because the multi-scan layer always applies the original -//! `predicate.predicate` post-decode. +//! ## What this module still does +//! +//! Hosts [`polars_scalar_to_vortex`] — the canonical `polars_core::scalar::Scalar` → +//! [`VortexScalar`] mapping. The AExpr-direct convertor calls into this for +//! `AExpr::Literal(LiteralValue::Scalar(s))` shapes (single source of truth for the +//! `AnyValue` → `VortexScalar` mapping; same `pub` visibility established in PR-2.1). +//! Temporal (Date / Datetime / Time) and Decimal arms live in the `temporal` submodule +//! so they can be feature-gated cleanly on polars-core's `dtype-*` features. use polars_core::prelude::AnyValue; -use polars_io::predicates::{ScanIOPredicate, SpecializedColumnPredicate}; -use polars_utils::pl_str::PlSmallStr; use vortex::array::scalar::Scalar as VortexScalar; use vortex::dtype::Nullability; -use vortex::expr::{ - Expression, and_collect, eq, get_item, gt_eq, like, lit, lt_eq, or_collect, root, -}; -/// Convert what we can of `scan_predicate` into a single Vortex filter expression. The -/// returned expression should be passed to `ScanBuilder::with_filter`; the multi-scan -/// layer is responsible for the residual (full `predicate.predicate` is re-applied to -/// emitted morsels). -/// -/// Returns `None` when nothing pushable was found. +/// Convert a Polars [`polars_core::scalar::Scalar`] into a Vortex [`VortexScalar`]. /// -/// Conjuncts are emitted in column-name sorted order so the resulting Vortex -/// `Expression` is deterministic — `ColumnPredicates::predicates` is a hash map -/// whose iteration order varies, and Vortex's pruning evaluator may short-circuit -/// left-to-right, so the order matters for both reproducibility and (potentially) -/// pruning effectiveness. -pub fn polars_to_vortex_predicate(scan_predicate: &ScanIOPredicate) -> Option { - let mut per_column_pairs: Vec<(&PlSmallStr, &SpecializedColumnPredicate)> = scan_predicate - .column_predicates - .predicates - .iter() - .filter_map(|(name, (_, specialized_opt))| specialized_opt.as_ref().map(|s| (name, s))) - .collect(); - per_column_pairs.sort_by_key(|(name, _)| name.as_str()); - - let per_column: Vec = per_column_pairs - .into_iter() - .filter_map(|(name, specialized)| convert_specialized(name, specialized)) - .collect(); - and_collect(per_column) -} - -fn convert_specialized( - column_name: &PlSmallStr, - specialized: &SpecializedColumnPredicate, -) -> Option { - let col = get_item(column_name.as_str(), root()); - - Some(match specialized { - SpecializedColumnPredicate::Equal(scalar) => eq(col, lit(polars_scalar_to_vortex(scalar)?)), - SpecializedColumnPredicate::Between(low, high) => { - let lo = lit(polars_scalar_to_vortex(low)?); - let hi = lit(polars_scalar_to_vortex(high)?); - // Closed range: low <= col <= high. - vortex::expr::and(gt_eq(col.clone(), lo), lt_eq(col, hi)) - }, - SpecializedColumnPredicate::EqualOneOf(scalars) => { - let terms: Vec = scalars - .iter() - .filter_map(|s| Some(eq(col.clone(), lit(polars_scalar_to_vortex(s)?)))) - .collect(); - // If every scalar in the IN-list converted, push the OR; if some failed we - // could still push the partial set + leave a residual, but for safety we - // require all-or-nothing here (otherwise the pushed filter is *narrower* - // than the user's actual predicate, which would drop rows incorrectly). - if terms.len() != scalars.len() { - return None; - } - or_collect(terms)? - }, - SpecializedColumnPredicate::StartsWith(bytes) => { - let prefix = bytes_to_like_literal(bytes)?; - // `prefix%` - let pattern = format!("{prefix}%"); - like( - col, - lit(VortexScalar::utf8(pattern, Nullability::NonNullable)), - ) - }, - SpecializedColumnPredicate::EndsWith(bytes) => { - let suffix = bytes_to_like_literal(bytes)?; - // `%suffix` - let pattern = format!("%{suffix}"); - like( - col, - lit(VortexScalar::utf8(pattern, Nullability::NonNullable)), - ) - }, - // No native regex in Vortex's `like`; let the multi-scan residual handle it. - SpecializedColumnPredicate::RegexMatch(_) => return None, - }) -} - -/// Validate that `bytes` is valid UTF-8 and free of SQL-LIKE special characters -/// (`%`, `_`, `\`). Returns the borrowed `&str` so the caller can build a pattern. -/// Returning `None` falls back to the residual filter, which is always correct. -/// -/// We refuse pushdown when the bytes contain `%` or `_` because LIKE would interpret -/// those as wildcards, *widening* the predicate. Widening is still correct (the -/// multi-scan residual filter trims the extra rows), but it defeats the perf win -/// of pushdown — so we'd rather not push than push wastefully. Backslash is the -/// LIKE escape character; same reasoning. -fn bytes_to_like_literal(bytes: &[u8]) -> Option<&str> { - let s = std::str::from_utf8(bytes).ok()?; - if s.contains('%') || s.contains('_') || s.contains('\\') { - return None; - } - Some(s) -} - -/// Convert a Polars `Scalar` into a Vortex `Scalar` for the common types. -/// Returns `None` for variants we don't yet translate (extension types we don't have -/// a Vortex analogue for, nested types, etc.) — the caller treats this as "not -/// pushable" and falls back to the residual filter. +/// Returns `None` for variants we don't yet translate (Duration — no Vortex extension +/// dtype analogue; nested types; extension dtypes without a Vortex equivalent). The +/// AExpr-direct convertor's `?`-propagation drops the enclosing predicate to residual on +/// `None`. /// -/// Scalars are constructed with `Nullability::Nullable` since the optimizer's -/// `SpecializedColumnPredicate` doesn't carry the column's nullability. Vortex's -/// type system unifies nullability when comparing against a `NonNullable` column, -/// so this is correct but may reduce pruning effectiveness if Vortex's pruning -/// evaluator is stricter than its comparison evaluator. -fn polars_scalar_to_vortex(scalar: &polars_core::scalar::Scalar) -> Option { +/// Scalars are constructed with `Nullability::Nullable`. The AExpr-direct convertor's +/// caller (`AExpr::Literal(LiteralValue::Scalar(_))`) doesn't track which column the +/// literal will be compared against, so we cannot specialize on the column's +/// nullability. Vortex's type system unifies nullability when comparing against a +/// `NonNullable` column, so this is correct but may reduce pruning effectiveness if +/// Vortex's pruning evaluator is stricter than its comparison evaluator. +pub fn polars_scalar_to_vortex(scalar: &polars_core::scalar::Scalar) -> Option { let nul = Nullability::Nullable; Some(match scalar.value() { AnyValue::Null => return None, // Vortex `null` requires a known DType; skip for now. @@ -267,29 +187,6 @@ mod temporal { mod tests { use super::*; - #[test] - fn like_literal_safe_strings() { - assert_eq!(bytes_to_like_literal(b"hello"), Some("hello")); - assert_eq!(bytes_to_like_literal(b""), Some("")); - assert_eq!(bytes_to_like_literal(b"a.b-c@d"), Some("a.b-c@d")); - } - - #[test] - fn like_literal_refuses_wildcards() { - // SQL-LIKE special chars must NOT be pushed — they'd widen the predicate. - assert_eq!(bytes_to_like_literal(b"hello%world"), None); - assert_eq!(bytes_to_like_literal(b"foo_bar"), None); - assert_eq!(bytes_to_like_literal(b"a\\b"), None); - assert_eq!(bytes_to_like_literal(b"%"), None); - assert_eq!(bytes_to_like_literal(b"_"), None); - assert_eq!(bytes_to_like_literal(b"\\"), None); - } - - #[test] - fn like_literal_refuses_invalid_utf8() { - assert_eq!(bytes_to_like_literal(&[0xff, 0xfe]), None); - } - #[test] fn scalar_primitive_types_convert() { use polars_core::prelude::DataType; @@ -382,11 +279,11 @@ mod tests { #[cfg(feature = "dtype-decimal")] #[test] - fn decimal_scalar_roundtrips_precision_and_scale() { + fn decimal_scalar_round_trip() { use vortex::dtype::DType; - + // Polars Decimal(10, 2) — small precision/scale, fits in u8/i8 easily. let s = - temporal::decimal_scalar(12_345, 10, 2, Nullability::Nullable).expect("decimal scalar"); + temporal::decimal_scalar(12345, 10, 2, Nullability::Nullable).expect("decimal scalar"); match s.dtype() { DType::Decimal(dec, _) => { assert_eq!(dec.precision(), 10); @@ -398,112 +295,9 @@ mod tests { #[cfg(feature = "dtype-decimal")] #[test] - fn decimal_scalar_rejects_overflowing_precision_and_scale() { - // u8::MAX is 255; usize::try_into:: fails for 256. + fn decimal_scalar_overflow_returns_none() { + // precision > 38 OR scale > i8::MAX would overflow the Vortex types; refuse. assert!(temporal::decimal_scalar(0, 256, 0, Nullability::Nullable).is_none()); - // i8::MAX is 127; usize::try_into:: fails for 128. - assert!(temporal::decimal_scalar(0, 10, 128, Nullability::Nullable).is_none()); - } - - #[cfg(feature = "dtype-date")] - #[test] - fn convertor_returns_pushable_for_date_predicate() { - use polars_core::prelude::DataType; - use polars_core::scalar::Scalar; - use polars_io::predicates::SpecializedColumnPredicate; - - let scalar = Scalar::new(DataType::Date, AnyValue::Date(19_000)); - let pred = SpecializedColumnPredicate::Equal(scalar); - let expr = convert_specialized(&"d".into(), &pred); - assert!( - expr.is_some(), - "Date equality should be pushable when dtype-date is on" - ); - } - - // ======================================================================== - // Per-variant pushdown-engagement tests: each `SpecializedColumnPredicate` - // variant we claim to support should produce a non-None Vortex expression. - // ======================================================================== - - fn int32_scalar(v: i32) -> polars_core::scalar::Scalar { - use polars_core::prelude::DataType; - use polars_core::scalar::Scalar; - Scalar::new(DataType::Int32, AnyValue::Int32(v)) - } - - #[test] - fn equal_predicate_is_pushable() { - use polars_io::predicates::SpecializedColumnPredicate; - let pred = SpecializedColumnPredicate::Equal(int32_scalar(42)); - assert!(convert_specialized(&"a".into(), &pred).is_some()); - } - - #[test] - fn between_predicate_is_pushable() { - use polars_io::predicates::SpecializedColumnPredicate; - let pred = SpecializedColumnPredicate::Between(int32_scalar(1), int32_scalar(10)); - assert!(convert_specialized(&"a".into(), &pred).is_some()); - } - - #[test] - fn equal_one_of_predicate_is_pushable() { - use polars_io::predicates::SpecializedColumnPredicate; - let pred = SpecializedColumnPredicate::EqualOneOf( - vec![int32_scalar(1), int32_scalar(2), int32_scalar(3)].into_boxed_slice(), - ); - assert!(convert_specialized(&"a".into(), &pred).is_some()); - } - - #[test] - fn starts_with_predicate_is_pushable_for_safe_bytes() { - use polars_io::predicates::SpecializedColumnPredicate; - let pred = SpecializedColumnPredicate::StartsWith(b"hello".to_vec().into_boxed_slice()); - assert!(convert_specialized(&"s".into(), &pred).is_some()); - } - - #[test] - fn starts_with_predicate_refuses_unsafe_bytes() { - // Wildcard chars trigger the safety check — return None so the residual - // filter handles it correctly. - use polars_io::predicates::SpecializedColumnPredicate; - let pred = SpecializedColumnPredicate::StartsWith(b"hello%".to_vec().into_boxed_slice()); - assert!(convert_specialized(&"s".into(), &pred).is_none()); - } - - #[test] - fn ends_with_predicate_is_pushable_for_safe_bytes() { - use polars_io::predicates::SpecializedColumnPredicate; - let pred = SpecializedColumnPredicate::EndsWith(b"world".to_vec().into_boxed_slice()); - assert!(convert_specialized(&"s".into(), &pred).is_some()); - } - - #[test] - fn regex_match_predicate_falls_back_to_residual() { - // RegexMatch is documented as residual-only. - use polars_io::predicates::SpecializedColumnPredicate; - let regex = regex::bytes::Regex::new("^foo").unwrap(); - let pred = SpecializedColumnPredicate::RegexMatch(regex); - assert!(convert_specialized(&"s".into(), &pred).is_none()); - } - - #[test] - fn equal_one_of_with_partial_failure_returns_none() { - // Documented behavior: if any scalar in the IN-list fails to convert - // (e.g., AnyValue::Null), the whole predicate falls back to residual - // — pushing a partial set would be narrower than the user's actual - // predicate, which would silently drop rows. - use polars_core::prelude::DataType; - use polars_core::scalar::Scalar; - use polars_io::predicates::SpecializedColumnPredicate; - let pred = SpecializedColumnPredicate::EqualOneOf( - vec![ - int32_scalar(1), - Scalar::new(DataType::Int32, AnyValue::Null), - int32_scalar(3), - ] - .into_boxed_slice(), - ); - assert!(convert_specialized(&"a".into(), &pred).is_none()); + assert!(temporal::decimal_scalar(0, 0, 256, Nullability::Nullable).is_none()); } } diff --git a/py-polars/tests/unit/io/test_vortex.py b/py-polars/tests/unit/io/test_vortex.py index cd99d194bd77..840436c39b5a 100644 --- a/py-polars/tests/unit/io/test_vortex.py +++ b/py-polars/tests/unit/io/test_vortex.py @@ -113,6 +113,180 @@ def test_scan_with_filter(tmp_path: Path) -> None: assert out["a"].to_list() == [15, 16, 17, 18, 19] +def test_scan_with_arithmetic_filter(tmp_path: Path) -> None: + """PR-13.2 acceptance: ``col + 1 == 5`` pushes down via the AExpr convertor. + + The legacy ``SpecializedColumnPredicate``-derived path cannot represent + arithmetic on a column reference — only literal comparisons / IN-lists / + range. PR-13.2 wires the convertor at ``physical_plan::lower_ir`` so the + Vortex source receives a real ``a + 1 == 5`` Vortex expression. We assert + correctness here (Polars reapplies post-decode regardless, so any drop-rows + bug would be the *more* dangerous failure mode; a pushdown-not-applied + regression would manifest as slower but still-correct results). + """ + path = tmp_path / "arith_filter.vortex" + df = pl.DataFrame({"a": list(range(20)), "b": [str(i) for i in range(20)]}) + df.write_vortex(path) + + out = pl.scan_vortex(path).filter(pl.col("a") + 1 == 5).collect() + assert out.shape == (1, 2) + assert out["a"].to_list() == [4] + assert out["b"].to_list() == ["4"] + + +def test_scan_with_cast_filter(tmp_path: Path) -> None: + """PR-13.3 acceptance: ``col.cast(Int64) > 100`` pushes down via the CAST arm. + + Convertor maps `AExpr::Cast { dtype: Int64, options: Strict }` → + `vortex::expr::cast(child, DType::Primitive(I64, Nullable))` ONLY when the + source dtype is in the same Vortex kind (Primitive↔Primitive, here Int32 + → Int64 is Primitive→Primitive). The legacy `SpecializedColumnPredicate` + fast path cannot represent a CAST on the column side, so without PR-2.3 + this would fall back to no-pushdown. + """ + path = tmp_path / "cast_filter.vortex" + df = pl.DataFrame({"a": pl.Series([1, 50, 101, 200], dtype=pl.Int32)}) + df.write_vortex(path) + + out = pl.scan_vortex(path).filter(pl.col("a").cast(pl.Int64) > 100).collect() + assert out.shape == (2, 1) + assert out["a"].to_list() == [101, 200] + + +def test_scan_with_struct_field_filter(tmp_path: Path) -> None: + """PR-13.4 acceptance: struct field access pushes down via the StructField arm. + + Convertor maps `AExpr::Function { StructExpr(FieldByName("inner")), .. }` → + `vortex::expr::get_item("inner", inner_struct_expr)`. The legacy + `SpecializedColumnPredicate` fast path cannot represent struct field access + on the column side. Schema-membership gate refuses pushdown when the field + doesn't exist in the struct's dtype (else Vortex's `GetItem.return_dtype` + `vortex_err!`s at scan-time). + """ + path = tmp_path / "struct_filter.vortex" + df = pl.DataFrame( + { + "s": [ + {"inner": "a", "count": 1}, + {"inner": "x", "count": 2}, + {"inner": "x", "count": 3}, + {"inner": "z", "count": 4}, + ] + } + ) + df.write_vortex(path) + + out = ( + pl.scan_vortex(path).filter(pl.col("s").struct.field("inner") == "x").collect() + ) + assert out.shape == (2, 1) + assert out["s"].struct.field("count").to_list() == [2, 3] + + +def test_scan_with_cross_kind_cast_filter(tmp_path: Path) -> None: + """PR-2.3 cycle-1 must-fix: cross-kind CAST (Primitive → Utf8) must not crash. + + Pre-fix: convertor emitted `cast(get_item("a", root()), DType::Utf8(...))`, + which Vortex's `Primitive::CastKernel` doesn't handle (returns + `Ok(None)` for non-Primitive targets), causing `cast/mod.rs:120` to + `vortex_bail!("No CastKernel ...")` at scan-time — propagating as a + hard `ComputeError`. + + Post-fix: `cast_kind_compatible` refuses the convertor pushdown so the + legacy `polars_to_vortex_predicate` fallback handles the predicate + (which also can't represent the cast — falls through to no-pushdown). + Polars post-decode reapply produces the correct results. + + A regression dropping the source-dtype-kind gate would surface here as + a Vortex scan-time error. + """ + path = tmp_path / "cross_kind_cast.vortex" + df = pl.DataFrame({"a": pl.Series([1, 50, 101, 200], dtype=pl.Int32)}) + df.write_vortex(path) + + # Int32 → String CAST then string equality. Must not crash; Polars + # post-decode handles correctly. + out = pl.scan_vortex(path).filter(pl.col("a").cast(pl.String) == "101").collect() + assert out.shape == (1, 1) + assert out["a"].to_list() == [101] + + +def test_scan_with_hive_partitioning_and_filter(tmp_path: Path) -> None: + """PR-2.2 cycle-1 M1 regression: hive scan + hive-only filter must not crash. + + Without protection, the convertor at ``lower_ir.rs`` would emit a Vortex + ``get_item('year', root())`` reference to a column that doesn't exist in + the per-file Vortex data (``year`` is a HIVE virtual column, synthesized + after decode from the directory structure). + + Protection mechanism (PR-2.8 update): the helper + ``aexpr_file_minterms_to_vortex_expression`` walks top-level conjuncts + via ``MintermIter`` and drops minterms whose leaves are in + ``virtual_cols`` (built from ``hive_parts.schema()`` + ``row_index.name`` + + ``include_file_paths``). The single minterm ``year == 2024`` references + only ``year`` (a hive virtual col), so the helper drops it; + ``and_collect(vec![])`` returns ``None``; pushdown is refused; Polars' + hive-partition pruning + multi-scan ``PARTIAL_FILTER`` reapply produces + the correct result. + + Pre-PR-2.8 mechanism: ``lower_ir.rs`` had an all-or-nothing virtual-col + guard that refused the WHOLE predicate when ``hive_parts.is_some()``. + The guard was REPLACED by PR-2.8's per-minterm split (which is + strictly-better — see ``test_scan_with_hive_and_file_col_mixed_filter`` + for the mixed-shape case it now handles). + + A regression where the per-column split mis-classified ``year`` as a + file column would surface as a ``ComputeError`` from Vortex bailing on + a missing column 'year'. + """ + (tmp_path / "year=2024").mkdir() + (tmp_path / "year=2025").mkdir() + pl.DataFrame({"x": [1, 2, 3]}).write_vortex(tmp_path / "year=2024" / "data.vortex") + pl.DataFrame({"x": [4, 5, 6]}).write_vortex(tmp_path / "year=2025" / "data.vortex") + + out = ( + pl.scan_vortex(tmp_path / "**/*.vortex", hive_partitioning=True) + .filter(pl.col("year") == 2024) + .collect() + ) + assert out.shape == (3, 2) + assert sorted(out["x"].to_list()) == [1, 2, 3] + assert out["year"].unique().to_list() == [2024] + + +def test_scan_with_row_index_and_filter(tmp_path: Path) -> None: + """PR-2.2 cycle-2 C2-001 regression: row_index virtual col + filter must not crash. + + ``row_index_name`` synthesizes ``ri`` as a virtual column after decode; + ``ri`` is not present in the Vortex file's data. Without protection, + the convertor would emit a Vortex ``get_item('ri', root())`` reference + that Vortex can't resolve. + + Protection mechanism (PR-2.8 update): the single minterm ``ri > 10`` + references only ``ri`` (a row_index virtual col, included in + ``virtual_cols`` at ``lower_ir.rs``); the helper + ``aexpr_file_minterms_to_vortex_expression`` drops the minterm; + ``and_collect(vec![])`` returns ``None``; pushdown is refused; Polars' + row-index materialization + multi-scan ``PARTIAL_FILTER`` reapply + produces the correct result. + + Pre-PR-2.8 mechanism: ``lower_ir.rs`` had an extended all-or-nothing + guard that refused the WHOLE predicate when ``row_index.is_some()`` OR + ``include_file_paths.is_some()``. The guard was REPLACED by PR-2.8's + per-minterm split (see ``test_scan_with_row_index_and_file_col_mixed_filter`` + for the mixed-shape case it now handles). + + A regression where the per-column split mis-classified ``ri`` as a file + column would surface as a ``ComputeError`` from Vortex bailing on + missing column 'ri'. + """ + path = tmp_path / "ri.vortex" + pl.DataFrame({"x": list(range(20))}).write_vortex(path) + + out = pl.scan_vortex(path, row_index_name="ri").filter(pl.col("ri") > 10).collect() + assert out["ri"].to_list() == list(range(11, 20)) + + def test_scan_with_projection(tmp_path: Path) -> None: path = tmp_path / "proj.vortex" df = pl.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]}) @@ -275,3 +449,263 @@ def test_multifile_scan_missing_columns_raise(tmp_path: Path) -> None: with pytest.raises(pl.exceptions.PolarsError): pl.scan_vortex([a, b], missing_columns="raise").collect() +# === PR-2.7 amend: cutover-lost pushdown shapes === +# +# Each test exercises a shape that the PR-2.6 cutover removed from the +# pushdown path. Correctness via Polars's post-decode reapply is invariant +# (PARTIAL_FILTER capability), so a "pushdown not applied" regression would +# manifest as slower-but-correct results — these tests catch the more +# dangerous failure mode: pushdown emitting a Vortex expression that +# silently drops or duplicates rows. Engagement verification is implicit +# (the bench harness from PR-1.5 measures wall-clock). + + +def test_scan_with_is_between_filter(tmp_path: Path) -> None: + """PR-2.7 cycle 1: ``col.is_between(lo, hi)`` pushes down via is_between arm. + + Decomposed to ``(col >= lo) AND (col <= hi)``. + The legacy SpecializedColumnPredicate path handled this via + ``SpecializedColumnPredicate::Between``; PR-2.6 cutover removed it. + The new convertor arm re-establishes pushdown by decomposing to a + Vortex ``and(gt_eq, lt_eq)`` (or strict variants per ``closed``). + """ + path = tmp_path / "between_filter.vortex" + df = pl.DataFrame({"a": list(range(20))}) + df.write_vortex(path) + + out = pl.scan_vortex(path).filter(pl.col("a").is_between(5, 10)).collect() + # Default closed="both" → 5, 6, 7, 8, 9, 10 + assert out["a"].to_list() == [5, 6, 7, 8, 9, 10] + + +def test_scan_with_is_between_left_closed_filter(tmp_path: Path) -> None: + """PR-2.7 cycle 1: ``is_between`` with non-default closed kwarg. + + ClosedInterval::Left → ``(col >= lo) AND (col < hi)``. Verifies the + closed-interval-variant mapping isn't off-by-one. + """ + path = tmp_path / "between_left_filter.vortex" + df = pl.DataFrame({"a": list(range(20))}) + df.write_vortex(path) + + out = ( + pl.scan_vortex(path) + .filter(pl.col("a").is_between(5, 10, closed="left")) + .collect() + ) + # closed="left" → 5, 6, 7, 8, 9 (10 excluded) + assert out["a"].to_list() == [5, 6, 7, 8, 9] + + +def test_scan_with_is_in_filter(tmp_path: Path) -> None: + """PR-2.7 cycle 1: ``col.is_in([...])`` pushes down via the is_in arm. + + Decomposed to ``(col == v1) OR (col == v2) OR ...``. + Legacy ``SpecializedColumnPredicate::EqualOneOf`` handled this; PR-2.6 + cutover removed it. The new convertor arm reuses the polars-plan-internal + ``try_extract_is_in_haystack`` helper for haystack extraction (same code + path as the deleted SpecializedColumnPredicate route) so the + constant-eval / list-dispatch / null-drop logic stays consistent. + """ + path = tmp_path / "is_in_filter.vortex" + df = pl.DataFrame({"a": list(range(20))}) + df.write_vortex(path) + + out = pl.scan_vortex(path).filter(pl.col("a").is_in([1, 3, 5, 7])).collect() + assert out["a"].to_list() == [1, 3, 5, 7] + + +def test_scan_with_starts_with_filter(tmp_path: Path) -> None: + r"""PR-2.7 cycle 1: ``col.str.starts_with("prefix")`` pushes down. + + Via StringExpr arm, emitted as ``like(col, lit("prefix%"))``. + Legacy ``SpecializedColumnPredicate::StartsWith`` handled this; PR-2.6 + cutover removed it. The needle is escaped via ``bytes_to_like_literal`` + which refuses pushdown if the prefix contains LIKE wildcards (%, _, \\). + """ + path = tmp_path / "starts_with_filter.vortex" + df = pl.DataFrame({"s": ["apple", "apricot", "banana", "blueberry", "cherry"]}) + df.write_vortex(path) + + out = pl.scan_vortex(path).filter(pl.col("s").str.starts_with("ap")).collect() + assert out["s"].to_list() == ["apple", "apricot"] + + +def test_scan_with_ends_with_filter(tmp_path: Path) -> None: + """PR-2.7 cycle 1: ``col.str.ends_with("suffix")`` pushes down via StringExpr arm. + + Emitted as ``like(col, lit("%suffix"))``. + """ + path = tmp_path / "ends_with_filter.vortex" + df = pl.DataFrame({"s": ["apple", "pineapple", "banana", "grape"]}) + df.write_vortex(path) + + out = pl.scan_vortex(path).filter(pl.col("s").str.ends_with("apple")).collect() + # Both "apple" and "pineapple" end with "apple" + assert out["s"].to_list() == ["apple", "pineapple"] + + +def test_scan_with_contains_literal_filter(tmp_path: Path) -> None: + """PR-2.7 cycle 1: ``col.str.contains("sub", literal=True)`` pushes down. + + Via StringExpr arm, emitted as ``like(col, lit("%sub%"))``. + ``literal=False`` (regex mode) is REFUSED — Vortex's LIKE doesn't + support regex; the residual filter reapplies post-decode for correctness. + Tested implicitly by the unit-level + ``shape_contains_literal_false_returns_none`` test. + """ + path = tmp_path / "contains_filter.vortex" + df = pl.DataFrame( + {"s": ["hello world", "good morning", "world peace", "morning sun"]} + ) + df.write_vortex(path) + + out = ( + pl.scan_vortex(path) + .filter(pl.col("s").str.contains("world", literal=True)) + .collect() + ) + assert out["s"].to_list() == ["hello world", "world peace"] + + +def test_scan_with_ternary_filter(tmp_path: Path) -> None: + """PR-2.7 cycle 1: ``pl.when(...).then(...).otherwise(...)`` pushes down. + + Via Ternary arm, emitted as Vortex ``case_when(condition, then_value, else_value)``. + The Ternary returns a Boolean expression usable as a filter predicate. + Here: when ``a > 5``, push down ``a < 15``; otherwise emit False. + Effective predicate: ``5 < a < 15``. + """ + path = tmp_path / "ternary_filter.vortex" + df = pl.DataFrame({"a": list(range(20))}) + df.write_vortex(path) + + out = ( + pl.scan_vortex(path) + .filter( + pl.when(pl.col("a") > 5).then(pl.col("a") < 15).otherwise(False) + ) + .collect() + ) + # 5 < a < 15 → 6, 7, 8, 9, 10, 11, 12, 13, 14 + assert out["a"].to_list() == [6, 7, 8, 9, 10, 11, 12, 13, 14] + + +def test_scan_with_hive_and_file_col_mixed_filter(tmp_path: Path) -> None: + """PR-2.8: hive scan with mixed file-col + hive-col filter pushes file part. + + The predicate is split per-minterm; the file-col conjunct pushes to Vortex + while the hive-col conjunct stays for Polars' multi-scan reapply. + Pre-PR-2.8: the convertor's virtual-column guard at lower_ir.rs:801-815 + refused convertor pushdown ENTIRELY when hive_parts.is_some(); correctness + held via PARTIAL_FILTER reapply but the file-col part missed Vortex zone + pruning. + + Post-PR-2.8: `aexpr_file_minterms_to_vortex_expression` walks top-level + conjuncts via MintermIter and converts only the file-only ones. The + `x > 5` minterm pushes to Vortex; the `year == 2024` minterm stays + residual and Polars' hive-partition pruning + multi-scan reapply handles + it. Result correctness is preserved either way; the test confirms the + pipeline doesn't crash and returns the right rows. + + A regression where the per-column split mis-classified a hive col as + file (and tried to push it to Vortex) would surface as a ``ComputeError`` + from Vortex bailing on missing column 'year'. + """ + (tmp_path / "year=2024").mkdir() + (tmp_path / "year=2025").mkdir() + pl.DataFrame({"x": [1, 3, 5, 7, 9]}).write_vortex(tmp_path / "year=2024" / "data.vortex") + pl.DataFrame({"x": [2, 4, 6, 8, 10]}).write_vortex(tmp_path / "year=2025" / "data.vortex") + + out = ( + pl.scan_vortex(tmp_path / "**/*.vortex", hive_partitioning=True) + .filter((pl.col("x") > 5) & (pl.col("year") == 2024)) + .collect() + ) + # x > 5 AND year == 2024 → from year=2024 dir: 7, 9 (1, 3, 5 are filtered) + assert out.shape == (2, 2) + assert sorted(out["x"].to_list()) == [7, 9] + assert out["year"].unique().to_list() == [2024] + + +def test_scan_with_row_index_and_file_col_mixed_filter(tmp_path: Path) -> None: + """PR-2.8 cycle 2: row_index virtual col + file col mixed filter pushes file part. + + The predicate is split per-minterm: ``x > 5`` pushes to Vortex; ``ri > 10`` + stays residual and Polars' row-index materialization + multi-scan reapply + handles it. + Pre-PR-2.8 behavior: virtual-column guard refused the WHOLE predicate + when ``row_index.is_some()``; correctness held via PARTIAL_FILTER but the + file-col part missed Vortex zone pruning. + + A regression where the per-column split mis-classified ``ri`` as a file + column (and tried to push it to Vortex) would surface as a + ``ComputeError`` from Vortex bailing on missing column 'ri'. + """ + path = tmp_path / "ri_mixed.vortex" + pl.DataFrame({"x": list(range(20))}).write_vortex(path) + + out = ( + pl.scan_vortex(path, row_index_name="ri") + .filter((pl.col("x") > 5) & (pl.col("ri") > 10)) + .collect() + ) + # x: 0..20; ri: 0..20 (1:1 mapping). x > 5 → x ∈ {6..19}; ri > 10 → ri ∈ {11..19}. + # Intersection: x ∈ {11..19}, ri ∈ {11..19}, both 9 rows. + assert out.shape == (9, 2) + assert out["x"].to_list() == list(range(11, 20)) + assert out["ri"].to_list() == list(range(11, 20)) + + +def test_scan_with_include_file_paths_and_file_col_mixed_filter(tmp_path: Path) -> None: + """PR-2.8 cycle 2: include_file_paths + file col mixed filter pushes file part. + + The predicate is split per-minterm: ``x > 5`` pushes to Vortex; the + ``pl.col("src").str.ends_with("a.vortex")`` minterm stays residual. + Note: ``include_file_paths`` populates ``src`` with the FULL path (per + ``ScanSourceRef::to_include_path_name`` at + ``crates/polars-plan/src/dsl/scan_sources.rs``), not the basename. The + discriminator must therefore anchor on a substring that does NOT appear in + pytest's ``tmp_path`` parent directory; ``str.ends_with("a.vortex")`` + anchors on the basename suffix and is robust across CI environments. + + A regression where the per-column split mis-classified ``src`` as a file + column would surface as a ``ComputeError`` from Vortex bailing on missing + column 'src'. + """ + a = tmp_path / "a.vortex" + b = tmp_path / "b.vortex" + pl.DataFrame({"x": [1, 3, 5, 7, 9]}).write_vortex(a) + pl.DataFrame({"x": [2, 4, 6, 8, 10]}).write_vortex(b) + + out = ( + pl.scan_vortex([a, b], include_file_paths="src") + .filter((pl.col("x") > 5) & pl.col("src").str.ends_with("a.vortex")) + .collect() + ) + # a.vortex's x: 1,3,5,7,9 → x > 5 → 7, 9 (src ends with "a.vortex" → match) + # b.vortex's x: 2,4,6,8,10 → x > 5 → 6, 8, 10 (src ends with "b.vortex" → no match) + assert out.shape == (2, 2) + assert sorted(out["x"].to_list()) == [7, 9] + assert all(s.endswith("a.vortex") for s in out["src"].to_list()) + + +def test_scan_with_starts_with_wildcard_in_needle(tmp_path: Path) -> None: + r"""PR-2.7 cycle 1 negative path: '%' wildcard in needle refuses pushdown. + + ``bytes_to_like_literal`` rejects %/_/\\ to avoid LIKE-semantics widening. + The residual filter reapplies post-decode for correctness; a regression + dropping the wildcard guard would *widen* the predicate (Vortex LIKE + interprets '%' as match-any). + Correctness must hold either way (the residual is the safety net), so + the assertion focuses on the result: only rows containing the literal + '100%' string match. + """ + path = tmp_path / "wildcard_needle.vortex" + df = pl.DataFrame({"s": ["100% pure", "1000 hits", "absolute 100%", "no match"]}) + df.write_vortex(path) + + out = pl.scan_vortex(path).filter(pl.col("s").str.starts_with("100%")).collect() + # Only "100% pure" starts with the literal "100%". "1000 hits" must not + # match (would if '%' were interpreted as LIKE wildcard). + assert out["s"].to_list() == ["100% pure"]