Skip to content

feat(rust): Vortex Phase 1 — native file-format integration + Criterion benches#2

Open
lwwmanning wants to merge 146 commits into
mainfrom
vortex-integration-phase-1
Open

feat(rust): Vortex Phase 1 — native file-format integration + Criterion benches#2
lwwmanning wants to merge 146 commits into
mainfrom
vortex-integration-phase-1

Conversation

@lwwmanning
Copy link
Copy Markdown
Member

@lwwmanning lwwmanning commented May 18, 2026

Native Vortex file-format support for Polars, as a peer of Parquet. Read and write across local, cloud, and in-memory sources; schema discovery; multi-file scans with hive partitioning and missing_columns policy; projection pushdown; filter pushdown for equality, range, IN-list, and LIKE prefix/suffix predicates; positive and negative slice pushdown; per-scan and process-global tuning of Vortex's decompressed-segment cache; and per-file statistics populated from the Vortex footer so the mem-engine prunes whole files before the streaming source opens them.

Code lives almost entirely in a new crates/polars-vortex/ crate so leaf crates don't pull Vortex's dep graph; the cross-crate touchpoints are additive and feature-gated behind vortex. The default build is unaffected. A Criterion bench at crates/polars/benches/io_vortex.rs anchors a perf baseline for the AExpr-direct filter pushdown work in #4, which stacks on this branch.

🤖 Generated with Claude Code

lwwmanning and others added 30 commits May 13, 2026 18:33
Adds a public `handle()` method that returns an owned `tokio::runtime::Handle`
to Polars' global `ASYNC` Tokio runtime. This is a one-line precursor to the
Vortex integration: downstream libraries (e.g., Vortex) accept a Tokio handle
directly so they can share Polars' runtime instead of spinning up their own.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ation

Creates the foundational `polars-vortex` crate that will host the Vortex
file-format reader and writer. Mirrors the structure of `polars-parquet` —
a peer crate that wraps the Vortex APIs (`vortex-file`, `vortex-scan`,
`vortex-layout`, etc.) behind a Polars-friendly facade.

What's in this commit:

* New `crates/polars-vortex/` crate, gated on a future `vortex` feature.
* `session::session()` — a process-global `VortexSession` whose runtime is
  Polars' global `ASYNC` Tokio runtime (via a `vortex_io::runtime::Handle`
  built from `ASYNC.handle()`). Avoids a second Tokio runtime and the
  single-threaded `CurrentThreadRuntime` bottleneck.
* `session::segment_cache()` — a process-global Moka-backed segment cache
  shared by every Vortex scan, sized via `POLARS_VORTEX_CACHE_BYTES`
  (default 512 MiB; `0` disables). Cross-query reuse of decompressed
  segments is one of Vortex's biggest wins over Parquet.
* `read::options::VortexScanOptions` — the option struct that will live in
  the new `FileScanIR::Vortex` variant. Kept compact for the IR's
  size-80 assertion. Tunables: `schema` (skip-inference shortcut),
  `use_statistics`, `push_predicate`, `push_projection`, `cache` mode,
  `aggressive_pushdown`, `scan_concurrency`, `initial_read_size`.
* `read::options::VortexCacheMode` — Global / Off / Dedicated(bytes).
* `write::options::{VortexWriteOptions, VortexLayoutKind, VortexCompression}`
  — write-side options stub, filled in by the writer PR.
* Placeholder modules for `read::{metadata, predicate, projection,
  read_at, schema}` and `write::*` to be implemented in subsequent PRs.

Workspace wiring:

* `vortex` added to workspace deps as a path dep pointing at a sibling
  `/Users/will/git/vortex/` checkout. For a public release this would
  point at a published crates.io version.
* `polars-vortex` added to workspace deps.
* Cargo features: `cloud` (object_store integration), `serde`, `dsl-schema`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ng engine

End-to-end plumbing for a `pl.scan_vortex(...)` user-facing API, gated on the
new `vortex` feature flag. After this commit, the LazyFrame -> IR -> streaming
engine path is structurally complete: a `FileScanIR::Vortex` flows through
optimization, scan-source expansion, hive-partition discovery, slice/predicate
pushdown setup, and physical planning. The streaming source node itself is a
stub that errors at runtime ("not yet implemented") — landing the actual
open/scan/decode loop is the next sub-PR.

`cargo check -p polars --features vortex` succeeds. Default build is
unaffected (the `vortex` feature is opt-in everywhere).

What's wired:

* `FileScanDsl::Vortex { options }` and `FileScanIR::Vortex { options,
  metadata }` variants in `polars-plan/src/dsl/file_scan/mod.rs`, with
  `_file_scan_eq_hash::FileScanEqHashWrap::Vortex` for the manual Eq/Hash
  trait (pointer-eq on the cached Footer, matching Parquet/IPC).
* `FileScanIR::flags()` returns `SPECIALIZED_PREDICATE_FILTER` for Vortex
  (lets the optimizer push `ColumnPredicateExpr` variants — Equal/Between/
  EqualOneOf/StartsWith/EndsWith/RegexMatch — which Vortex's expression IR
  fully covers).
* `FileScanIR::streamable()` returns true for Vortex.
* `DslBuilder::scan_vortex` and the `scans::vortex_file_info` helper in the
  DSL -> IR conversion pass. Schema inference is stubbed (returns a clear
  error directing users to pass `schema=...`); the file-open path is
  populated by a follow-up that ships the Vortex-DType -> polars-arrow
  schema converter.
* `LazyFrame::scan_vortex` + `scan_vortex_sources` + `scan_vortex_files`
  user-facing entry points in `polars-lazy/src/scan/vortex.rs`, with a
  full `ScanArgsVortex` mirroring `ScanArgsParquet` plus Vortex-specific
  knobs (`use_statistics`, `push_predicate`, `push_projection`,
  `aggressive_pushdown`, `initial_read_size`, `scan_concurrency`,
  `segment_cache`).
* `polars-stream/src/nodes/io_sources/vortex/{mod,builder}.rs` with a
  `VortexReaderBuilder` (`FileReaderBuilder` impl) and `VortexFileReader`
  (`FileReader` impl). Capabilities advertised: `ROW_INDEX | PRE_SLICE |
  PARTIAL_FILTER | MAPPED_COLUMN_PROJECTION`. The `initialize()` and
  `begin_read()` methods currently return "not yet implemented" errors —
  the next sub-PR fills in the open/scan/morsel loop.
* `lower_ir.rs` lowers `FileScanIR::Vortex` to `VortexReaderBuilder`,
  alongside the existing Parquet/IPC/CSV/etc. arms.
* `slice_pushdown_lp.rs` allows slice pushdown for Vortex scans.
* `polars-mem-engine/src/scan_predicate/functions.rs` zeros out the
  Vortex metadata field when invalidating the predicate cache (mirrors
  the Parquet/IPC behavior).
* `polars-plan/src/dsl/scan_sources.rs` adds Vortex to the
  `expand_paths_with_hive_update` feature gate.
* `polars-vortex/src/lib.rs` re-exports `::vortex` so downstream Polars
  crates can refer to upstream Vortex types (`vortex::file::Footer`) via
  `polars_vortex::vortex::...` without taking a direct `vortex` dep.

Feature-flag wiring:

* `polars-plan`: adds `vortex = ["polars-vortex"]`, with `polars-vortex`
  as an optional dep.
* `polars-lazy`: adds `vortex = ["dep:polars-vortex", "polars-plan/vortex",
  "polars-mem-engine/vortex", "polars-stream?/vortex"]`.
* `polars-stream`: adds `vortex = ["polars-mem-engine/vortex",
  "polars-plan/vortex", "dep:polars-vortex"]`.
* `polars-mem-engine`: adds `vortex = ["polars-plan/vortex"]`.
* `polars` (umbrella): adds `vortex = ["polars-lazy?/vortex",
  "new_streaming"]`.

What's deliberately left for follow-up PRs (PR-2c/d and PR-3+):

* The actual streaming open/scan/morsel loop in `VortexFileReader`
  (uses `VortexOpenOptions::open`, `ScanBuilder::into_stream`, the
  `PolarsInstrumentedVortexReadAt` decorator).
* Vortex `DType` -> polars-arrow `ArrowSchema` converter (so
  `vortex_file_info` can do real schema inference).
* `polars_expr_to_vortex` predicate convertor (PR-3).
* Cloud `VortexReadAt` over `polars_io::cloud::build_object_store` (PR-5).
* Eager / streaming writer (`sink_vortex`, PR-9/10).
* Python bindings (PR-12).

See `.claude/plans/we-want-to-scope-transient-patterson.md` for the full
implementation plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e-open

Lights up real schema discovery for `pl.scan_vortex(...)`. After this commit,
`scan_vortex(path).collect_schema()` works end-to-end on local files (and
in-memory buffers). The `.collect()` morsel-emitting path is still pending —
that's the final PR-2 sub-task.

What's new:

* **`polars-vortex/src/read/schema.rs`** — Vortex `DType` → polars-arrow
  `ArrowSchema` + Polars `Schema` translation. The two Arrow crates are
  nominally distinct (polars-arrow is the internal fork), so we walk Vortex's
  `DType` enum recursively and emit polars-arrow types directly, bypassing
  the upstream-arrow intermediate that `DType::to_arrow_schema` would produce.
  Covers Null, Bool, all PType variants, Decimal/Decimal256, Utf8View,
  BinaryView, List, FixedSizeList, Struct, and the temporal extension types
  (Timestamp/Date32/Date64/Time32/Time64). Union, Variant, and unknown
  extensions bail with a clear error.

* **`polars-vortex/src/read/read_at.rs`** — `PolarsInstrumentedVortexReadAt`
  decorator over `vortex::io::VortexReadAt`. Each `read_at` runs inside
  `polars_io::pl_async::with_concurrency_budget(1, …)` (so Vortex shares
  Polars' global concurrency cap with Parquet et al.) and threads bytes
  through `OptIOMetrics::record_io_read` for observability. Zero buffer
  copies — the inner reader's `BufferHandle` passes through. Also exposes
  `local_file_read_at(path, metrics)` and `in_memory_read_at(bytes, uri,
  metrics)` factories.

* **`polars-stream/src/nodes/io_sources/vortex/mod.rs`** —
  `VortexFileReader::initialize()` now actually opens the file via
  `VortexOpenOptions`, attaches the process-global segment cache, caches
  the `Footer` (so re-conversions of the IR are cheap), and exposes the
  schema through `file_schema()` / `file_arrow_schema()` /
  `fast_n_rows_in_file()` (the trait hooks the multi-scan layer uses
  before `begin_read`). `begin_read()` itself is still a stub —
  `initialize()` succeeding is enough for schema-only consumers, plus it
  fully exercises the open path which validates the runtime wiring.

* **`polars-plan/src/plans/conversion/dsl_to_ir/scans.rs`** — real
  `vortex_file_info`. Opens the first source via Polars' shared session,
  pulls the DType through the schema converter, populates `FileInfo`
  (schema + reader_schema + row_estimation), caches `Arc<Footer>` for
  reuse, and inserts the row-index column when requested. The previous
  stub (which forced users to pass `schema=…`) is gone — schema
  inference works the same way Parquet/IPC do.

Verified: `cargo check -p polars --features vortex,parquet` passes
(~6 s warm). Default build unaffected.

Cargo: enables `polars-io/async` for `polars-vortex` (the `pl_async`
module that backs `with_concurrency_budget`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…copy bridge

Implements `VortexFileReader::begin_read()`: opens the scan, drives Vortex's
`ScanBuilder::into_array_stream()`, converts each `ArrayRef` into an upstream
`RecordBatch` via `ArrowArrayExecutor::execute_record_batch`, bridges that
zero-copy into a Polars `DataFrame`, and emits morsels with backpressure. After
this commit, `pl.scan_vortex(path).collect()` actually reads data.

The bridge: `polars-vortex/src/read/array_bridge.rs`

Vortex emits upstream-Arrow types (`arrow_array::RecordBatch`); Polars uses its
internal-fork `polars-arrow`. Both crates implement the *same* Arrow C Data
Interface struct layout (`#[repr(C)]`, 9 identical fields, per spec), so we move
between them via the C ABI:

  upstream array
    -> `arrow_array::ffi::to_ffi(&ArrayData)` -> `FFI_ArrowArray`
    -> `mem::transmute` (verified at compile time via `size_of` assertion)
    -> polars-arrow `ArrowArray`
    -> `polars_arrow::ffi::import_array_from_c(array, dtype)` -> `Box<dyn Array>`
    -> `Series::from_arrow(name, array)` -> Polars `Series` -> `Column` -> `DataFrame`

No buffer copies — only the ~80-byte C-ABI struct is moved. The upstream array's
backing buffers stay in place; reference counts on both sides keep them alive
until polars-arrow drops the import.

This matches what `vortex-duckdb` does (DuckDB extension framework does the same
C-ABI handoff), and is the standard cross-crate Arrow interop pattern.

The streaming reader (`polars-stream/src/nodes/io_sources/vortex/mod.rs`):

* `InitializedState` now caches both polars-arrow `ArrowSchema` and upstream
  `arrow_schema::Schema` (needed by `execute_record_batch`), plus the per-column
  polars-arrow `ArrowDataType` vector (handed to the bridge each batch).
* `begin_read` builds a `ScanBuilder` from the cached `VortexFile`, applies
  `pre_slice` as a positive `row_range` (negative slice and predicate pushdown
  are PR-3/PR-4), spawns a `TaskPriority::Low` task on the streaming
  `async_executor`, and pumps morsels via `FileReaderOutputSend::new_serial`
  with proper backpressure (`send_morsel`'s wait_group).

What works now:
* `pl.scan_vortex(path).collect()` on local files and in-memory buffers.
* `pl.scan_vortex(path).slice(offset, len)` — pre_slice pushed to ScanRequest.
* Schema discovery via `vortex_file_info` (the previous commit).

What's still pending: PR-3 (filter pushdown via AExpr -> Vortex Expression),
PR-4 (negative slice + row-index), PR-5 (cloud), the writer (PR-9/10), Python
(PR-12), benches (PR-14).

Cargo: adds `arrow-array = "58"`, `arrow-data = "58"`, `arrow-schema = "58"`
to `polars-vortex` for the bridge. Pinned to the same major Vortex pulls in
transitively, so no duplicate-crate risk.

Verified: `cargo check -p polars --features vortex` (0.14 s warm),
`cargo check -p polars` (default, no vortex) — both pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pression

Adds the AExpr / specialized-predicate → Vortex `Expression` convertor and wires it
into `VortexFileReader::begin_read`. For predicates the Polars optimizer can extract
as `SpecializedColumnPredicate`, we now hand a Vortex filter to `ScanBuilder::with_filter`
instead of materializing every batch and applying the residual post-decode. This is
the meat of the "maximal Vortex pushdown" goal from the plan.

The convertor (`polars-vortex/src/read/predicate.rs`):

* Walks `ScanIOPredicate::column_predicates.predicates` — Polars exposes per-column
  `(PhysicalIoExpr, Option<SpecializedColumnPredicate>)`. We pattern-match the
  specialized variant (no need to crack open the opaque PhysicalIoExpr).
* Translates the structured forms into Vortex Expression trees:
  - `Equal(scalar)`        → `eq(get_item(col, root()), lit(scalar))`
  - `Between(lo, hi)`      → `and(gt_eq(col, lo), lt_eq(col, hi))` (closed range)
  - `EqualOneOf(scalars)`  → `or_collect` of `eq(col, lit(s))` — all-or-nothing on
    scalar conversion so we never push a *narrower* filter than the user wrote.
  - `StartsWith` / `EndsWith` / `RegexMatch` → not pushed yet; tracked under PR-3
    follow-ups (Vortex's `like` needs separate `LikeOptions` plumbing).
* `polars_scalar_to_vortex` covers the primitive AnyValue variants
  (bool / Int8..Int64 / UInt8..UInt64 / Float32 / Float64 / String / Binary). Unknowns
  return `None` and that column's specialized predicate stays as residual.

The wiring (`polars-stream/src/nodes/io_sources/vortex/mod.rs`):

* `begin_read` calls `polars_to_vortex_predicate(scan_predicate)` gated on
  `VortexScanOptions.push_predicate` (default `true`). The returned `Expression` is
  handed to `ScanBuilder::with_filter`.
* We continue to advertise `ReaderCapabilities::PARTIAL_FILTER`, so the multi-scan
  layer keeps the original `ScanIOPredicate::predicate` around and re-applies it to
  emitted morsels. Pushing a *partial* predicate is safe: Vortex prunes/filters what
  it can, the residual catches the rest.

What this unlocks:

* `pl.scan_vortex(path).filter(pl.col("a") == 42).collect()` — `a == 42` becomes a
  Vortex filter expression. Vortex's `LayoutReader::pruning_evaluation` (running
  inside `ScanBuilder`) uses zone statistics to skip whole chunks where `a == 42`
  is impossible. This is the big perf win over the previous commit's "decode
  everything, then filter post-decode" pattern.
* `filter(pl.col("a").is_between(0, 100))`, `filter(pl.col("a").is_in([1, 2, 3]))` —
  same.

Still residual (multi-scan layer applies post-decode):

* String pattern predicates (StartsWith/EndsWith/RegexMatch) — Vortex's `like`
  builder takes `LikeOptions` we haven't plumbed yet; one follow-up.
* Multi-column predicates (e.g. `col("a") > col("b")`) and arithmetic — the
  optimizer doesn't lift these into `SpecializedColumnPredicate`, so they reach us
  only as `PhysicalIoExpr`. Lifting them is a larger PR that walks `AExpr` at IR
  conversion time.

Verified: `cargo check -p polars --features vortex` passes (0.14 s warm).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds string-prefix and string-suffix pushdown to the predicate convertor. Polars
emits `SpecializedColumnPredicate::StartsWith(bytes)` for `pl.col("x").str.starts_with(...)`
and similarly for EndsWith; we convert to `LIKE 'prefix%'` / `LIKE '%suffix'` and
hand to Vortex's `like` expression builder.

Safety: we refuse pushdown when the bytes contain SQL-LIKE wildcards (`%`, `_`)
or the escape (`\`). LIKE would *widen* the pattern in those cases — which is
still correct (the multi-scan layer always re-applies the original predicate
post-decode), just wasteful of I/O. Rather than push a wasteful pattern, we
leave that to the residual filter.

Now pushable: `filter(pl.col("name").str.starts_with("foo"))`,
`filter(pl.col("name").str.ends_with(".csv"))`.

Verified: `cargo check -p polars-vortex` passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…adAt

Adds S3 / GCS / Azure / HTTP support to `scan_vortex`. After this commit,
`pl.scan_vortex("s3://bucket/key.vortex", storage_options={...})` works.

Implementation:

* `polars-vortex/src/read/read_at.rs::cloud_read_at(path, cloud_options, io_metrics)`
  — async factory gated on `cfg(feature = "cloud")`. Goes through Polars'
  `polars_io::cloud::build_object_store` so all of `CloudOptions` (auth, retry
  config, endpoint overrides, credential providers) is honored. The resulting
  `Arc<dyn ObjectStore>` is handed to `vortex::io::object_store::ObjectStoreReadAt`
  along with the runtime handle, then wrapped in `PolarsInstrumentedVortexReadAt`
  for the standard concurrency-budget + IOMetrics instrumentation.

* `polars-stream/src/nodes/io_sources/vortex/mod.rs` — `build_read_at` gains a
  `#[cfg(feature = "cloud")]` arm for cloud paths; the `cfg(not(feature = "cloud"))`
  fallback emits a clear "rebuild with --features vortex,cloud" message.

* `object_store` added to `polars-vortex` as an optional dep, gated on the
  `cloud` feature. The version is workspace-pinned so it resolves to the same
  fork (`kdn36/arrow-rs-object-store`) Polars patches in via `[patch.crates-io]`.
  Vortex's transitively-pulled `object_store` resolves to the same patched
  version, so the `Arc<dyn ObjectStore>` types are identical across the bridge —
  no duplicate-crate risk.

Feature-flag propagation:

* `polars-lazy/cloud` now also enables `polars-vortex?/cloud` (optional, only
  fires when `polars-vortex` is itself in the dep graph).
* `polars-stream/cloud` similarly.
* User-facing: `cargo check -p polars --features vortex,cloud`.

Verified: `cargo check -p polars --features vortex,cloud` passes (10 s warm),
`cargo check -p polars --features vortex` (local-only) still passes, and the
default build is unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the user-facing Python API for Vortex reads. After this commit:

    import polars as pl
    lf = pl.scan_vortex("data.vortex").filter(pl.col("a") == 42).select("b", "c")
    df = lf.collect()

Wires through the full feature surface from the prior commits: schema discovery,
predicate pushdown (Equal/Between/EqualOneOf/StartsWith/EndsWith), cloud reads
via storage_options, segment cache, slice pushdown.

Rust side (`polars-python`):

* `PyLazyFrame::new_from_vortex` — pyo3 staticmethod taking sources, schema,
  scan_options, and Vortex-specific knobs (use_statistics, push_predicate,
  push_projection, aggressive_pushdown, initial_read_size, scan_concurrency).
  Builds a `VortexScanOptions` and goes through `DslBuilder::scan_vortex`.
* `crates/polars-python/src/lazyframe/visitor/nodes.rs::scan_type_to_pyobject`
  — adds the `FileScanIR::Vortex` arm so the IR visitor exposes Vortex scans
  to Python introspection (used by EXPLAIN, etc.).
* New `polars-python` feature `vortex` that gates the bindings.

Python side (`py-polars/src/polars/io/vortex/`):

* `functions.py::scan_vortex` — `pl.scan_vortex(...)`. Docstring covers the
  Vortex-specific tunables and points users at `storage_options` for cloud.
* `functions.py::read_vortex` — `pl.read_vortex(...)`, a thin
  `scan_vortex(...).collect()` wrapper for the eager API.
* `__init__.py` re-exports both.
* `polars/io/__init__.py` and top-level `polars/__init__.py` add the new
  symbols to their public imports and `__all__` lists, so `import polars as pl;
  pl.scan_vortex(...)` works.

Wiring up the feature flags surfaced two unrelated issues that needed fixing:

* `polars-plan/serde` now propagates to `polars-vortex?/serde`. Without this,
  `VortexScanOptions` (which lives inside `FileScanIR::Vortex`) doesn't
  satisfy `Serialize` / `Deserialize` when `polars-plan`'s serde is on.

* `polars-plan/src/plans/optimizer/expand_datasets.rs:282` had a non-exhaustive
  match on `FileScanDsl` that didn't include the Vortex variant. Added the arm.

* `polars-sql/src/context.rs::execute_select` — when `vortex` is enabled, Vortex's
  transitive deps pull in `hashbrown 0.17` alongside Polars' workspace-pinned
  `hashbrown 0.16`, breaking type inference on three `PlHashSet::new()` /
  `PlHashMap::with_capacity()` calls. Added explicit `<&str>` annotations so
  inference succeeds regardless of which hashbrown is in scope. Identical
  behavior, just more explicit.

Verified: `cargo check -p polars-python --features vortex` passes (0.20 s warm).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round-trips Polars DataFrames into Vortex files using the same C-ABI
zero-copy trick as the read path. After this commit:

    use polars_vortex::write::write_vortex;
    let df = pl.read_vortex("input.vortex");
    write_vortex(&df, "output.vortex", &VortexWriteOptions::default())?;

The reverse bridge (`polars-vortex/src/write/array_bridge.rs`):

* `polars_array_to_upstream(Box<dyn Array>, &Field)` — moves one polars-arrow
  column into an upstream `arrow_array::ArrayRef` via `export_array_to_c` +
  `mem::transmute` + `arrow_array::ffi::from_ffi`. Zero buffer copy. Same
  9-field `#[repr(C)]` layout argument as the read-side import; compile-time
  `size_of` asserts on the FFI struct identity.

* `polars_chunk_to_upstream_record_batch(Vec<Box<dyn Array>>, &Schema)` —
  builds an upstream `RecordBatch` from a single polars-arrow chunk of columns.

DataFrame -> Vortex chunks (`write/df_to_stream.rs`):

* `dataframe_to_vortex_chunks(&DataFrame) -> (DType, Vec<ArrayRef>)` —
  iterates the DataFrame's chunked columns, bridges each chunk to an upstream
  `RecordBatch`, wraps as a `StructArray`, then ingests via Vortex's
  `FromArrowArray<&StructArray>`. The top-level Vortex `DType::Struct` is
  derived once from the first chunk's upstream schema using Vortex's
  `FromArrowType<&Schema>`.
* Refuses misaligned chunks (different chunk counts per column) with a clear
  error telling the user to call `df.rechunk()` first.

The writer entry point (`write/writer.rs`):

* `write_vortex(&df, path, &options)` — opens a tokio file, wraps the chunks
  as an `ArrayStreamAdapter`, and drives `VortexWriteOptions::write` on
  Polars' global ASYNC Tokio runtime via `ASYNC.block_on`. Default
  `VortexWriteOptions` produces a BtrBlocks-compressed Zoned layout.

What's pending in the write path:

* `LazyFrame::sink_vortex` and `FileWriteFormat::Vortex` — needs an IR variant
  similar to the read-side variant (FileScanIR::Vortex), the `FileWriterStarter`
  impl in `polars-stream/src/nodes/io_sinks/writers/`, and the lower_ir.rs arm.
  Tracked as PR-10 / PR-11 in the plan.
* Schema-side dtype coverage in `df_to_stream::dataframe_to_vortex_chunks` —
  the C-ABI bridge handles every dtype at the array level (it just moves bytes),
  but the top-level DType derivation goes through Vortex's
  `FromArrowType<&Schema>` which already handles primitives, decimal, utf8,
  binary, list, struct, fixed-size list, and the temporal extensions. So this
  *should* be feature-complete; tested cases are TBD.
* Python bindings for `df.write_vortex(path)` / `lf.sink_vortex(path)` —
  follow-up to PR-12.

Verified: `cargo check -p polars --features vortex` passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…WriteFormat::Vortex

After this commit, `lf.sink_vortex("out.vortex")` works from Python: the streaming
engine drains morsels, converts each DataFrame chunk into a Vortex `ArrayRef`
zero-copy via the reverse C-ABI bridge, and feeds them into `VortexWriteOptions::write`
on the global ASYNC runtime. This is the streaming analogue of the eager
`polars_vortex::write::write_vortex` from PR-9.

What's wired:

* `FileWriteFormat::Vortex(Arc<VortexWriteOptions>)` variant added in
  `polars-plan/src/dsl/options/mod.rs`, including the `extension()` arm.

* `polars-stream/src/nodes/io_sinks/writers/vortex/mod.rs::VortexWriterStarter`
  — `FileWriterStarter` impl. `start_file_writer`:
  - Derives the top-level Vortex `DType` from the file schema upfront via
    `polars_schema_to_vortex_dtype` (new helper in `polars-vortex` — converts
    polars `Schema` → polars-arrow `ArrowSchema` → upstream `arrow_schema::Schema`
    field-by-field via the C-ABI struct transmute → Vortex `DType` via
    `FromArrowType<&Schema>`).
  - Spawns a producer task that drains `morsel_rx` and converts each `DataFrame`
    to Vortex `ArrayRef`s via the C-ABI bridge, sending through a bounded
    `futures::channel::mpsc` for backpressure.
  - Wraps the channel as an `ArrayStreamAdapter` and calls
    `VortexWriteOptions::write(tokio_file, stream)` on `ASYNC.spawn(…)`.

* `create_file_writer_starter` arm in
  `polars-stream/src/nodes/io_sinks/writers/mod.rs` — picks the
  `VortexWriterStarter` when the `FileWriteFormat::Vortex` variant is present.

* Two non-exhaustive-match arms in `polars-stream/src/physical_plan/fmt.rs`
  (FileSink + PartitionedSink labels in EXPLAIN output).

* `PyLazyFrame::sink_vortex` in `polars-python` — staticmethod, mirrors
  `sink_parquet` and packages the call into `self.ldf.sink(...,
  FileWriteFormat::Vortex(...), ...)`.

* `LazyFrame.sink_vortex(path, ...)` in `py-polars/src/polars/lazyframe/frame.py`
  — user-facing Python API. Mirrors `sink_parquet` minus the Parquet-specific
  knobs (compression, row groups, etc., which are baked into Vortex's layout
  strategy).

Current limits / follow-ups:

* The sink only writes to local file paths; `Writeable::Cloud` paths error with
  a clear message directing users to write locally and upload. Wiring cloud
  sinks needs a `vortex-io::tokio::AsyncWrite` adapter for the cloud writer —
  parallel to the read-side cloud bridge in PR-5.
* `VortexWriteOptions` is currently passed with defaults; surfacing `layout` /
  `compression` knobs through Python is the next polish PR.
* Partitioned writes (`SinkTypeIR::Partitioned`) should work once the basic
  sink lands, but that's untested here.

Verified: `cargo check -p polars --features vortex,cloud,parquet` passes
(0.14 s warm). Default build unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…che knob

User-facing tunable for the process-global Vortex segment cache (defaults to
512 MiB, also tunable via POLARS_VORTEX_CACHE_BYTES env). After this commit:

    import polars as pl
    pl.set_vortex_cache_bytes(2 * 1024**3)   # 2 GiB
    pl.set_vortex_cache_bytes(0)             # disable

Wires straight through to `polars_vortex::session::set_global_cache_bytes`,
which atomically swaps the global `Arc<dyn SegmentCache>` (MokaSegmentCache vs
NoOpSegmentCache).

* `polars-python/src/functions/meta.rs::set_vortex_cache_bytes` — pyfunction
  gated on `feature = "vortex"`.
* Registered in `polars-python/src/c_api/mod.rs` alongside the other meta
  functions.
* `py-polars/src/polars/io/vortex/functions.py::set_vortex_cache_bytes` —
  Python wrapper with docstring.
* Re-exported from `polars.io.vortex` and the top-level `polars` namespace
  (so `pl.set_vortex_cache_bytes(...)` works).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`pl.scan_vortex(path).tail(N).collect()` and `pl.scan_vortex(path).slice(-N, M)`
now push the slice through to Vortex's `ScanBuilder::with_row_range` instead of
materializing every row and slicing post-decode.

Implementation is trivial because the file's row count is already cached in
`InitializedState` (it comes from the footer, free). Negative slices are
converted to positive via `polars_utils::slice_enum::Slice::restrict_to_bounds`
— the same helper Polars uses elsewhere. Then we hand the positive `Range<u64>`
to `ScanBuilder::with_row_range` like any other slice.

`VortexReaderBuilder` now advertises `ReaderCapabilities::NEGATIVE_PRE_SLICE`
alongside `PRE_SLICE`, so the multi-scan layer stops splitting negative slices
across files when the Vortex reader can handle them directly.

The only remaining capability gap vs. Parquet is `EXTERNAL_FILTER_MASK` — that
needs Vortex `Selection`-bitmap plumbing and is tracked under PR-13.

Verified: `cargo check -p polars-stream --features vortex` passes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the convenience eager API:

    import polars as pl
    df = pl.read_vortex("input.vortex").filter(pl.col("a") > 0)
    df.write_vortex("output.vortex")

Implemented as `self.lazy().sink_vortex(...)` with eager optimizations and
the streaming engine, mirroring how `df.write_ipc` and `df.write_csv` work.
The cloud sink path still errors at the Rust layer (write-side cloud is
pending); the `storage_options` / `credential_provider` parameters are
accepted for API symmetry but will surface that error when used.

This completes the Python write surface. After this commit the user-facing
API matrix is:

  Read:  pl.scan_vortex(path)                 (lazy)
         pl.read_vortex(path)                 (eager)
  Write: lf.sink_vortex(path)                 (streaming, lazy)
         df.write_vortex(path)                (eager, internally streaming)
  Config: pl.set_vortex_cache_bytes(N)        (segment-cache knob)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a comprehensive README.md for the new `polars-vortex` crate covering:

* Quickstart — Rust dep snippet + Python end-to-end examples (read, filter, slice,
  write, set_vortex_cache_bytes).
* "Why Vortex" — what makes the integration substantively different from "another
  Parquet": filter pushdown, negative-slice pushdown, decompressed-segment cache
  reuse.
* "How it plugs into Polars" — the full architectural pipeline as a diagram, from
  `LazyFrame::scan_vortex(...)` through DSL→IR conversion, mem-engine delegation
  to polars-stream, `VortexReaderBuilder` → `VortexFileReader` → C-ABI bridge →
  Morsel emission. Sink path mirrored.
* The four key design decisions (one Tokio runtime, C-ABI zero-copy bridge, the
  PolarsInstrumentedVortexReadAt decorator, ColumnPredicates → Vortex Expression)
  with code excerpts and rationale.
* Cargo features table.
* Configuration — segment cache (process-global, env-tunable), concurrency budget,
  cloud auth via `storage_options`.
* Pushdown coverage table at a glance: ✅ for projection / pos+neg slice /
  Equal/Between/IN/StartsWith/EndsWith / zone-level pruning / hive / schema
  evolution / row index; ❌ residual for regex / arithmetic / CAST / struct field
  access (with reasons + tracking).
* Crate layout diagram + IR/DSL touchpoint inventory.
* Public API surface for both Rust and Python.
* "What works today" checklist.
* "Known limits / pending follow-ups" (cloud sink, write options plumbing through
  Python, aggressive AExpr pushdown, file-level optimizer stats).
* End-to-end test recipes (roundtrip, filter-pushdown verification via explain,
  warm-cache benchmark).
* "Pointers — reading the source" map for new contributors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bridge Polars' AsyncWriteable::Cloud (tokio::io::AsyncWrite) to Vortex's
VortexWrite via `tokio_util::compat::Compat` + `vortex::io::AsyncWriteAdapter`.

`VortexSink` is a small enum that the streaming sink hands to
`VortexWriteOptions::write` regardless of whether the underlying target is a
local `tokio::fs::File` (direct `impl VortexWrite`) or a cloud writer
(compat-wrapped through Vortex's `AsyncWriteAdapter`). Collapsing both into
one enum keeps `VortexWriteOptions::write`'s generic single-typed and lets
the sink writer code stay generic-free.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update README "what works" / "known limits" sections to reflect that
`lf.sink_vortex("s3://...")` now works alongside local sinks via the
`VortexSink` enum bridge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…size) to Python

`lf.sink_vortex(...)` and `df.write_vortex(...)` previously always used
`VortexWriteOptions::default()`. Now accept:

- `compression`: "btrblocks" (default, adaptive per-column) or "uncompressed"
- `layout`: "adaptive" (default), "flat", "chunked", or "zoned"
- `chunk_size`: rows-per-chunk target (None lets Vortex pick)
- `include_dtype`: whether to embed the Vortex DType in metadata (default True)

Strings are parsed in `PyLazyFrame::sink_vortex`, with a clear error message on
unknown values, so we don't expose Vortex's Rust enums through the Python ABI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…p tests

Previously `VortexWriteOptions` was accepted by `write_vortex` and the streaming
sink but threaded through to nothing — both call sites built a fresh
`session.write_options()` from scratch. The Python API surfaced layout/
compression knobs that didn't change a byte on disk.

Reshape the options to match what Vortex's `WriteStrategyBuilder` actually
configures:
- `compression`: BtrBlocks (default, all schemes) vs Uncompressed
  (`BtrBlocksCompressorBuilder::empty()` — no schemes selected)
- `row_block_size`: granularity of zone-level pruning (default 8192 rows).
  Renamed from `target_chunk_size`; the old name was misleading.
- `include_dtype`: whether to embed the DType. Default `true` (the previous
  derived `Default` was silently `false`, which made reads fail with
  "doesn't embed a DType and none provided").

Drop `VortexLayoutKind` entirely — Vortex's strategy builder always produces a
layered Flat→Chunked→Buffered→Zoned strategy; there is no "select your layout
shape" knob to surface. Leaving it in the public API was misleading.

New `write/strategy.rs::build_write_options` is the single place that turns our
public options into Vortex's `VortexWriteOptions`. Used by both `write_vortex`
(eager) and the streaming sink.

Tests in `crates/polars-vortex/tests/roundtrip.rs` exercise default options,
uncompressed, and small-block configurations; the `include_dtype` bug was
caught by these. They live in polars-vortex (not crates/polars) so they don't
need the streaming engine on the read side — they open the file directly
through the Vortex session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds tests for:
- Schema preservation (verify field names match after open)
- Nullable Int32 and Utf8 columns
- Boolean
- Empty DataFrame (zero rows)
- include_dtype=false (smaller files; readers must supply DType out-of-band)

Total roundtrip suite now 8 tests; all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
py-polars/tests/unit/io/test_vortex.py covers the user-facing API end-to-end
via the Python → Rust → streaming engine pipeline:

- Basic roundtrip (int / float / str columns)
- Nullable columns
- Uncompressed compression mode
- Small row_block_size
- Invalid compression error message
- scan + filter
- scan + projection
- scan + negative slice (tail)

A module-level skipif probes whether the binary was built with the `vortex`
feature by attempting a tiny write — Python `scan_vortex` is always exported,
so a hasattr check would be misleading.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- VortexWriteOptions: drop the fictional VortexLayoutKind enum reference,
  rename target_chunk_size → row_block_size, note manual Default impl
- Add `write/strategy.rs` to the crate-layout sketch
- Update sink_vortex / write_vortex signatures in the Python API section to
  show compression / row_block_size / include_dtype parameters
- Drop the "VortexWriteOptions are not yet plumbed to Python" known-limit
  entry, which is now resolved

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- `read/predicate.rs`: tests cover the bytes_to_like_literal wildcard-safety
  check (rejects %, _, \\, and invalid UTF-8) and the polars_scalar_to_vortex
  primitive coverage (bool / int / float / string).

- `read/schema.rs`: tests cover the primitive PType → ArrowDataType mapping,
  Utf8/Binary → view variants, and the two top-level-Struct invariants we
  check at file-open time.

Unit suite is now 9 tests, complementing the 8-test integration suite in
tests/roundtrip.rs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "once the actual reader is implemented" comment was from the
skeleton-phase commit and is stale — the reader landed in commits pola-rs#5–8.
Replace with current-state explanation: why PARTIAL_FILTER is safe (full
predicate is re-applied by multi-scan), and what FULL_FILTER /
EXTERNAL_FILTER_MASK would require.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The unit-test workflow lists each crate explicitly. Adding -p polars-vortex
runs both the 9-test unit suite (predicate + schema convertor) and the
8-test integration suite (roundtrip) under --all-features (which enables
the cloud feature too).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… hardening

Addresses the multi-section review report (see plan §15) in one pass:

**Dead options removed from VortexScanOptions** (and matching API surface):
- `use_statistics`, `aggressive_pushdown`, `push_projection` were declared but
  never consulted anywhere on the Rust side, while exposed via Python with
  defaults that misled users. Removed everywhere — `polars-vortex`,
  `polars-lazy::ScanArgsVortex`, `polars-python::PyLazyFrame::new_from_vortex`,
  and the `pl.scan_vortex` / `pl.read_vortex` Python signatures.
- `scan_concurrency` was similarly declared but unused; now actually wired into
  `ScanBuilder::with_concurrency` in the streaming source.

**Cloud schema discovery** (`vortex_file_info` in polars-plan) now works for
`s3://` / `gs://` / `az://` paths — it routes through
`polars_vortex::read::read_at::cloud_read_at`, with a clean error if
polars-vortex was built without the `cloud` feature. Adds
`polars-vortex?/cloud` to the polars-plan `cloud` feature.

**Predicate scalar convertor** now handles `Date`, `Datetime`, `Time`, and
`Decimal` scalars (gated on the corresponding polars-core dtype-* features),
producing Vortex `Date` / `Timestamp` / `Time` extension scalars and
`DecimalValue::I128`. Polars `Duration` has no Vortex analogue → residual.
New polars-vortex features `dtype-date` / `dtype-datetime` / `dtype-time` /
`dtype-decimal` are propagated from `polars-lazy` so `polars/dtype-full`
turns them all on.

**C-ABI transmute hardening**:
- Strengthen compile-time checks: both `size_of` AND `align_of` asserts (size
  alone wouldn't catch alignment divergence between the two `#[repr(C)]`
  layouts).
- Runtime sanity check after every array import: the upstream-FFI `length`
  field is snapshotted before transmute, then compared against
  `imported.len()` after import. Catches a layout-incompatibility regression
  cleanly instead of letting it propagate as silent corruption.
- Better doc comments: the read-side bridge explains why a transmute is the
  practical option (polars-arrow's `ArrowArray` fields are `pub(super)` so we
  can't construct one field-by-field from outside) and how the release-callback
  handoff stays sound.
- The write-side bridges (per-array + schema-only in df_to_stream) get the
  same treatment.

**Tests now verify data, not just metadata**: roundtrip tests build a
DataFrame, write it, scan it back via Vortex's `into_array_stream()` +
`record_batch_to_dataframe` C-ABI bridge, and assert `equals_missing`. A
buffer-bridging bug would now surface as a test failure rather than silent
mismatch. Helper `read_back()` mirrors the streaming reader's path.

**Schema-alignment validation** in `record_batch_to_dataframe`: column
count is now a hard error (was `debug_assert`), and field-name parity gets
a `debug_assert` so a future reorder-bug surfaces with a clear message rather
than as a downstream dtype mismatch.

**No-panic write path**: `df_to_stream::dataframe_to_vortex_chunks` no
longer relies on `top_dtype.expect("at least one chunk")`. The top-level
dtype is derived from the schema upfront, so n_chunks == 0 falls through
cleanly. The per-column `chunks().get(...)` now returns a proper PolarsError
on miss instead of `.expect(...)`.

**Misc cleanups**:
- Drop the `read/projection.rs` stub file and three dead `let _ = ...` patterns.
- Drop the `use UpstreamSchema as _` and `use IOMetrics as _` no-op imports.
- Refresh stale module-level doc and a stale capability comment.
- Python `_vortex_available()` skipif now catches only `AttributeError` /
  `ComputeError` (not bare `except`), so genuine test bugs fail loud.
- `polars-python/vortex` feature now propagates `polars-vortex/serde` so the
  lazyframe visitor's JSON serialization of Vortex scan options compiles.
- Annotate the load-bearing `<&str>` type hints in polars-sql with a comment
  explaining the hashbrown 0.16 vs 0.17 type-inference conflict.

README synced with the current option set + pushdown coverage matrix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop stale `projection.rs` entry from the crate-layout sketch (file was
  removed in 33b56a8).
- Expand the "What works today" list with cloud schema discovery, the
  Date/Datetime/Time/Decimal scalar pushdown gated on `dtype-*` features, and
  the now-wired `scan_concurrency` knob.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…che, row-count truncation, field metadata, naming)

Addresses bugs identified during the fresh-eyes second-pass review:

**Decimal negative-scale wraparound on read** (`read/schema.rs:71`):
Vortex's `dec.scale()` returns `i8` and Arrow spec allows negative scales, but
polars-arrow's `Decimal(usize, usize)` only represents non-negative scales. The
bare `as usize` cast would silently wrap a negative scale to a huge value.
Now explicitly errors with a clear message.

**Decimal precision/scale wraparound on write** (`read/predicate.rs::decimal_scalar`):
Vortex's `DecimalDType::new(u8, i8)` has narrower ranges than Polars'
`Decimal(usize, usize)`. Bare `as u8` / `as i8` casts silently wrapped for
out-of-range values. Now uses `try_into()` → returns `None` on overflow so the
convertor falls back to residual (always correct).

**`u64 as usize` row-count truncation** (`polars-stream vortex source`):
On 32-bit platforms, files larger than 2^32 rows would silently truncate when
computing the `restrict_to_bounds` argument for negative slices. Now clamped
via `usize::try_from(...).unwrap_or(usize::MAX)`.

**Dead `VortexScanOptions.cache` field** — now actually wired. Added
`VortexCacheMode::resolve()` that returns the right `Arc<dyn SegmentCache>`:
`Global` → the process-wide cache, `Off` → fresh NoOpSegmentCache, `Dedicated(N)`
→ fresh MokaSegmentCache. Streaming source's `initialize()` calls
`self.options.segment_cache.resolve()` instead of always hitting the global cache.
Field also renamed `cache` → `segment_cache` to avoid colliding with the
LazyFrame query-cache `bool` field in ScanArgsVortex.

**Field metadata dropped on write** (`write/array_bridge.rs`):
`polars_chunk_to_upstream_record_batch` rebuilt `Field::new(name, dtype, nullable)`
and threw away any custom metadata the polars-arrow Field carried. Now copies
per-field metadata over to the upstream Field via `with_metadata(...)`.

**Misc cleanups**:
- Wrong doc comment in `read/array_bridge.rs` (`import_array_from_c` takes the
  FFI struct, not `Box<dyn Array>`).
- Deterministic predicate iteration: sort by column name before `and_collect`-ing.
  Vortex's pruning evaluator may short-circuit left-to-right; non-deterministic
  hashmap order was a latent reproducibility hazard.
- Double `to_arrow` schema computation in write path: split
  `polars_schema_to_vortex_dtype` into a `_from_arrow` variant so the eager
  writer can pass through the schema it already built.
- Drop stale "Touch the global session" cargo-cult comment and a PR-3 reference
  that's now PR-13.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses the test-coverage gap surfaced in the second review pass — the
roundtrip suite previously exercised only 5 dtypes (Int64, Float64, Utf8, Bool,
Int32/Utf8-nullable), and the predicate scalar helpers + schema converter
branches were largely untested.

**Unit tests for predicate scalar helpers** (`read::predicate::tests`):
- `date_scalar_is_vortex_date_days` — verifies Date scalars produce
  `DType::Extension(vortex.date)`.
- `datetime_scalar_units_map_correctly` — verifies all three Polars TimeUnits
  (Ns/Us/Ms) produce Vortex Timestamp scalars.
- `datetime_scalar_with_timezone` — verifies the tz round-trip.
- `time_scalar_is_nanoseconds` — verifies Polars Time → Vortex Time(Ns).
- `decimal_scalar_roundtrips_precision_and_scale` — verifies the DecimalDType.
- `decimal_scalar_rejects_overflowing_precision_and_scale` — locks in the new
  `try_into` behaviour (bug fix from c124312) by asserting that
  precision > 255 and scale > 127 return None.
- `convertor_returns_pushable_for_date_predicate` — end-to-end via
  `convert_specialized`.

**Schema converter tests** (`read::schema::tests`): adds the missing branches
of `vortex_dtype_to_arrow_dtype`:
- `null_dtype_maps_to_arrow_null`
- `bool_maps_to_arrow_boolean` (both nullabilities)
- `decimal_precision_chooses_128_or_256` (boundary at p == 38 vs 39)
- `decimal_with_negative_scale_errors_cleanly` (locks in c124312 bug fix)
- `list_maps_through_recursively`
- `fixed_size_list_maps_through_recursively`
- `struct_maps_to_arrow_struct` (nested fields, field-name + dtype assertions)
- `date_extensions_map_to_arrow_date32_and_date64` (both time units)
- `time_extensions_map_to_time32_or_time64` (all 4 time units)
- `timestamp_with_and_without_timezone` (no-tz + UTC, two units)
- `vortex_dtype_to_schema_round_trips_nullability` (field-level nullability)

**Integration roundtrip suite** (`tests/roundtrip.rs`): widens dtype coverage:
- All int and float widths reachable from `Column::new(...)` (Int32/64,
  UInt32/64, Float32/64). i8/i16/u8/u16 need polars-core's `dtype-i*` features
  which aren't part of the polars-vortex feature set; the schema-converter unit
  tests above cover those mappings.
- NaN preservation (shape + dtype, since NaN != NaN by IEEE).
- Binary (both non-null and nullable).
- Multi-chunk DataFrame: builds a 3-chunk series and asserts the writer
  correctly concatenates across chunks.
- Date (dtype-date), Datetime ×3 units no-tz (dtype-datetime), Time (dtype-time).

Suite is now 28 unit tests + 18 integration tests (+ 15 without dtype features)
= 46 Rust tests. All green under `--features dtype-date,dtype-datetime,dtype-time,dtype-decimal`
and under the bare feature set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eanups, expanded tests

Addresses the remaining ⏳ items from the second review pass:

**Bug-adjacent fixes**:
- `polars-mem-engine/src/planner/lp.rs`: explicit Vortex branch sets
  `create_skip_batch_predicate = false`. The previous default was implicit
  (table_statistics is None for Vortex), which would have silently flipped
  to true if PR-8 ever wires file-level stats. Vortex does zone pruning
  inside ScanBuilder via `LayoutReader::pruning_evaluation`; running Polars'
  per-row-group skip-predicate machinery on top would be wasted work and
  risk double-pruning false positives.

**Cleanups**:
- Folded `read/metadata.rs` (a 10-line file with one type alias) into
  `read/mod.rs`. Kept `polars_vortex::read::metadata::VortexFooterRef` as
  a compatibility re-export so downstream `polars-plan` paths still resolve.
- Promoted `tokio-util` to a workspace dep. polars-vortex now uses
  `tokio-util = { workspace = true, features = ["compat"], optional = true }`
  consistent with the workspace's other tokio deps.

**Test coverage** — Suite now 42 unit + 23 integration = 65 Rust tests:

*VortexCacheMode::resolve() — 3 unit tests* (`read::options::cache_mode_tests`):
- `Global` returns the same Arc as `session::segment_cache()` across calls.
- `Off` returns distinct fresh NoOp caches per call, not the global.
- `Dedicated(N)` returns distinct fresh Moka caches per call.

*PolarsInstrumentedVortexReadAt — 3 unit tests* (`read::read_at::tests`):
- A `CountingReadAt` fake `VortexReadAt` records reads. The decorator must
  forward each `read_at` call exactly once with the requested length.
- `concurrency()` and `coalesce_config()` correctly delegate to inner.
- `IOMetrics` instrumentation produces non-empty output after reads.

*Predicate convertor variant coverage — 8 unit tests*:
- Equal / Between / EqualOneOf / StartsWith (safe + unsafe bytes) / EndsWith
  / RegexMatch (residual) / EqualOneOf with partial failure (residual).
- Locks in the convertor's "all-or-nothing" semantics: if even one IN-list
  scalar fails to convert (e.g., contains AnyValue::Null), we fall back to
  residual rather than push a narrower predicate.

*Roundtrip — 4 new integration tests*:
- `roundtrip_datetime_with_timezone` — Datetime(Microseconds, UTC), 3 rows.
- `roundtrip_decimal_basic` — Decimal(10, 2), 4 rows including a negative.
- `roundtrip_decimal_with_nulls` — Decimal(8, 4) with mixed Some/None.
- `roundtrip_decimal_precision_38_boundary` — Decimal(38, 0) tests the
  schema converter's `precision <= 38 → Decimal` boundary in production.
- `segment_cache_second_read_produces_same_data` — smoke test for the cache
  wiring; two consecutive reads through the global cache must produce
  identical results.

Adds `regex` to dev-dependencies for the RegexMatch test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e 2 wall-clock anchor)

New `crates/polars/benches/io_vortex.rs` with 3 Criterion benches:

  - `vortex_scan/no_filter` — baseline full-scan; identical perf on both
    phases (no convertor involvement). Normalizes throughput across machines.
  - `vortex_scan/filter_lt` — `col < N`. Phase 1: SpecializedColumnPredicate::Lt
    pushes down. Phase 2: AExpr-direct convertor's `lt` arm pushes down. Both
    paths reach Vortex's `LayoutReader::pruning_evaluation`; wall-clock should
    be ~equal across phases.
  - `vortex_scan/filter_arithmetic` — `col + 1 == N`. Phase 1: AExpr-direct
    path is absent, convertor returns None, scan decodes everything and
    Polars filters post-decode. Phase 2: convertor emits `eq(checked_add(...),
    lit(N))` and Vortex skips zones. Key Phase 1 → Phase 2 measurement.

100k rows × 2 columns synthetic file; sufficient to span multiple Vortex
zones (~8,192 rows each by default) so zone-pruning has something to skip.

The bench is gated on `required-features = ["vortex", "lazy"]`; runnable
combo is the same as Phase 1's exit-criterion (a):

```sh
cargo bench -p polars \
    --features vortex,cloud,parquet,dtype-full,strings \
    --bench io_vortex
```

The `cloud,parquet,dtype-full,strings` set is needed because polars-stream's
`lower_expr.rs:1530+` references `IRStringFunction::Strptime` inside a
`dtype-date/datetime/time`-gated arm — that arm pulls in `strings` from
polars-plan transitively. Documented in the bench's module-level comment.

Saving baselines + comparing across phases:

```sh
# On Phase 1's tip:
cargo bench ... -- --save-baseline phase-1
# On a later commit:
cargo bench ... -- --baseline phase-1
```

**Dep additions** (polars umbrella crate, Cargo.toml):
- `criterion = "0.5"` (dev-dep, no default features, `cargo_bench_support`
  only — minimum surface for `criterion_group!` / `criterion_main!`)
- `polars-vortex` (dev-dep, for `write_vortex` in bench setup; required-features
  gates the bench so polars-vortex only compiles when the bench would run)
- `tempfile = "3"` (dev-dep, mirrors polars-vortex's existing dev-dep usage)
- New `[[bench]]` entry for `io_vortex`

**Public-API surface fix** (`crates/polars-lazy/src/prelude.rs`):
- Re-export `ScanArgsVortex` from polars-lazy's prelude under `#[cfg(feature = "vortex")]`.
  Pre-PR-1.5 the type was reachable only through `polars_lazy::scan::vortex::ScanArgsVortex`
  but `scan` is `pub(crate)`, so external Rust callers couldn't construct it.
  Py-polars works around this by calling `DslBuilder::scan_vortex` directly,
  bypassing the public `LazyFrame::scan_vortex` API. The bench's external
  caller pattern is the actual user-facing usage; the re-export closes the
  gap so future Rust callers (other benches, integration tests, downstream
  crates) can construct ScanArgsVortex.

**README update** (`crates/polars-vortex/README.md`):
Added a "Run the Vortex Criterion benches" section to the existing "Test
recipes" listing, with the canonical `--save-baseline` / `--baseline`
invocation for cross-phase comparison.

**Verification**:
- `cargo check -p polars-lazy --features vortex` clean.
- `cargo check -p polars --benches --features vortex,lazy,cloud,parquet,dtype-full,strings` clean.
- `cargo fmt --all -- --check` clean.

Not run locally (would require ~30s+ wall-clock):
- Actual `cargo bench` execution — verified at compile + setup; saved as
  the user's next-step measurement to capture the Phase 1 baseline numbers
  before reviewing Phase 2's perf wins.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@lwwmanning lwwmanning changed the title Vortex integration: Phase 1 (functional + benches) feat(vortex): Phase 1 — native file-format integration + Criterion benches May 18, 2026
@github-actions github-actions Bot added the enhancement New feature or request label May 18, 2026
@github-actions
Copy link
Copy Markdown

The uncompressed lib size after this PR is 59.3437 MB.

@lwwmanning lwwmanning changed the title feat(vortex): Phase 1 — native file-format integration + Criterion benches feat(rust): Vortex Phase 1 — native file-format integration + Criterion benches May 18, 2026
@github-actions
Copy link
Copy Markdown

The uncompressed lib size after this PR is 59.5097 MB.

…ile-stats + multi-file tests into Phase 1

Re-stack: bring PR-3.1 + PR-3.2 work (originally on the Phase 2
follow-on branch) into the Phase 1 PR so the Phase 1 PR ships a
complete, robust, tested Vortex integration as a single reviewable
unit, with Phase 2's AExpr-direct convertor as a pure perf
follow-on.

Origin commits on vortex-integration branch (preserved as
backup-vortex-integration-pre-restack):
  746cc3e — PR-3.1 cycle 1 file-stats foundation
  d1dda5e — PR-3.1 cycle 2 must-fix (n_sources==1 gate)
  fad435b — PR-3.2 cycle 1 multi-file + missing_columns tests
  eef9643 — PR-3.2 cycle 2 (removed broken extra_columns tests)

Changes:

1. New module crates/polars-vortex/src/read/file_stats.rs (~200 LoC
   impl + ~110 LoC tests, copied verbatim). Helper
   footer_to_table_statistics(footer, file_schema) walks Vortex's
   FileStatistics::stats_sets() and builds a single-row DataFrame
   matching the {len, {col}_min, {col}_max, {col}_nc} contract.
   Soundness: only Precision::Exact values emitted; Inexact and
   missing become null (mem-engine treats as "unknown range").
   Handles all primitive dtypes + Boolean + String.

2. Extended vortex_file_info in scans.rs to return
   (FileInfo, Option<VortexFooterRef>, Option<DataFrame>) where the
   3rd element is the per-file stats DataFrame. Stats are extracted
   BEFORE row_index synthesis so the schema matches FILE columns.

3. Caller wires unified_scan_args.table_statistics =
   Some(TableStatistics(Arc::new(stats_df))) gated on n_sources == 1.
   Vortex is the FIRST Polars format to populate this — Parquet
   hard-codes None at polars-plan/src/dsl/file_scan/mod.rs.

4. Removed the Vortex-specific force-disable of
   create_skip_batch_predicate at
   crates/polars-mem-engine/src/planner/lp.rs. The two pruning
   layers (file-level skip_batch_predicate via {col}_min/max/nc
   DataFrame vs Vortex's zone-level LayoutReader::pruning_evaluation
   inside ScanBuilder) are complementary, not redundant.

5. Added 6 Python e2e tests to test_vortex.py:
   - test_scan_with_file_stats_smoke — single-file populate path
   - test_scan_with_file_stats_multifile_does_not_panic — verifies
     n_sources==1 gate prevents the original cycle-1 hard panic at
     polars-mem-engine/src/scan_predicate/functions.rs:397
   - test_multifile_scan_shape_and_ordering — path-list ordering
   - test_multifile_scan_missing_columns_insert/_raise — schema
     evolution policy plumb-through

10 new unit tests added (in file_stats.rs).

Cycle 1 limitations (carry-forward from origin work):
- Single-file scans only — multi-file stats aggregation tracked
  as Deferred work (~50 LoC follow-up to read all source footers
  in parallel).
- extra_columns is no-op for non-Parquet formats per upstream
  Polars limitation at lower_ir.rs:925-936 — tracked as Deferred.

Verification:
  cargo check -p polars --features vortex,cloud,parquet,dtype-full,strings → clean
  cargo test -p polars-vortex --lib file_stats → 10 passed
  cargo fmt --check → clean
  uv tool run ruff check py-polars/ → clean
…orward fixes (4 items)

PR-2.0 housekeeping commit 2 of 3. Addresses 4 cycle-3 should-fix code-doc
carry-forward items:

1. Producer-error inline comment at vortex sink mod.rs:127-137 trimmed from
   18 lines (~370 chars) to 9 lines, removing the inaccurate "whichever
   fires first" framing. The producer/writer error path is deterministic
   per await ordering (producer.await? sees the producer's terminal value
   before write_handle.await runs); user-visible error is always the
   producer's PolarsError, channel-Err is defense in depth. Aligns with
   cycle-2 correctness lens trace.

2. lib.rs:14-21 narrowing comment replaced the path-by-path enumeration
   (which drifted: claimed "8 paths" but enumerated 12) with the simpler
   invariant: "anything outside vortex::{array, error, file, io, layout}
   fails to compile". Plus a re-audit instruction for adding new paths.

3. read/predicate.rs:1-8 added TODO(PR-2.6) scaffolding marker explicitly
   naming the file as the deletion target for the PR-13 Option B → A
   cutover, with the call-site at io_sources/vortex/mod.rs:242 referenced
   for the switch.

4. io_sinks/writers/vortex/mod.rs:115 added a 6-line comment documenting
   the morsel_rx.recv() Err-as-EOS contract: upstream-channel-drop is
   clean EOS; producer errors flow via the dedicated chunk_tx channel +
   spawned-task Err return, NOT through morsel_rx. Names the Polars-wide
   sink-task pattern (CSV/IPC/NDJSON/Parquet all do the same).

Verified clean: `cargo check -p polars-stream --features vortex,cloud`
finishes in 4s with only pre-existing warnings (no new clippy/check
issues introduced).

(cherry picked from commit 3d8da4d)
…Dedicated single-cache via FileScanIR thread-through

Closes cycle-3's "Dedicated double-resolve" should-fix (maint
hidden-assumption + arch fragmentation; same root cause from two
angles). Pre-PR-2.0, `VortexCacheMode::Dedicated(N).resolve()` was
called TWICE per logical scan — once at IR-build for the postscript
schema-discovery read, once at streaming-source-time for the data
read — producing TWO independent Moka cache instances per scan.
Segments fetched during discovery didn't carry into the data read;
the user paid 2x cache allocation cost AND lost the discovery →
data-read prefetching benefit. Strictly farther from documented
Dedicated semantics ("one per-scan cache").

This commit threads one resolved cache through the IR. Files touched:

1. polars-vortex/src/read/mod.rs: introduce VortexSegmentCacheRef
   newtype wrapping Arc<dyn SegmentCache>. Newtype (not type alias)
   because dyn SegmentCache doesn't impl Debug — wrapper provides an
   opaque Debug impl, Deref to the inner Arc, and From<Arc<dyn>>.

2. polars-plan/src/dsl/file_scan/mod.rs: add segment_cache: Option<
   VortexSegmentCacheRef> field to FileScanIR::Vortex (Some at the
   schema-discovery-read path; None at the user-supplied-schema path).
   Add the same to FileScanEqHashWrap::Vortex (pointer-equality via
   the existing arc_as_ptr helper) so plan-cache identity is
   preserved.

3. polars-plan/src/plans/conversion/dsl_to_ir/scans.rs: resolve once
   at the caller of vortex_file_info; pass the cloned wrapper to both
   the IR-build postscript read AND into FileScanIR::Vortex.segment_
   cache for downstream consumption. Function-level doc-comment added
   to vortex_file_info naming the contract; use the type alias at the
   signature site to drop the unwieldy fully-qualified path.

4. polars-stream/src/physical_plan/lower_ir.rs: destructure
   segment_cache and thread to VortexReaderBuilder.

5. polars-stream/src/nodes/io_sources/vortex/builder.rs: add
   segment_cache field to VortexReaderBuilder; pass through to
   VortexFileReader at build_file_reader time.

6. polars-stream/src/nodes/io_sources/vortex/mod.rs: add segment_cache
   field to VortexFileReader; in initialize(), prefer the threaded
   cache and fall back to options.segment_cache.resolve() only on the
   None path (user-supplied-schema case where no IR-build postscript
   read happened). Global/Off unaffected because .resolve() is
   idempotent for those variants.

7. polars-mem-engine/src/scan_predicate/functions.rs: pattern-match
   update for the new field.

Plan-doc tweak: corrected the polars-vortex crate inventory at
plan:96 to actual breakdown (43 unit + 23 integration = 66 Rust
tests; verified via cargo test output).

Verified:
- cargo check -p polars --features vortex,cloud,parquet,dtype-full clean
- cargo check -p polars-stream --features vortex,cloud clean
- cargo check -p polars-python --features new_streaming clean
- cargo test -p polars-vortex --features dtype-date,dtype-datetime,
  dtype-time,dtype-decimal → 43 + 23 = 66 tests pass (no regressions)

(cherry picked from commit 5787a66)
…e + drop on cache pressure (review finding)

(cherry picked from commit 2a0664c)
…ion tests (review finding)

PR-2.0 cycle-1 must-fix C-003: PR-2.0 acceptance criterion (c) requires a
Dedicated(N) regression test verifying single-cache identity through the
IR thread-through; cycle-1 gauntlet correctness-lens flagged the
contradiction with the Deferred-work entry's "STILL DEFERRED" framing.

This commit adds 3 unit tests in crates/polars-vortex/src/read/mod.rs:

1. `clone_preserves_arc_identity` — VortexSegmentCacheRef::clone must NOT
   clone the inner Arc<dyn SegmentCache>. If it did, the IR thread-through
   (which clones across VortexReaderBuilder → VortexFileReader per file)
   would silently re-introduce the cycle-3 double-cache bug.
2. `from_arc_preserves_identity` — `Into::into` must be a thin newtype
   wrap, not a clone-then-allocate.
3. `deref_returns_inner_arc_by_reference` — `Deref` must hand out
   `&Arc<dyn SegmentCache>` pointing at the SAME inner Arc.

All three verify the single-cache invariant at the wrapper level. The
corresponding end-to-end integration test (`pl.scan_vortex(...,
cache_mode=Dedicated(N))` confirming ONE Moka cache instance across
IR-build + streaming reads) is deferred to PR-3.1's table_statistics
work — PR-3.1 already touches vortex_file_info and is the natural place
to add the end-to-end harness.

Test count: 43 unit + 23 integration = 66 → 46 unit + 23 integration = 69.

Verified:
- All 3 new tests pass
- cargo test -p polars-vortex --features dtype-date,dtype-datetime,dtype-time,dtype-decimal: 46 + 23 = 69 tests, 0 failures

(cherry picked from commit 869120d)
… FileScanIR::Vortex (vortex+python compile error; review finding)

(cherry picked from commit 8036f13)
…3 must-fix; CI rustfmt green-up)

(cherry picked from commit b2252b1)
2026-05-19 PR-stack restructure: this branch (vortex-integration-phase-1)
now contains Phase 1 foundation + Phase 3 file-stats / multi-file tests +
Phase 2 PR-2.0 cleanups (segment_cache thread-through + code/plan-doc
carry-forwards + C-001/C-002/C-003/C2-001 fixes). The remaining Phase 2 PR
(branch vortex-integration) is now a pure AExpr-direct convertor follow-on
stacked on this Phase 1++ tip.

Backup branches preserve pre-restack state:
- backup-vortex-integration-phase-1-pre-restack (this branch's pre-restack tip)
- backup-vortex-integration-pre-restack (Phase 2 branch's pre-restack tip)

Plan rewrite + full Implementation status update deferred to next session
(PR-3.3 implementation + Phase 4 enhanced benches happen in fresh context).

Current State header updated to reflect the new structure; the rest of the
plan doc (Architecture, Phases and PRs, Implementation status, Deferred
work, etc.) is stale relative to the new structure and will be fully
rewritten next session.
@tl-dr-review tl-dr-review Bot added the tldr label May 19, 2026
@github-actions
Copy link
Copy Markdown

The uncompressed lib size after this PR is 59.5099 MB.

@github-actions
Copy link
Copy Markdown

The uncompressed lib size after this PR is 59.5099 MB.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant