feat(rust): Vortex Phase 1 — native file-format integration + Criterion benches#2
Open
lwwmanning wants to merge 146 commits into
Open
feat(rust): Vortex Phase 1 — native file-format integration + Criterion benches#2lwwmanning wants to merge 146 commits into
lwwmanning wants to merge 146 commits into
Conversation
Adds a public `handle()` method that returns an owned `tokio::runtime::Handle` to Polars' global `ASYNC` Tokio runtime. This is a one-line precursor to the Vortex integration: downstream libraries (e.g., Vortex) accept a Tokio handle directly so they can share Polars' runtime instead of spinning up their own. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ation
Creates the foundational `polars-vortex` crate that will host the Vortex
file-format reader and writer. Mirrors the structure of `polars-parquet` —
a peer crate that wraps the Vortex APIs (`vortex-file`, `vortex-scan`,
`vortex-layout`, etc.) behind a Polars-friendly facade.
What's in this commit:
* New `crates/polars-vortex/` crate, gated on a future `vortex` feature.
* `session::session()` — a process-global `VortexSession` whose runtime is
Polars' global `ASYNC` Tokio runtime (via a `vortex_io::runtime::Handle`
built from `ASYNC.handle()`). Avoids a second Tokio runtime and the
single-threaded `CurrentThreadRuntime` bottleneck.
* `session::segment_cache()` — a process-global Moka-backed segment cache
shared by every Vortex scan, sized via `POLARS_VORTEX_CACHE_BYTES`
(default 512 MiB; `0` disables). Cross-query reuse of decompressed
segments is one of Vortex's biggest wins over Parquet.
* `read::options::VortexScanOptions` — the option struct that will live in
the new `FileScanIR::Vortex` variant. Kept compact for the IR's
size-80 assertion. Tunables: `schema` (skip-inference shortcut),
`use_statistics`, `push_predicate`, `push_projection`, `cache` mode,
`aggressive_pushdown`, `scan_concurrency`, `initial_read_size`.
* `read::options::VortexCacheMode` — Global / Off / Dedicated(bytes).
* `write::options::{VortexWriteOptions, VortexLayoutKind, VortexCompression}`
— write-side options stub, filled in by the writer PR.
* Placeholder modules for `read::{metadata, predicate, projection,
read_at, schema}` and `write::*` to be implemented in subsequent PRs.
Workspace wiring:
* `vortex` added to workspace deps as a path dep pointing at a sibling
`/Users/will/git/vortex/` checkout. For a public release this would
point at a published crates.io version.
* `polars-vortex` added to workspace deps.
* Cargo features: `cloud` (object_store integration), `serde`, `dsl-schema`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ng engine
End-to-end plumbing for a `pl.scan_vortex(...)` user-facing API, gated on the
new `vortex` feature flag. After this commit, the LazyFrame -> IR -> streaming
engine path is structurally complete: a `FileScanIR::Vortex` flows through
optimization, scan-source expansion, hive-partition discovery, slice/predicate
pushdown setup, and physical planning. The streaming source node itself is a
stub that errors at runtime ("not yet implemented") — landing the actual
open/scan/decode loop is the next sub-PR.
`cargo check -p polars --features vortex` succeeds. Default build is
unaffected (the `vortex` feature is opt-in everywhere).
What's wired:
* `FileScanDsl::Vortex { options }` and `FileScanIR::Vortex { options,
metadata }` variants in `polars-plan/src/dsl/file_scan/mod.rs`, with
`_file_scan_eq_hash::FileScanEqHashWrap::Vortex` for the manual Eq/Hash
trait (pointer-eq on the cached Footer, matching Parquet/IPC).
* `FileScanIR::flags()` returns `SPECIALIZED_PREDICATE_FILTER` for Vortex
(lets the optimizer push `ColumnPredicateExpr` variants — Equal/Between/
EqualOneOf/StartsWith/EndsWith/RegexMatch — which Vortex's expression IR
fully covers).
* `FileScanIR::streamable()` returns true for Vortex.
* `DslBuilder::scan_vortex` and the `scans::vortex_file_info` helper in the
DSL -> IR conversion pass. Schema inference is stubbed (returns a clear
error directing users to pass `schema=...`); the file-open path is
populated by a follow-up that ships the Vortex-DType -> polars-arrow
schema converter.
* `LazyFrame::scan_vortex` + `scan_vortex_sources` + `scan_vortex_files`
user-facing entry points in `polars-lazy/src/scan/vortex.rs`, with a
full `ScanArgsVortex` mirroring `ScanArgsParquet` plus Vortex-specific
knobs (`use_statistics`, `push_predicate`, `push_projection`,
`aggressive_pushdown`, `initial_read_size`, `scan_concurrency`,
`segment_cache`).
* `polars-stream/src/nodes/io_sources/vortex/{mod,builder}.rs` with a
`VortexReaderBuilder` (`FileReaderBuilder` impl) and `VortexFileReader`
(`FileReader` impl). Capabilities advertised: `ROW_INDEX | PRE_SLICE |
PARTIAL_FILTER | MAPPED_COLUMN_PROJECTION`. The `initialize()` and
`begin_read()` methods currently return "not yet implemented" errors —
the next sub-PR fills in the open/scan/morsel loop.
* `lower_ir.rs` lowers `FileScanIR::Vortex` to `VortexReaderBuilder`,
alongside the existing Parquet/IPC/CSV/etc. arms.
* `slice_pushdown_lp.rs` allows slice pushdown for Vortex scans.
* `polars-mem-engine/src/scan_predicate/functions.rs` zeros out the
Vortex metadata field when invalidating the predicate cache (mirrors
the Parquet/IPC behavior).
* `polars-plan/src/dsl/scan_sources.rs` adds Vortex to the
`expand_paths_with_hive_update` feature gate.
* `polars-vortex/src/lib.rs` re-exports `::vortex` so downstream Polars
crates can refer to upstream Vortex types (`vortex::file::Footer`) via
`polars_vortex::vortex::...` without taking a direct `vortex` dep.
Feature-flag wiring:
* `polars-plan`: adds `vortex = ["polars-vortex"]`, with `polars-vortex`
as an optional dep.
* `polars-lazy`: adds `vortex = ["dep:polars-vortex", "polars-plan/vortex",
"polars-mem-engine/vortex", "polars-stream?/vortex"]`.
* `polars-stream`: adds `vortex = ["polars-mem-engine/vortex",
"polars-plan/vortex", "dep:polars-vortex"]`.
* `polars-mem-engine`: adds `vortex = ["polars-plan/vortex"]`.
* `polars` (umbrella): adds `vortex = ["polars-lazy?/vortex",
"new_streaming"]`.
What's deliberately left for follow-up PRs (PR-2c/d and PR-3+):
* The actual streaming open/scan/morsel loop in `VortexFileReader`
(uses `VortexOpenOptions::open`, `ScanBuilder::into_stream`, the
`PolarsInstrumentedVortexReadAt` decorator).
* Vortex `DType` -> polars-arrow `ArrowSchema` converter (so
`vortex_file_info` can do real schema inference).
* `polars_expr_to_vortex` predicate convertor (PR-3).
* Cloud `VortexReadAt` over `polars_io::cloud::build_object_store` (PR-5).
* Eager / streaming writer (`sink_vortex`, PR-9/10).
* Python bindings (PR-12).
See `.claude/plans/we-want-to-scope-transient-patterson.md` for the full
implementation plan.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e-open Lights up real schema discovery for `pl.scan_vortex(...)`. After this commit, `scan_vortex(path).collect_schema()` works end-to-end on local files (and in-memory buffers). The `.collect()` morsel-emitting path is still pending — that's the final PR-2 sub-task. What's new: * **`polars-vortex/src/read/schema.rs`** — Vortex `DType` → polars-arrow `ArrowSchema` + Polars `Schema` translation. The two Arrow crates are nominally distinct (polars-arrow is the internal fork), so we walk Vortex's `DType` enum recursively and emit polars-arrow types directly, bypassing the upstream-arrow intermediate that `DType::to_arrow_schema` would produce. Covers Null, Bool, all PType variants, Decimal/Decimal256, Utf8View, BinaryView, List, FixedSizeList, Struct, and the temporal extension types (Timestamp/Date32/Date64/Time32/Time64). Union, Variant, and unknown extensions bail with a clear error. * **`polars-vortex/src/read/read_at.rs`** — `PolarsInstrumentedVortexReadAt` decorator over `vortex::io::VortexReadAt`. Each `read_at` runs inside `polars_io::pl_async::with_concurrency_budget(1, …)` (so Vortex shares Polars' global concurrency cap with Parquet et al.) and threads bytes through `OptIOMetrics::record_io_read` for observability. Zero buffer copies — the inner reader's `BufferHandle` passes through. Also exposes `local_file_read_at(path, metrics)` and `in_memory_read_at(bytes, uri, metrics)` factories. * **`polars-stream/src/nodes/io_sources/vortex/mod.rs`** — `VortexFileReader::initialize()` now actually opens the file via `VortexOpenOptions`, attaches the process-global segment cache, caches the `Footer` (so re-conversions of the IR are cheap), and exposes the schema through `file_schema()` / `file_arrow_schema()` / `fast_n_rows_in_file()` (the trait hooks the multi-scan layer uses before `begin_read`). `begin_read()` itself is still a stub — `initialize()` succeeding is enough for schema-only consumers, plus it fully exercises the open path which validates the runtime wiring. * **`polars-plan/src/plans/conversion/dsl_to_ir/scans.rs`** — real `vortex_file_info`. Opens the first source via Polars' shared session, pulls the DType through the schema converter, populates `FileInfo` (schema + reader_schema + row_estimation), caches `Arc<Footer>` for reuse, and inserts the row-index column when requested. The previous stub (which forced users to pass `schema=…`) is gone — schema inference works the same way Parquet/IPC do. Verified: `cargo check -p polars --features vortex,parquet` passes (~6 s warm). Default build unaffected. Cargo: enables `polars-io/async` for `polars-vortex` (the `pl_async` module that backs `with_concurrency_budget`). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…copy bridge
Implements `VortexFileReader::begin_read()`: opens the scan, drives Vortex's
`ScanBuilder::into_array_stream()`, converts each `ArrayRef` into an upstream
`RecordBatch` via `ArrowArrayExecutor::execute_record_batch`, bridges that
zero-copy into a Polars `DataFrame`, and emits morsels with backpressure. After
this commit, `pl.scan_vortex(path).collect()` actually reads data.
The bridge: `polars-vortex/src/read/array_bridge.rs`
Vortex emits upstream-Arrow types (`arrow_array::RecordBatch`); Polars uses its
internal-fork `polars-arrow`. Both crates implement the *same* Arrow C Data
Interface struct layout (`#[repr(C)]`, 9 identical fields, per spec), so we move
between them via the C ABI:
upstream array
-> `arrow_array::ffi::to_ffi(&ArrayData)` -> `FFI_ArrowArray`
-> `mem::transmute` (verified at compile time via `size_of` assertion)
-> polars-arrow `ArrowArray`
-> `polars_arrow::ffi::import_array_from_c(array, dtype)` -> `Box<dyn Array>`
-> `Series::from_arrow(name, array)` -> Polars `Series` -> `Column` -> `DataFrame`
No buffer copies — only the ~80-byte C-ABI struct is moved. The upstream array's
backing buffers stay in place; reference counts on both sides keep them alive
until polars-arrow drops the import.
This matches what `vortex-duckdb` does (DuckDB extension framework does the same
C-ABI handoff), and is the standard cross-crate Arrow interop pattern.
The streaming reader (`polars-stream/src/nodes/io_sources/vortex/mod.rs`):
* `InitializedState` now caches both polars-arrow `ArrowSchema` and upstream
`arrow_schema::Schema` (needed by `execute_record_batch`), plus the per-column
polars-arrow `ArrowDataType` vector (handed to the bridge each batch).
* `begin_read` builds a `ScanBuilder` from the cached `VortexFile`, applies
`pre_slice` as a positive `row_range` (negative slice and predicate pushdown
are PR-3/PR-4), spawns a `TaskPriority::Low` task on the streaming
`async_executor`, and pumps morsels via `FileReaderOutputSend::new_serial`
with proper backpressure (`send_morsel`'s wait_group).
What works now:
* `pl.scan_vortex(path).collect()` on local files and in-memory buffers.
* `pl.scan_vortex(path).slice(offset, len)` — pre_slice pushed to ScanRequest.
* Schema discovery via `vortex_file_info` (the previous commit).
What's still pending: PR-3 (filter pushdown via AExpr -> Vortex Expression),
PR-4 (negative slice + row-index), PR-5 (cloud), the writer (PR-9/10), Python
(PR-12), benches (PR-14).
Cargo: adds `arrow-array = "58"`, `arrow-data = "58"`, `arrow-schema = "58"`
to `polars-vortex` for the bridge. Pinned to the same major Vortex pulls in
transitively, so no duplicate-crate risk.
Verified: `cargo check -p polars --features vortex` (0.14 s warm),
`cargo check -p polars` (default, no vortex) — both pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pression
Adds the AExpr / specialized-predicate → Vortex `Expression` convertor and wires it
into `VortexFileReader::begin_read`. For predicates the Polars optimizer can extract
as `SpecializedColumnPredicate`, we now hand a Vortex filter to `ScanBuilder::with_filter`
instead of materializing every batch and applying the residual post-decode. This is
the meat of the "maximal Vortex pushdown" goal from the plan.
The convertor (`polars-vortex/src/read/predicate.rs`):
* Walks `ScanIOPredicate::column_predicates.predicates` — Polars exposes per-column
`(PhysicalIoExpr, Option<SpecializedColumnPredicate>)`. We pattern-match the
specialized variant (no need to crack open the opaque PhysicalIoExpr).
* Translates the structured forms into Vortex Expression trees:
- `Equal(scalar)` → `eq(get_item(col, root()), lit(scalar))`
- `Between(lo, hi)` → `and(gt_eq(col, lo), lt_eq(col, hi))` (closed range)
- `EqualOneOf(scalars)` → `or_collect` of `eq(col, lit(s))` — all-or-nothing on
scalar conversion so we never push a *narrower* filter than the user wrote.
- `StartsWith` / `EndsWith` / `RegexMatch` → not pushed yet; tracked under PR-3
follow-ups (Vortex's `like` needs separate `LikeOptions` plumbing).
* `polars_scalar_to_vortex` covers the primitive AnyValue variants
(bool / Int8..Int64 / UInt8..UInt64 / Float32 / Float64 / String / Binary). Unknowns
return `None` and that column's specialized predicate stays as residual.
The wiring (`polars-stream/src/nodes/io_sources/vortex/mod.rs`):
* `begin_read` calls `polars_to_vortex_predicate(scan_predicate)` gated on
`VortexScanOptions.push_predicate` (default `true`). The returned `Expression` is
handed to `ScanBuilder::with_filter`.
* We continue to advertise `ReaderCapabilities::PARTIAL_FILTER`, so the multi-scan
layer keeps the original `ScanIOPredicate::predicate` around and re-applies it to
emitted morsels. Pushing a *partial* predicate is safe: Vortex prunes/filters what
it can, the residual catches the rest.
What this unlocks:
* `pl.scan_vortex(path).filter(pl.col("a") == 42).collect()` — `a == 42` becomes a
Vortex filter expression. Vortex's `LayoutReader::pruning_evaluation` (running
inside `ScanBuilder`) uses zone statistics to skip whole chunks where `a == 42`
is impossible. This is the big perf win over the previous commit's "decode
everything, then filter post-decode" pattern.
* `filter(pl.col("a").is_between(0, 100))`, `filter(pl.col("a").is_in([1, 2, 3]))` —
same.
Still residual (multi-scan layer applies post-decode):
* String pattern predicates (StartsWith/EndsWith/RegexMatch) — Vortex's `like`
builder takes `LikeOptions` we haven't plumbed yet; one follow-up.
* Multi-column predicates (e.g. `col("a") > col("b")`) and arithmetic — the
optimizer doesn't lift these into `SpecializedColumnPredicate`, so they reach us
only as `PhysicalIoExpr`. Lifting them is a larger PR that walks `AExpr` at IR
conversion time.
Verified: `cargo check -p polars --features vortex` passes (0.14 s warm).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds string-prefix and string-suffix pushdown to the predicate convertor. Polars
emits `SpecializedColumnPredicate::StartsWith(bytes)` for `pl.col("x").str.starts_with(...)`
and similarly for EndsWith; we convert to `LIKE 'prefix%'` / `LIKE '%suffix'` and
hand to Vortex's `like` expression builder.
Safety: we refuse pushdown when the bytes contain SQL-LIKE wildcards (`%`, `_`)
or the escape (`\`). LIKE would *widen* the pattern in those cases — which is
still correct (the multi-scan layer always re-applies the original predicate
post-decode), just wasteful of I/O. Rather than push a wasteful pattern, we
leave that to the residual filter.
Now pushable: `filter(pl.col("name").str.starts_with("foo"))`,
`filter(pl.col("name").str.ends_with(".csv"))`.
Verified: `cargo check -p polars-vortex` passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…adAt
Adds S3 / GCS / Azure / HTTP support to `scan_vortex`. After this commit,
`pl.scan_vortex("s3://bucket/key.vortex", storage_options={...})` works.
Implementation:
* `polars-vortex/src/read/read_at.rs::cloud_read_at(path, cloud_options, io_metrics)`
— async factory gated on `cfg(feature = "cloud")`. Goes through Polars'
`polars_io::cloud::build_object_store` so all of `CloudOptions` (auth, retry
config, endpoint overrides, credential providers) is honored. The resulting
`Arc<dyn ObjectStore>` is handed to `vortex::io::object_store::ObjectStoreReadAt`
along with the runtime handle, then wrapped in `PolarsInstrumentedVortexReadAt`
for the standard concurrency-budget + IOMetrics instrumentation.
* `polars-stream/src/nodes/io_sources/vortex/mod.rs` — `build_read_at` gains a
`#[cfg(feature = "cloud")]` arm for cloud paths; the `cfg(not(feature = "cloud"))`
fallback emits a clear "rebuild with --features vortex,cloud" message.
* `object_store` added to `polars-vortex` as an optional dep, gated on the
`cloud` feature. The version is workspace-pinned so it resolves to the same
fork (`kdn36/arrow-rs-object-store`) Polars patches in via `[patch.crates-io]`.
Vortex's transitively-pulled `object_store` resolves to the same patched
version, so the `Arc<dyn ObjectStore>` types are identical across the bridge —
no duplicate-crate risk.
Feature-flag propagation:
* `polars-lazy/cloud` now also enables `polars-vortex?/cloud` (optional, only
fires when `polars-vortex` is itself in the dep graph).
* `polars-stream/cloud` similarly.
* User-facing: `cargo check -p polars --features vortex,cloud`.
Verified: `cargo check -p polars --features vortex,cloud` passes (10 s warm),
`cargo check -p polars --features vortex` (local-only) still passes, and the
default build is unaffected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the user-facing Python API for Vortex reads. After this commit:
import polars as pl
lf = pl.scan_vortex("data.vortex").filter(pl.col("a") == 42).select("b", "c")
df = lf.collect()
Wires through the full feature surface from the prior commits: schema discovery,
predicate pushdown (Equal/Between/EqualOneOf/StartsWith/EndsWith), cloud reads
via storage_options, segment cache, slice pushdown.
Rust side (`polars-python`):
* `PyLazyFrame::new_from_vortex` — pyo3 staticmethod taking sources, schema,
scan_options, and Vortex-specific knobs (use_statistics, push_predicate,
push_projection, aggressive_pushdown, initial_read_size, scan_concurrency).
Builds a `VortexScanOptions` and goes through `DslBuilder::scan_vortex`.
* `crates/polars-python/src/lazyframe/visitor/nodes.rs::scan_type_to_pyobject`
— adds the `FileScanIR::Vortex` arm so the IR visitor exposes Vortex scans
to Python introspection (used by EXPLAIN, etc.).
* New `polars-python` feature `vortex` that gates the bindings.
Python side (`py-polars/src/polars/io/vortex/`):
* `functions.py::scan_vortex` — `pl.scan_vortex(...)`. Docstring covers the
Vortex-specific tunables and points users at `storage_options` for cloud.
* `functions.py::read_vortex` — `pl.read_vortex(...)`, a thin
`scan_vortex(...).collect()` wrapper for the eager API.
* `__init__.py` re-exports both.
* `polars/io/__init__.py` and top-level `polars/__init__.py` add the new
symbols to their public imports and `__all__` lists, so `import polars as pl;
pl.scan_vortex(...)` works.
Wiring up the feature flags surfaced two unrelated issues that needed fixing:
* `polars-plan/serde` now propagates to `polars-vortex?/serde`. Without this,
`VortexScanOptions` (which lives inside `FileScanIR::Vortex`) doesn't
satisfy `Serialize` / `Deserialize` when `polars-plan`'s serde is on.
* `polars-plan/src/plans/optimizer/expand_datasets.rs:282` had a non-exhaustive
match on `FileScanDsl` that didn't include the Vortex variant. Added the arm.
* `polars-sql/src/context.rs::execute_select` — when `vortex` is enabled, Vortex's
transitive deps pull in `hashbrown 0.17` alongside Polars' workspace-pinned
`hashbrown 0.16`, breaking type inference on three `PlHashSet::new()` /
`PlHashMap::with_capacity()` calls. Added explicit `<&str>` annotations so
inference succeeds regardless of which hashbrown is in scope. Identical
behavior, just more explicit.
Verified: `cargo check -p polars-python --features vortex` passes (0.20 s warm).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round-trips Polars DataFrames into Vortex files using the same C-ABI
zero-copy trick as the read path. After this commit:
use polars_vortex::write::write_vortex;
let df = pl.read_vortex("input.vortex");
write_vortex(&df, "output.vortex", &VortexWriteOptions::default())?;
The reverse bridge (`polars-vortex/src/write/array_bridge.rs`):
* `polars_array_to_upstream(Box<dyn Array>, &Field)` — moves one polars-arrow
column into an upstream `arrow_array::ArrayRef` via `export_array_to_c` +
`mem::transmute` + `arrow_array::ffi::from_ffi`. Zero buffer copy. Same
9-field `#[repr(C)]` layout argument as the read-side import; compile-time
`size_of` asserts on the FFI struct identity.
* `polars_chunk_to_upstream_record_batch(Vec<Box<dyn Array>>, &Schema)` —
builds an upstream `RecordBatch` from a single polars-arrow chunk of columns.
DataFrame -> Vortex chunks (`write/df_to_stream.rs`):
* `dataframe_to_vortex_chunks(&DataFrame) -> (DType, Vec<ArrayRef>)` —
iterates the DataFrame's chunked columns, bridges each chunk to an upstream
`RecordBatch`, wraps as a `StructArray`, then ingests via Vortex's
`FromArrowArray<&StructArray>`. The top-level Vortex `DType::Struct` is
derived once from the first chunk's upstream schema using Vortex's
`FromArrowType<&Schema>`.
* Refuses misaligned chunks (different chunk counts per column) with a clear
error telling the user to call `df.rechunk()` first.
The writer entry point (`write/writer.rs`):
* `write_vortex(&df, path, &options)` — opens a tokio file, wraps the chunks
as an `ArrayStreamAdapter`, and drives `VortexWriteOptions::write` on
Polars' global ASYNC Tokio runtime via `ASYNC.block_on`. Default
`VortexWriteOptions` produces a BtrBlocks-compressed Zoned layout.
What's pending in the write path:
* `LazyFrame::sink_vortex` and `FileWriteFormat::Vortex` — needs an IR variant
similar to the read-side variant (FileScanIR::Vortex), the `FileWriterStarter`
impl in `polars-stream/src/nodes/io_sinks/writers/`, and the lower_ir.rs arm.
Tracked as PR-10 / PR-11 in the plan.
* Schema-side dtype coverage in `df_to_stream::dataframe_to_vortex_chunks` —
the C-ABI bridge handles every dtype at the array level (it just moves bytes),
but the top-level DType derivation goes through Vortex's
`FromArrowType<&Schema>` which already handles primitives, decimal, utf8,
binary, list, struct, fixed-size list, and the temporal extensions. So this
*should* be feature-complete; tested cases are TBD.
* Python bindings for `df.write_vortex(path)` / `lf.sink_vortex(path)` —
follow-up to PR-12.
Verified: `cargo check -p polars --features vortex` passes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…WriteFormat::Vortex
After this commit, `lf.sink_vortex("out.vortex")` works from Python: the streaming
engine drains morsels, converts each DataFrame chunk into a Vortex `ArrayRef`
zero-copy via the reverse C-ABI bridge, and feeds them into `VortexWriteOptions::write`
on the global ASYNC runtime. This is the streaming analogue of the eager
`polars_vortex::write::write_vortex` from PR-9.
What's wired:
* `FileWriteFormat::Vortex(Arc<VortexWriteOptions>)` variant added in
`polars-plan/src/dsl/options/mod.rs`, including the `extension()` arm.
* `polars-stream/src/nodes/io_sinks/writers/vortex/mod.rs::VortexWriterStarter`
— `FileWriterStarter` impl. `start_file_writer`:
- Derives the top-level Vortex `DType` from the file schema upfront via
`polars_schema_to_vortex_dtype` (new helper in `polars-vortex` — converts
polars `Schema` → polars-arrow `ArrowSchema` → upstream `arrow_schema::Schema`
field-by-field via the C-ABI struct transmute → Vortex `DType` via
`FromArrowType<&Schema>`).
- Spawns a producer task that drains `morsel_rx` and converts each `DataFrame`
to Vortex `ArrayRef`s via the C-ABI bridge, sending through a bounded
`futures::channel::mpsc` for backpressure.
- Wraps the channel as an `ArrayStreamAdapter` and calls
`VortexWriteOptions::write(tokio_file, stream)` on `ASYNC.spawn(…)`.
* `create_file_writer_starter` arm in
`polars-stream/src/nodes/io_sinks/writers/mod.rs` — picks the
`VortexWriterStarter` when the `FileWriteFormat::Vortex` variant is present.
* Two non-exhaustive-match arms in `polars-stream/src/physical_plan/fmt.rs`
(FileSink + PartitionedSink labels in EXPLAIN output).
* `PyLazyFrame::sink_vortex` in `polars-python` — staticmethod, mirrors
`sink_parquet` and packages the call into `self.ldf.sink(...,
FileWriteFormat::Vortex(...), ...)`.
* `LazyFrame.sink_vortex(path, ...)` in `py-polars/src/polars/lazyframe/frame.py`
— user-facing Python API. Mirrors `sink_parquet` minus the Parquet-specific
knobs (compression, row groups, etc., which are baked into Vortex's layout
strategy).
Current limits / follow-ups:
* The sink only writes to local file paths; `Writeable::Cloud` paths error with
a clear message directing users to write locally and upload. Wiring cloud
sinks needs a `vortex-io::tokio::AsyncWrite` adapter for the cloud writer —
parallel to the read-side cloud bridge in PR-5.
* `VortexWriteOptions` is currently passed with defaults; surfacing `layout` /
`compression` knobs through Python is the next polish PR.
* Partitioned writes (`SinkTypeIR::Partitioned`) should work once the basic
sink lands, but that's untested here.
Verified: `cargo check -p polars --features vortex,cloud,parquet` passes
(0.14 s warm). Default build unaffected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…che knob
User-facing tunable for the process-global Vortex segment cache (defaults to
512 MiB, also tunable via POLARS_VORTEX_CACHE_BYTES env). After this commit:
import polars as pl
pl.set_vortex_cache_bytes(2 * 1024**3) # 2 GiB
pl.set_vortex_cache_bytes(0) # disable
Wires straight through to `polars_vortex::session::set_global_cache_bytes`,
which atomically swaps the global `Arc<dyn SegmentCache>` (MokaSegmentCache vs
NoOpSegmentCache).
* `polars-python/src/functions/meta.rs::set_vortex_cache_bytes` — pyfunction
gated on `feature = "vortex"`.
* Registered in `polars-python/src/c_api/mod.rs` alongside the other meta
functions.
* `py-polars/src/polars/io/vortex/functions.py::set_vortex_cache_bytes` —
Python wrapper with docstring.
* Re-exported from `polars.io.vortex` and the top-level `polars` namespace
(so `pl.set_vortex_cache_bytes(...)` works).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`pl.scan_vortex(path).tail(N).collect()` and `pl.scan_vortex(path).slice(-N, M)` now push the slice through to Vortex's `ScanBuilder::with_row_range` instead of materializing every row and slicing post-decode. Implementation is trivial because the file's row count is already cached in `InitializedState` (it comes from the footer, free). Negative slices are converted to positive via `polars_utils::slice_enum::Slice::restrict_to_bounds` — the same helper Polars uses elsewhere. Then we hand the positive `Range<u64>` to `ScanBuilder::with_row_range` like any other slice. `VortexReaderBuilder` now advertises `ReaderCapabilities::NEGATIVE_PRE_SLICE` alongside `PRE_SLICE`, so the multi-scan layer stops splitting negative slices across files when the Vortex reader can handle them directly. The only remaining capability gap vs. Parquet is `EXTERNAL_FILTER_MASK` — that needs Vortex `Selection`-bitmap plumbing and is tracked under PR-13. Verified: `cargo check -p polars-stream --features vortex` passes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the convenience eager API:
import polars as pl
df = pl.read_vortex("input.vortex").filter(pl.col("a") > 0)
df.write_vortex("output.vortex")
Implemented as `self.lazy().sink_vortex(...)` with eager optimizations and
the streaming engine, mirroring how `df.write_ipc` and `df.write_csv` work.
The cloud sink path still errors at the Rust layer (write-side cloud is
pending); the `storage_options` / `credential_provider` parameters are
accepted for API symmetry but will surface that error when used.
This completes the Python write surface. After this commit the user-facing
API matrix is:
Read: pl.scan_vortex(path) (lazy)
pl.read_vortex(path) (eager)
Write: lf.sink_vortex(path) (streaming, lazy)
df.write_vortex(path) (eager, internally streaming)
Config: pl.set_vortex_cache_bytes(N) (segment-cache knob)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a comprehensive README.md for the new `polars-vortex` crate covering: * Quickstart — Rust dep snippet + Python end-to-end examples (read, filter, slice, write, set_vortex_cache_bytes). * "Why Vortex" — what makes the integration substantively different from "another Parquet": filter pushdown, negative-slice pushdown, decompressed-segment cache reuse. * "How it plugs into Polars" — the full architectural pipeline as a diagram, from `LazyFrame::scan_vortex(...)` through DSL→IR conversion, mem-engine delegation to polars-stream, `VortexReaderBuilder` → `VortexFileReader` → C-ABI bridge → Morsel emission. Sink path mirrored. * The four key design decisions (one Tokio runtime, C-ABI zero-copy bridge, the PolarsInstrumentedVortexReadAt decorator, ColumnPredicates → Vortex Expression) with code excerpts and rationale. * Cargo features table. * Configuration — segment cache (process-global, env-tunable), concurrency budget, cloud auth via `storage_options`. * Pushdown coverage table at a glance: ✅ for projection / pos+neg slice / Equal/Between/IN/StartsWith/EndsWith / zone-level pruning / hive / schema evolution / row index; ❌ residual for regex / arithmetic / CAST / struct field access (with reasons + tracking). * Crate layout diagram + IR/DSL touchpoint inventory. * Public API surface for both Rust and Python. * "What works today" checklist. * "Known limits / pending follow-ups" (cloud sink, write options plumbing through Python, aggressive AExpr pushdown, file-level optimizer stats). * End-to-end test recipes (roundtrip, filter-pushdown verification via explain, warm-cache benchmark). * "Pointers — reading the source" map for new contributors. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bridge Polars' AsyncWriteable::Cloud (tokio::io::AsyncWrite) to Vortex's VortexWrite via `tokio_util::compat::Compat` + `vortex::io::AsyncWriteAdapter`. `VortexSink` is a small enum that the streaming sink hands to `VortexWriteOptions::write` regardless of whether the underlying target is a local `tokio::fs::File` (direct `impl VortexWrite`) or a cloud writer (compat-wrapped through Vortex's `AsyncWriteAdapter`). Collapsing both into one enum keeps `VortexWriteOptions::write`'s generic single-typed and lets the sink writer code stay generic-free. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Update README "what works" / "known limits" sections to reflect that
`lf.sink_vortex("s3://...")` now works alongside local sinks via the
`VortexSink` enum bridge.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…size) to Python `lf.sink_vortex(...)` and `df.write_vortex(...)` previously always used `VortexWriteOptions::default()`. Now accept: - `compression`: "btrblocks" (default, adaptive per-column) or "uncompressed" - `layout`: "adaptive" (default), "flat", "chunked", or "zoned" - `chunk_size`: rows-per-chunk target (None lets Vortex pick) - `include_dtype`: whether to embed the Vortex DType in metadata (default True) Strings are parsed in `PyLazyFrame::sink_vortex`, with a clear error message on unknown values, so we don't expose Vortex's Rust enums through the Python ABI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…p tests Previously `VortexWriteOptions` was accepted by `write_vortex` and the streaming sink but threaded through to nothing — both call sites built a fresh `session.write_options()` from scratch. The Python API surfaced layout/ compression knobs that didn't change a byte on disk. Reshape the options to match what Vortex's `WriteStrategyBuilder` actually configures: - `compression`: BtrBlocks (default, all schemes) vs Uncompressed (`BtrBlocksCompressorBuilder::empty()` — no schemes selected) - `row_block_size`: granularity of zone-level pruning (default 8192 rows). Renamed from `target_chunk_size`; the old name was misleading. - `include_dtype`: whether to embed the DType. Default `true` (the previous derived `Default` was silently `false`, which made reads fail with "doesn't embed a DType and none provided"). Drop `VortexLayoutKind` entirely — Vortex's strategy builder always produces a layered Flat→Chunked→Buffered→Zoned strategy; there is no "select your layout shape" knob to surface. Leaving it in the public API was misleading. New `write/strategy.rs::build_write_options` is the single place that turns our public options into Vortex's `VortexWriteOptions`. Used by both `write_vortex` (eager) and the streaming sink. Tests in `crates/polars-vortex/tests/roundtrip.rs` exercise default options, uncompressed, and small-block configurations; the `include_dtype` bug was caught by these. They live in polars-vortex (not crates/polars) so they don't need the streaming engine on the read side — they open the file directly through the Vortex session. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds tests for: - Schema preservation (verify field names match after open) - Nullable Int32 and Utf8 columns - Boolean - Empty DataFrame (zero rows) - include_dtype=false (smaller files; readers must supply DType out-of-band) Total roundtrip suite now 8 tests; all pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
py-polars/tests/unit/io/test_vortex.py covers the user-facing API end-to-end via the Python → Rust → streaming engine pipeline: - Basic roundtrip (int / float / str columns) - Nullable columns - Uncompressed compression mode - Small row_block_size - Invalid compression error message - scan + filter - scan + projection - scan + negative slice (tail) A module-level skipif probes whether the binary was built with the `vortex` feature by attempting a tiny write — Python `scan_vortex` is always exported, so a hasattr check would be misleading. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- VortexWriteOptions: drop the fictional VortexLayoutKind enum reference, rename target_chunk_size → row_block_size, note manual Default impl - Add `write/strategy.rs` to the crate-layout sketch - Update sink_vortex / write_vortex signatures in the Python API section to show compression / row_block_size / include_dtype parameters - Drop the "VortexWriteOptions are not yet plumbed to Python" known-limit entry, which is now resolved Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- `read/predicate.rs`: tests cover the bytes_to_like_literal wildcard-safety check (rejects %, _, \\, and invalid UTF-8) and the polars_scalar_to_vortex primitive coverage (bool / int / float / string). - `read/schema.rs`: tests cover the primitive PType → ArrowDataType mapping, Utf8/Binary → view variants, and the two top-level-Struct invariants we check at file-open time. Unit suite is now 9 tests, complementing the 8-test integration suite in tests/roundtrip.rs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "once the actual reader is implemented" comment was from the skeleton-phase commit and is stale — the reader landed in commits pola-rs#5–8. Replace with current-state explanation: why PARTIAL_FILTER is safe (full predicate is re-applied by multi-scan), and what FULL_FILTER / EXTERNAL_FILTER_MASK would require. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The unit-test workflow lists each crate explicitly. Adding -p polars-vortex runs both the 9-test unit suite (predicate + schema convertor) and the 8-test integration suite (roundtrip) under --all-features (which enables the cloud feature too). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… hardening
Addresses the multi-section review report (see plan §15) in one pass:
**Dead options removed from VortexScanOptions** (and matching API surface):
- `use_statistics`, `aggressive_pushdown`, `push_projection` were declared but
never consulted anywhere on the Rust side, while exposed via Python with
defaults that misled users. Removed everywhere — `polars-vortex`,
`polars-lazy::ScanArgsVortex`, `polars-python::PyLazyFrame::new_from_vortex`,
and the `pl.scan_vortex` / `pl.read_vortex` Python signatures.
- `scan_concurrency` was similarly declared but unused; now actually wired into
`ScanBuilder::with_concurrency` in the streaming source.
**Cloud schema discovery** (`vortex_file_info` in polars-plan) now works for
`s3://` / `gs://` / `az://` paths — it routes through
`polars_vortex::read::read_at::cloud_read_at`, with a clean error if
polars-vortex was built without the `cloud` feature. Adds
`polars-vortex?/cloud` to the polars-plan `cloud` feature.
**Predicate scalar convertor** now handles `Date`, `Datetime`, `Time`, and
`Decimal` scalars (gated on the corresponding polars-core dtype-* features),
producing Vortex `Date` / `Timestamp` / `Time` extension scalars and
`DecimalValue::I128`. Polars `Duration` has no Vortex analogue → residual.
New polars-vortex features `dtype-date` / `dtype-datetime` / `dtype-time` /
`dtype-decimal` are propagated from `polars-lazy` so `polars/dtype-full`
turns them all on.
**C-ABI transmute hardening**:
- Strengthen compile-time checks: both `size_of` AND `align_of` asserts (size
alone wouldn't catch alignment divergence between the two `#[repr(C)]`
layouts).
- Runtime sanity check after every array import: the upstream-FFI `length`
field is snapshotted before transmute, then compared against
`imported.len()` after import. Catches a layout-incompatibility regression
cleanly instead of letting it propagate as silent corruption.
- Better doc comments: the read-side bridge explains why a transmute is the
practical option (polars-arrow's `ArrowArray` fields are `pub(super)` so we
can't construct one field-by-field from outside) and how the release-callback
handoff stays sound.
- The write-side bridges (per-array + schema-only in df_to_stream) get the
same treatment.
**Tests now verify data, not just metadata**: roundtrip tests build a
DataFrame, write it, scan it back via Vortex's `into_array_stream()` +
`record_batch_to_dataframe` C-ABI bridge, and assert `equals_missing`. A
buffer-bridging bug would now surface as a test failure rather than silent
mismatch. Helper `read_back()` mirrors the streaming reader's path.
**Schema-alignment validation** in `record_batch_to_dataframe`: column
count is now a hard error (was `debug_assert`), and field-name parity gets
a `debug_assert` so a future reorder-bug surfaces with a clear message rather
than as a downstream dtype mismatch.
**No-panic write path**: `df_to_stream::dataframe_to_vortex_chunks` no
longer relies on `top_dtype.expect("at least one chunk")`. The top-level
dtype is derived from the schema upfront, so n_chunks == 0 falls through
cleanly. The per-column `chunks().get(...)` now returns a proper PolarsError
on miss instead of `.expect(...)`.
**Misc cleanups**:
- Drop the `read/projection.rs` stub file and three dead `let _ = ...` patterns.
- Drop the `use UpstreamSchema as _` and `use IOMetrics as _` no-op imports.
- Refresh stale module-level doc and a stale capability comment.
- Python `_vortex_available()` skipif now catches only `AttributeError` /
`ComputeError` (not bare `except`), so genuine test bugs fail loud.
- `polars-python/vortex` feature now propagates `polars-vortex/serde` so the
lazyframe visitor's JSON serialization of Vortex scan options compiles.
- Annotate the load-bearing `<&str>` type hints in polars-sql with a comment
explaining the hashbrown 0.16 vs 0.17 type-inference conflict.
README synced with the current option set + pushdown coverage matrix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Drop stale `projection.rs` entry from the crate-layout sketch (file was removed in 33b56a8). - Expand the "What works today" list with cloud schema discovery, the Date/Datetime/Time/Decimal scalar pushdown gated on `dtype-*` features, and the now-wired `scan_concurrency` knob. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…che, row-count truncation, field metadata, naming) Addresses bugs identified during the fresh-eyes second-pass review: **Decimal negative-scale wraparound on read** (`read/schema.rs:71`): Vortex's `dec.scale()` returns `i8` and Arrow spec allows negative scales, but polars-arrow's `Decimal(usize, usize)` only represents non-negative scales. The bare `as usize` cast would silently wrap a negative scale to a huge value. Now explicitly errors with a clear message. **Decimal precision/scale wraparound on write** (`read/predicate.rs::decimal_scalar`): Vortex's `DecimalDType::new(u8, i8)` has narrower ranges than Polars' `Decimal(usize, usize)`. Bare `as u8` / `as i8` casts silently wrapped for out-of-range values. Now uses `try_into()` → returns `None` on overflow so the convertor falls back to residual (always correct). **`u64 as usize` row-count truncation** (`polars-stream vortex source`): On 32-bit platforms, files larger than 2^32 rows would silently truncate when computing the `restrict_to_bounds` argument for negative slices. Now clamped via `usize::try_from(...).unwrap_or(usize::MAX)`. **Dead `VortexScanOptions.cache` field** — now actually wired. Added `VortexCacheMode::resolve()` that returns the right `Arc<dyn SegmentCache>`: `Global` → the process-wide cache, `Off` → fresh NoOpSegmentCache, `Dedicated(N)` → fresh MokaSegmentCache. Streaming source's `initialize()` calls `self.options.segment_cache.resolve()` instead of always hitting the global cache. Field also renamed `cache` → `segment_cache` to avoid colliding with the LazyFrame query-cache `bool` field in ScanArgsVortex. **Field metadata dropped on write** (`write/array_bridge.rs`): `polars_chunk_to_upstream_record_batch` rebuilt `Field::new(name, dtype, nullable)` and threw away any custom metadata the polars-arrow Field carried. Now copies per-field metadata over to the upstream Field via `with_metadata(...)`. **Misc cleanups**: - Wrong doc comment in `read/array_bridge.rs` (`import_array_from_c` takes the FFI struct, not `Box<dyn Array>`). - Deterministic predicate iteration: sort by column name before `and_collect`-ing. Vortex's pruning evaluator may short-circuit left-to-right; non-deterministic hashmap order was a latent reproducibility hazard. - Double `to_arrow` schema computation in write path: split `polars_schema_to_vortex_dtype` into a `_from_arrow` variant so the eager writer can pass through the schema it already built. - Drop stale "Touch the global session" cargo-cult comment and a PR-3 reference that's now PR-13. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses the test-coverage gap surfaced in the second review pass — the roundtrip suite previously exercised only 5 dtypes (Int64, Float64, Utf8, Bool, Int32/Utf8-nullable), and the predicate scalar helpers + schema converter branches were largely untested. **Unit tests for predicate scalar helpers** (`read::predicate::tests`): - `date_scalar_is_vortex_date_days` — verifies Date scalars produce `DType::Extension(vortex.date)`. - `datetime_scalar_units_map_correctly` — verifies all three Polars TimeUnits (Ns/Us/Ms) produce Vortex Timestamp scalars. - `datetime_scalar_with_timezone` — verifies the tz round-trip. - `time_scalar_is_nanoseconds` — verifies Polars Time → Vortex Time(Ns). - `decimal_scalar_roundtrips_precision_and_scale` — verifies the DecimalDType. - `decimal_scalar_rejects_overflowing_precision_and_scale` — locks in the new `try_into` behaviour (bug fix from c124312) by asserting that precision > 255 and scale > 127 return None. - `convertor_returns_pushable_for_date_predicate` — end-to-end via `convert_specialized`. **Schema converter tests** (`read::schema::tests`): adds the missing branches of `vortex_dtype_to_arrow_dtype`: - `null_dtype_maps_to_arrow_null` - `bool_maps_to_arrow_boolean` (both nullabilities) - `decimal_precision_chooses_128_or_256` (boundary at p == 38 vs 39) - `decimal_with_negative_scale_errors_cleanly` (locks in c124312 bug fix) - `list_maps_through_recursively` - `fixed_size_list_maps_through_recursively` - `struct_maps_to_arrow_struct` (nested fields, field-name + dtype assertions) - `date_extensions_map_to_arrow_date32_and_date64` (both time units) - `time_extensions_map_to_time32_or_time64` (all 4 time units) - `timestamp_with_and_without_timezone` (no-tz + UTC, two units) - `vortex_dtype_to_schema_round_trips_nullability` (field-level nullability) **Integration roundtrip suite** (`tests/roundtrip.rs`): widens dtype coverage: - All int and float widths reachable from `Column::new(...)` (Int32/64, UInt32/64, Float32/64). i8/i16/u8/u16 need polars-core's `dtype-i*` features which aren't part of the polars-vortex feature set; the schema-converter unit tests above cover those mappings. - NaN preservation (shape + dtype, since NaN != NaN by IEEE). - Binary (both non-null and nullable). - Multi-chunk DataFrame: builds a 3-chunk series and asserts the writer correctly concatenates across chunks. - Date (dtype-date), Datetime ×3 units no-tz (dtype-datetime), Time (dtype-time). Suite is now 28 unit tests + 18 integration tests (+ 15 without dtype features) = 46 Rust tests. All green under `--features dtype-date,dtype-datetime,dtype-time,dtype-decimal` and under the bare feature set. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eanups, expanded tests
Addresses the remaining ⏳ items from the second review pass:
**Bug-adjacent fixes**:
- `polars-mem-engine/src/planner/lp.rs`: explicit Vortex branch sets
`create_skip_batch_predicate = false`. The previous default was implicit
(table_statistics is None for Vortex), which would have silently flipped
to true if PR-8 ever wires file-level stats. Vortex does zone pruning
inside ScanBuilder via `LayoutReader::pruning_evaluation`; running Polars'
per-row-group skip-predicate machinery on top would be wasted work and
risk double-pruning false positives.
**Cleanups**:
- Folded `read/metadata.rs` (a 10-line file with one type alias) into
`read/mod.rs`. Kept `polars_vortex::read::metadata::VortexFooterRef` as
a compatibility re-export so downstream `polars-plan` paths still resolve.
- Promoted `tokio-util` to a workspace dep. polars-vortex now uses
`tokio-util = { workspace = true, features = ["compat"], optional = true }`
consistent with the workspace's other tokio deps.
**Test coverage** — Suite now 42 unit + 23 integration = 65 Rust tests:
*VortexCacheMode::resolve() — 3 unit tests* (`read::options::cache_mode_tests`):
- `Global` returns the same Arc as `session::segment_cache()` across calls.
- `Off` returns distinct fresh NoOp caches per call, not the global.
- `Dedicated(N)` returns distinct fresh Moka caches per call.
*PolarsInstrumentedVortexReadAt — 3 unit tests* (`read::read_at::tests`):
- A `CountingReadAt` fake `VortexReadAt` records reads. The decorator must
forward each `read_at` call exactly once with the requested length.
- `concurrency()` and `coalesce_config()` correctly delegate to inner.
- `IOMetrics` instrumentation produces non-empty output after reads.
*Predicate convertor variant coverage — 8 unit tests*:
- Equal / Between / EqualOneOf / StartsWith (safe + unsafe bytes) / EndsWith
/ RegexMatch (residual) / EqualOneOf with partial failure (residual).
- Locks in the convertor's "all-or-nothing" semantics: if even one IN-list
scalar fails to convert (e.g., contains AnyValue::Null), we fall back to
residual rather than push a narrower predicate.
*Roundtrip — 4 new integration tests*:
- `roundtrip_datetime_with_timezone` — Datetime(Microseconds, UTC), 3 rows.
- `roundtrip_decimal_basic` — Decimal(10, 2), 4 rows including a negative.
- `roundtrip_decimal_with_nulls` — Decimal(8, 4) with mixed Some/None.
- `roundtrip_decimal_precision_38_boundary` — Decimal(38, 0) tests the
schema converter's `precision <= 38 → Decimal` boundary in production.
- `segment_cache_second_read_produces_same_data` — smoke test for the cache
wiring; two consecutive reads through the global cache must produce
identical results.
Adds `regex` to dev-dependencies for the RegexMatch test.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e 2 wall-clock anchor)
New `crates/polars/benches/io_vortex.rs` with 3 Criterion benches:
- `vortex_scan/no_filter` — baseline full-scan; identical perf on both
phases (no convertor involvement). Normalizes throughput across machines.
- `vortex_scan/filter_lt` — `col < N`. Phase 1: SpecializedColumnPredicate::Lt
pushes down. Phase 2: AExpr-direct convertor's `lt` arm pushes down. Both
paths reach Vortex's `LayoutReader::pruning_evaluation`; wall-clock should
be ~equal across phases.
- `vortex_scan/filter_arithmetic` — `col + 1 == N`. Phase 1: AExpr-direct
path is absent, convertor returns None, scan decodes everything and
Polars filters post-decode. Phase 2: convertor emits `eq(checked_add(...),
lit(N))` and Vortex skips zones. Key Phase 1 → Phase 2 measurement.
100k rows × 2 columns synthetic file; sufficient to span multiple Vortex
zones (~8,192 rows each by default) so zone-pruning has something to skip.
The bench is gated on `required-features = ["vortex", "lazy"]`; runnable
combo is the same as Phase 1's exit-criterion (a):
```sh
cargo bench -p polars \
--features vortex,cloud,parquet,dtype-full,strings \
--bench io_vortex
```
The `cloud,parquet,dtype-full,strings` set is needed because polars-stream's
`lower_expr.rs:1530+` references `IRStringFunction::Strptime` inside a
`dtype-date/datetime/time`-gated arm — that arm pulls in `strings` from
polars-plan transitively. Documented in the bench's module-level comment.
Saving baselines + comparing across phases:
```sh
# On Phase 1's tip:
cargo bench ... -- --save-baseline phase-1
# On a later commit:
cargo bench ... -- --baseline phase-1
```
**Dep additions** (polars umbrella crate, Cargo.toml):
- `criterion = "0.5"` (dev-dep, no default features, `cargo_bench_support`
only — minimum surface for `criterion_group!` / `criterion_main!`)
- `polars-vortex` (dev-dep, for `write_vortex` in bench setup; required-features
gates the bench so polars-vortex only compiles when the bench would run)
- `tempfile = "3"` (dev-dep, mirrors polars-vortex's existing dev-dep usage)
- New `[[bench]]` entry for `io_vortex`
**Public-API surface fix** (`crates/polars-lazy/src/prelude.rs`):
- Re-export `ScanArgsVortex` from polars-lazy's prelude under `#[cfg(feature = "vortex")]`.
Pre-PR-1.5 the type was reachable only through `polars_lazy::scan::vortex::ScanArgsVortex`
but `scan` is `pub(crate)`, so external Rust callers couldn't construct it.
Py-polars works around this by calling `DslBuilder::scan_vortex` directly,
bypassing the public `LazyFrame::scan_vortex` API. The bench's external
caller pattern is the actual user-facing usage; the re-export closes the
gap so future Rust callers (other benches, integration tests, downstream
crates) can construct ScanArgsVortex.
**README update** (`crates/polars-vortex/README.md`):
Added a "Run the Vortex Criterion benches" section to the existing "Test
recipes" listing, with the canonical `--save-baseline` / `--baseline`
invocation for cross-phase comparison.
**Verification**:
- `cargo check -p polars-lazy --features vortex` clean.
- `cargo check -p polars --benches --features vortex,lazy,cloud,parquet,dtype-full,strings` clean.
- `cargo fmt --all -- --check` clean.
Not run locally (would require ~30s+ wall-clock):
- Actual `cargo bench` execution — verified at compile + setup; saved as
the user's next-step measurement to capture the Phase 1 baseline numbers
before reviewing Phase 2's perf wins.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
The uncompressed lib size after this PR is 59.3437 MB. |
|
The uncompressed lib size after this PR is 59.5097 MB. |
…ile-stats + multi-file tests into Phase 1 Re-stack: bring PR-3.1 + PR-3.2 work (originally on the Phase 2 follow-on branch) into the Phase 1 PR so the Phase 1 PR ships a complete, robust, tested Vortex integration as a single reviewable unit, with Phase 2's AExpr-direct convertor as a pure perf follow-on. Origin commits on vortex-integration branch (preserved as backup-vortex-integration-pre-restack): 746cc3e — PR-3.1 cycle 1 file-stats foundation d1dda5e — PR-3.1 cycle 2 must-fix (n_sources==1 gate) fad435b — PR-3.2 cycle 1 multi-file + missing_columns tests eef9643 — PR-3.2 cycle 2 (removed broken extra_columns tests) Changes: 1. New module crates/polars-vortex/src/read/file_stats.rs (~200 LoC impl + ~110 LoC tests, copied verbatim). Helper footer_to_table_statistics(footer, file_schema) walks Vortex's FileStatistics::stats_sets() and builds a single-row DataFrame matching the {len, {col}_min, {col}_max, {col}_nc} contract. Soundness: only Precision::Exact values emitted; Inexact and missing become null (mem-engine treats as "unknown range"). Handles all primitive dtypes + Boolean + String. 2. Extended vortex_file_info in scans.rs to return (FileInfo, Option<VortexFooterRef>, Option<DataFrame>) where the 3rd element is the per-file stats DataFrame. Stats are extracted BEFORE row_index synthesis so the schema matches FILE columns. 3. Caller wires unified_scan_args.table_statistics = Some(TableStatistics(Arc::new(stats_df))) gated on n_sources == 1. Vortex is the FIRST Polars format to populate this — Parquet hard-codes None at polars-plan/src/dsl/file_scan/mod.rs. 4. Removed the Vortex-specific force-disable of create_skip_batch_predicate at crates/polars-mem-engine/src/planner/lp.rs. The two pruning layers (file-level skip_batch_predicate via {col}_min/max/nc DataFrame vs Vortex's zone-level LayoutReader::pruning_evaluation inside ScanBuilder) are complementary, not redundant. 5. Added 6 Python e2e tests to test_vortex.py: - test_scan_with_file_stats_smoke — single-file populate path - test_scan_with_file_stats_multifile_does_not_panic — verifies n_sources==1 gate prevents the original cycle-1 hard panic at polars-mem-engine/src/scan_predicate/functions.rs:397 - test_multifile_scan_shape_and_ordering — path-list ordering - test_multifile_scan_missing_columns_insert/_raise — schema evolution policy plumb-through 10 new unit tests added (in file_stats.rs). Cycle 1 limitations (carry-forward from origin work): - Single-file scans only — multi-file stats aggregation tracked as Deferred work (~50 LoC follow-up to read all source footers in parallel). - extra_columns is no-op for non-Parquet formats per upstream Polars limitation at lower_ir.rs:925-936 — tracked as Deferred. Verification: cargo check -p polars --features vortex,cloud,parquet,dtype-full,strings → clean cargo test -p polars-vortex --lib file_stats → 10 passed cargo fmt --check → clean uv tool run ruff check py-polars/ → clean
…orward fixes (4 items)
PR-2.0 housekeeping commit 2 of 3. Addresses 4 cycle-3 should-fix code-doc
carry-forward items:
1. Producer-error inline comment at vortex sink mod.rs:127-137 trimmed from
18 lines (~370 chars) to 9 lines, removing the inaccurate "whichever
fires first" framing. The producer/writer error path is deterministic
per await ordering (producer.await? sees the producer's terminal value
before write_handle.await runs); user-visible error is always the
producer's PolarsError, channel-Err is defense in depth. Aligns with
cycle-2 correctness lens trace.
2. lib.rs:14-21 narrowing comment replaced the path-by-path enumeration
(which drifted: claimed "8 paths" but enumerated 12) with the simpler
invariant: "anything outside vortex::{array, error, file, io, layout}
fails to compile". Plus a re-audit instruction for adding new paths.
3. read/predicate.rs:1-8 added TODO(PR-2.6) scaffolding marker explicitly
naming the file as the deletion target for the PR-13 Option B → A
cutover, with the call-site at io_sources/vortex/mod.rs:242 referenced
for the switch.
4. io_sinks/writers/vortex/mod.rs:115 added a 6-line comment documenting
the morsel_rx.recv() Err-as-EOS contract: upstream-channel-drop is
clean EOS; producer errors flow via the dedicated chunk_tx channel +
spawned-task Err return, NOT through morsel_rx. Names the Polars-wide
sink-task pattern (CSV/IPC/NDJSON/Parquet all do the same).
Verified clean: `cargo check -p polars-stream --features vortex,cloud`
finishes in 4s with only pre-existing warnings (no new clippy/check
issues introduced).
(cherry picked from commit 3d8da4d)
…Dedicated single-cache via FileScanIR thread-through
Closes cycle-3's "Dedicated double-resolve" should-fix (maint
hidden-assumption + arch fragmentation; same root cause from two
angles). Pre-PR-2.0, `VortexCacheMode::Dedicated(N).resolve()` was
called TWICE per logical scan — once at IR-build for the postscript
schema-discovery read, once at streaming-source-time for the data
read — producing TWO independent Moka cache instances per scan.
Segments fetched during discovery didn't carry into the data read;
the user paid 2x cache allocation cost AND lost the discovery →
data-read prefetching benefit. Strictly farther from documented
Dedicated semantics ("one per-scan cache").
This commit threads one resolved cache through the IR. Files touched:
1. polars-vortex/src/read/mod.rs: introduce VortexSegmentCacheRef
newtype wrapping Arc<dyn SegmentCache>. Newtype (not type alias)
because dyn SegmentCache doesn't impl Debug — wrapper provides an
opaque Debug impl, Deref to the inner Arc, and From<Arc<dyn>>.
2. polars-plan/src/dsl/file_scan/mod.rs: add segment_cache: Option<
VortexSegmentCacheRef> field to FileScanIR::Vortex (Some at the
schema-discovery-read path; None at the user-supplied-schema path).
Add the same to FileScanEqHashWrap::Vortex (pointer-equality via
the existing arc_as_ptr helper) so plan-cache identity is
preserved.
3. polars-plan/src/plans/conversion/dsl_to_ir/scans.rs: resolve once
at the caller of vortex_file_info; pass the cloned wrapper to both
the IR-build postscript read AND into FileScanIR::Vortex.segment_
cache for downstream consumption. Function-level doc-comment added
to vortex_file_info naming the contract; use the type alias at the
signature site to drop the unwieldy fully-qualified path.
4. polars-stream/src/physical_plan/lower_ir.rs: destructure
segment_cache and thread to VortexReaderBuilder.
5. polars-stream/src/nodes/io_sources/vortex/builder.rs: add
segment_cache field to VortexReaderBuilder; pass through to
VortexFileReader at build_file_reader time.
6. polars-stream/src/nodes/io_sources/vortex/mod.rs: add segment_cache
field to VortexFileReader; in initialize(), prefer the threaded
cache and fall back to options.segment_cache.resolve() only on the
None path (user-supplied-schema case where no IR-build postscript
read happened). Global/Off unaffected because .resolve() is
idempotent for those variants.
7. polars-mem-engine/src/scan_predicate/functions.rs: pattern-match
update for the new field.
Plan-doc tweak: corrected the polars-vortex crate inventory at
plan:96 to actual breakdown (43 unit + 23 integration = 66 Rust
tests; verified via cargo test output).
Verified:
- cargo check -p polars --features vortex,cloud,parquet,dtype-full clean
- cargo check -p polars-stream --features vortex,cloud clean
- cargo check -p polars-python --features new_streaming clean
- cargo test -p polars-vortex --features dtype-date,dtype-datetime,
dtype-time,dtype-decimal → 43 + 23 = 66 tests pass (no regressions)
(cherry picked from commit 5787a66)
…e + drop on cache pressure (review finding) (cherry picked from commit 2a0664c)
…ion tests (review finding) PR-2.0 cycle-1 must-fix C-003: PR-2.0 acceptance criterion (c) requires a Dedicated(N) regression test verifying single-cache identity through the IR thread-through; cycle-1 gauntlet correctness-lens flagged the contradiction with the Deferred-work entry's "STILL DEFERRED" framing. This commit adds 3 unit tests in crates/polars-vortex/src/read/mod.rs: 1. `clone_preserves_arc_identity` — VortexSegmentCacheRef::clone must NOT clone the inner Arc<dyn SegmentCache>. If it did, the IR thread-through (which clones across VortexReaderBuilder → VortexFileReader per file) would silently re-introduce the cycle-3 double-cache bug. 2. `from_arc_preserves_identity` — `Into::into` must be a thin newtype wrap, not a clone-then-allocate. 3. `deref_returns_inner_arc_by_reference` — `Deref` must hand out `&Arc<dyn SegmentCache>` pointing at the SAME inner Arc. All three verify the single-cache invariant at the wrapper level. The corresponding end-to-end integration test (`pl.scan_vortex(..., cache_mode=Dedicated(N))` confirming ONE Moka cache instance across IR-build + streaming reads) is deferred to PR-3.1's table_statistics work — PR-3.1 already touches vortex_file_info and is the natural place to add the end-to-end harness. Test count: 43 unit + 23 integration = 66 → 46 unit + 23 integration = 69. Verified: - All 3 new tests pass - cargo test -p polars-vortex --features dtype-date,dtype-datetime,dtype-time,dtype-decimal: 46 + 23 = 69 tests, 0 failures (cherry picked from commit 869120d)
… FileScanIR::Vortex (vortex+python compile error; review finding) (cherry picked from commit 8036f13)
…3 must-fix; CI rustfmt green-up) (cherry picked from commit b2252b1)
2026-05-19 PR-stack restructure: this branch (vortex-integration-phase-1) now contains Phase 1 foundation + Phase 3 file-stats / multi-file tests + Phase 2 PR-2.0 cleanups (segment_cache thread-through + code/plan-doc carry-forwards + C-001/C-002/C-003/C2-001 fixes). The remaining Phase 2 PR (branch vortex-integration) is now a pure AExpr-direct convertor follow-on stacked on this Phase 1++ tip. Backup branches preserve pre-restack state: - backup-vortex-integration-phase-1-pre-restack (this branch's pre-restack tip) - backup-vortex-integration-pre-restack (Phase 2 branch's pre-restack tip) Plan rewrite + full Implementation status update deferred to next session (PR-3.3 implementation + Phase 4 enhanced benches happen in fresh context). Current State header updated to reflect the new structure; the rest of the plan doc (Architecture, Phases and PRs, Implementation status, Deferred work, etc.) is stale relative to the new structure and will be fully rewritten next session.
|
The uncompressed lib size after this PR is 59.5099 MB. |
…d reopened to match phase order)
|
The uncompressed lib size after this PR is 59.5099 MB. |
…ex-integration-phase-2
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Native Vortex file-format support for Polars, as a peer of Parquet. Read and write across local, cloud, and in-memory sources; schema discovery; multi-file scans with hive partitioning and
missing_columnspolicy; projection pushdown; filter pushdown for equality, range, IN-list, and LIKE prefix/suffix predicates; positive and negative slice pushdown; per-scan and process-global tuning of Vortex's decompressed-segment cache; and per-file statistics populated from the Vortex footer so the mem-engine prunes whole files before the streaming source opens them.Code lives almost entirely in a new
crates/polars-vortex/crate so leaf crates don't pull Vortex's dep graph; the cross-crate touchpoints are additive and feature-gated behindvortex. The default build is unaffected. A Criterion bench atcrates/polars/benches/io_vortex.rsanchors a perf baseline for the AExpr-direct filter pushdown work in #4, which stacks on this branch.🤖 Generated with Claude Code