Merge upstream Vortex 0.75.0 into spiceai-54 (DataFusion 53 → 54) by lukekim · Pull Request #65 · spiceai/vortex

lukekim · 2026-06-12T22:51:54Z

What

Establishes spiceai-54 as the DataFusion 54 base by bringing the upstream Vortex 0.75.0 upgrade onto it. This is a 2-parent merge of the 0.75.0 tag (115 upstream commits, 658 files) onto the fork. The headline upstream change is DataFusion 53 → 54 (incl. the Arrow Utf8View/BinaryView migration).

spiceai-54 was forked off spiceai-53, which in the meantime gained the merged perf PRs #61 (per-ExecutionCtx ArrayKernels snapshot) and #62 (intra-file split sub-division) — see the reconciliation note below.

⚠️ Status — reconciliation required before merge

GitHub shows this PR as MERGEABLE (no textual conflicts), but the auto-merge will not compile as-is. spiceai-54 includes #61/#62 in their DF53 forms, which clash semantically with 0.75.0's refactor:

#61 (per-ExecutionCtx ArrayKernels snapshot): ArrayKernels::snapshot() calls self.execute_parent.load_full(), but 0.75.0 refactored execute_parent to ArcSwapMap, which has no load_full. The line-wise auto-merge keeps both sides, producing a call to a non-existent method.
#62 (intra-file split sub-division): split_by.rs and the vortex-file tests likewise need verification against 0.75.0's versions.

The fix is already prepared: the DF54-compatible forms of #61/#62 are the upstream develop ports (#8401 added load_full to ArcSwapMap + adapted snapshot(); #8400 is the 0.75.0-compatible subdivision). Reconciliation = merge spiceai-54 into this branch and apply those forms, then rebuild + retest.
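A minimal sketch of why the reconciliation hinges on `load_full`, using only the standard library. `SwapMap`, `Kernels`, and `KernelSnapshot` are hypothetical stand-ins (an `RwLock<Arc<HashMap>>` replaces the real ArcSwap-backed map), so this illustrates the shape of the fix rather than the actual Vortex code.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for an ArcSwap-backed map: readers take an owned
/// `Arc` to an immutable map, writers replace the whole map (copy-on-write).
pub struct SwapMap<K, V> {
    inner: RwLock<Arc<HashMap<K, V>>>,
}

impl<K: Clone + Eq + Hash, V: Clone> SwapMap<K, V> {
    pub fn new() -> Self {
        Self { inner: RwLock::new(Arc::new(HashMap::new())) }
    }

    /// Returns an owned handle to the current map. Without this method,
    /// `snapshot()` below has nothing to call.
    pub fn load_full(&self) -> Arc<HashMap<K, V>> {
        let guard = self.inner.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Rare mutation path: clone the map, modify the copy, swap it in.
    pub fn insert(&self, key: K, value: V) {
        let mut guard = self.inner.write().unwrap();
        let mut next = (**guard).clone();
        next.insert(key, value);
        *guard = Arc::new(next);
    }
}

/// Hypothetical kernel registry holding one swap map.
pub struct Kernels {
    pub execute_parent: SwapMap<&'static str, u32>,
}

/// Hypothetical per-context snapshot: immutable, no further locking needed.
pub struct KernelSnapshot {
    pub execute_parent: Arc<HashMap<&'static str, u32>>,
}

impl Kernels {
    pub fn snapshot(&self) -> KernelSnapshot {
        KernelSnapshot { execute_parent: self.execute_parent.load_full() }
    }
}
```

A snapshot taken before a later registration does not observe it, which is the point of taking one per execution context.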

Conflicts resolved (the 0.75.0 merge itself)

vortex-datafusion/convert/exprs.rs — adopted upstream 0.75.0's get_field pushdown semantics (can_scalar_fn_be_pushed_down checks only for GetFieldFunc) plus the new is_dynamic_physical_expr guard and direct downcast_ref API. The fork's stricter "all args must be pushable" variant was dropped — see Behavior note.
vortex-datafusion/persistent/sink.rs — kept the fork's direct multi-file VortexSink writer (tasks return Vec<(Path, WriteSummary)>, honoring target_file_size + numbered-path extension), via the intact writer_dtype helper.

Behavior note

Under DF54's Utf8View migration, the fork's stricter can_scalar_fn_be_pushed_down forced nested get_field over struct columns into the DataFusion-side leftover projection, where evaluating it over the Utf8View scan output mismatched the planned Utf8 type — failing 5 upstream nested-projection / schema-evolution pushdown tests. Realigning to upstream's design (which pushes get_field to Vortex and reconciles types via calculate_physical_schema) fixes all 5.
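The difference between the two pushdown rules can be sketched on a toy expression tree. Everything here is hypothetical: `Expr` is not DataFusion's `PhysicalExpr`, and `can_push_upstream` / `can_push_strict` only mimic the behavior described above (function-kind check versus "all args must be pushable").

```rust
/// Hypothetical, simplified expression tree for illustration only.
pub enum Expr {
    Column(&'static str),
    /// `get_field(input, field_name)`
    GetField(Box<Expr>, &'static str),
    /// Any argument the strict rule does not consider pushable.
    Unpushable(Vec<Expr>),
}

/// Upstream-style rule as described above: only the function kind is checked.
pub fn can_push_upstream(expr: &Expr) -> bool {
    matches!(expr, Expr::GetField(..))
}

/// The dropped fork rule: a get_field is pushable only if its input is too.
pub fn can_push_strict(expr: &Expr) -> bool {
    match expr {
        Expr::Column(_) => true,
        Expr::GetField(input, _) => can_push_strict(input),
        Expr::Unpushable(_) => false,
    }
}
```

Whenever the strict rule rejects an expression the upstream rule accepts, that `get_field` stays in the DataFusion-side leftover projection, which is where the Utf8View/Utf8 mismatch described above surfaced.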

Post-merge build fix

vortex-array arrow ArrowMapArray — propagate the now-fallible nulls() result with ? (0.75.0 changed nulls() to return VortexResult<Validity>).
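A sketch of the kind of change involved, with hypothetical types (`Validity`, `ConvertError`, `nulls`, and `convert_map` here are illustrative and do not reproduce the Vortex signatures): once `nulls()` returns a `Result`, every caller has to propagate the error with `?` instead of using the value directly.

```rust
#[derive(Debug, PartialEq)]
pub enum Validity {
    NonNullable,
    AllValid,
    HasNulls(usize),
}

#[derive(Debug, PartialEq)]
pub struct ConvertError(pub String);

/// Hypothetical fallible helper: returns an error instead of panicking when
/// the declared nullability does not match the data.
pub fn nulls(null_count: usize, nullable: bool) -> Result<Validity, ConvertError> {
    match (nullable, null_count) {
        (false, 0) => Ok(Validity::NonNullable),
        (false, n) => Err(ConvertError(format!("non-nullable array has {n} nulls"))),
        (true, 0) => Ok(Validity::AllValid),
        (true, n) => Ok(Validity::HasNulls(n)),
    }
}

/// Caller after the build fix: the `?` forwards the error to its own caller.
pub fn convert_map(null_count: usize, nullable: bool) -> Result<String, ConvertError> {
    let validity = nulls(null_count, nullable)?;
    Ok(format!("map array with validity {validity:?}"))
}
```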

Testing

Validated the 0.75.0 merge itself (this branch's head, based on spiceai-53 before #61/#62):

cargo build --workspace — clean against DataFusion 54.
cargo nextest run --workspace --no-fail-fast — 6135 passed, 0 failed, 522 skipped.

⚠️ The spiceai-54 reconciliation (folding in #61/#62) is not yet built or tested — that's the outstanding work above.

Reviewer note

The dependency tree currently resolves both datafusion 53.1.0 and 54.0.0; it compiles and tests clean, but worth confirming it's intentional vs. a stray transitive pin.

…a#8208) ## Summary This PR introduces `BitBufferView<'a>` and `BitBufferMutView<'a>` as borrowing analogues to the owned `BitBuffer` and `BitBufferMut` types. These new types enable zero-copy reading and in-place modification of packed bitsets without cloning the underlying `ByteBuffer`. This View will replace BitBuffer after the migration. --------- Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
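As a rough illustration of the borrowing idea, here is a standard-library sketch of a read-only and a mutable bit view over a byte slice. `BitView` and `BitViewMut` are hypothetical names, and the LSB-first bit order is an assumption; the real `BitBufferView` / `BitBufferMutView` APIs may differ.

```rust
/// Hypothetical borrowed, read-only view over packed bits.
/// Assumes LSB-first bit order within each byte.
#[derive(Clone, Copy)]
pub struct BitView<'a> {
    bytes: &'a [u8],
    len: usize,
}

impl<'a> BitView<'a> {
    pub fn new(bytes: &'a [u8], len: usize) -> Self {
        assert!(len <= bytes.len() * 8, "len exceeds backing storage");
        Self { bytes, len }
    }

    pub fn get(&self, i: usize) -> bool {
        assert!(i < self.len, "bit index out of bounds");
        (self.bytes[i / 8] >> (i % 8)) & 1 == 1
    }

    pub fn count_ones(&self) -> usize {
        (0..self.len).filter(|&i| self.get(i)).count()
    }
}

/// Hypothetical borrowed, mutable view: edits the caller's bytes in place.
pub struct BitViewMut<'a> {
    bytes: &'a mut [u8],
    len: usize,
}

impl<'a> BitViewMut<'a> {
    pub fn new(bytes: &'a mut [u8], len: usize) -> Self {
        assert!(len <= bytes.len() * 8, "len exceeds backing storage");
        Self { bytes, len }
    }

    pub fn set(&mut self, i: usize, value: bool) {
        assert!(i < self.len, "bit index out of bounds");
        let mask = 1u8 << (i % 8);
        if value {
            self.bytes[i / 8] |= mask;
        } else {
            self.bytes[i / 8] &= !mask;
        }
    }
}
```

Neither view owns or clones the bytes; the lifetime `'a` ties each view to the buffer it borrows from.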

## Summary Introduce `LayoutReaderContext`, a typed-data registry threaded through reader construction so an ancestor layout can publish values that descendant layouts look up by type at construction time. Breaking: Layout::new_reader and VTable::new_reader gain a &LayoutReaderContext parameter. Out-of-tree impls will get a compile error and must add it. --------- Signed-off-by: Onur Satici <onur@spiraldb.com>
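A typed-data registry of this kind can be sketched with `TypeId` and `Any` from the standard library. `ReaderContext`, `with`, `get`, and `ChunkOffsets` are hypothetical names chosen for illustration; the actual `LayoutReaderContext` API is not shown in this PR.

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical registry: at most one value per Rust type.
#[derive(Default, Clone)]
pub struct ReaderContext {
    values: HashMap<TypeId, Arc<dyn Any + Send + Sync>>,
}

impl ReaderContext {
    /// Returns a child context that also carries `value`.
    /// The parent context is left untouched.
    pub fn with<T: Any + Send + Sync>(&self, value: T) -> Self {
        let mut next = self.clone();
        next.values.insert(TypeId::of::<T>(), Arc::new(value));
        next
    }

    /// Looks a value up by its type.
    pub fn get<T: Any + Send + Sync>(&self) -> Option<&T> {
        self.values
            .get(&TypeId::of::<T>())
            .and_then(|v| v.downcast_ref::<T>())
    }
}

/// Hypothetical value an ancestor layout might publish for its descendants.
pub struct ChunkOffsets(pub Vec<u64>);
```

An ancestor would build a child context with `ctx.with(ChunkOffsets(...))` and pass it down; a descendant calls `ctx.get::<ChunkOffsets>()` at construction time.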

…ata#8072) ArcSwap is faster than a lock for reads. These sessions are mutable, but mutations are rare and retrievals are common. --------- Signed-off-by: Robert Kruszewski <github@robertk.io>

## Summary `MemorySession` is registered lazily. Because it implements Default, we don't need to register it with the session; the first call to `ctx.allocator()` takes the session dashmap shard lock and inserts the allocator. The problem arises inside an aggregate function, where we may already hold the same dashmap shard's lock, because the aggregate function registry is also read from the session. Session vars are keyed by type ID, and type IDs change on every build, so if the aggregate function registry and the memory session end up on the same dashmap shard, we deadlock when running an aggregate function. This fixes it with a band-aid: I am eagerly registering the memory session, because it is the only one we forgot to register, so we never write-lock the session after initialisation. I think the right solution is to not hold the read lock of the dashmap shard while executing kernels, which I will do in a follow-up. Signed-off-by: Onur Satici <onur@spiraldb.com>

…ortex-data#8222) ## Summary This PR has two changes: 1. Bumps the FSST dependency to include 0.5.11 which has faster `Compressor::rebuild_from`. 2. When executing an FSST array through slice/filter/take, we keep the compressor instead of creating new lazy instances, as part of a `FSSTSymbolTable` type. --------- Signed-off-by: Adam Gutglick <adam@spiraldb.com>

This can be very cheap with ListViews and ends up being very sad because we use a builder by default --------- Signed-off-by: Robert Kruszewski <github@robertk.io>

I think this is a bit messy, but I don't know how to make it better in the C API. This is being added to support positional deletes in Iceberg. Signed-off-by: Robert Kruszewski <github@robertk.io>

## Summary Seems to be the main difference between the impl we have and std, docs - https://doc.rust-lang.org/std/hint/fn.select_unpredictable.html Doesn't seem to make a difference in our benchmarks, probably worth further investigation. Signed-off-by: Adam Gutglick <adam@spiraldb.com>

Reuse them from parent layout. Signed-off-by: Mikhail Kot <mikhail@spiraldb.com>


## Summary I/O tasks are likely to produce a lot of tracing. With an explicit span on them, we can control it more precisely. ## API Changes No API changes. ## Testing No testing, this just allows consumers to configure tracing at a more granular level. @AdamGS Signed-off-by: Frederic Branczyk <fbranczyk@gmail.com>

## Summary Turns out we ran into the maximum github comment size, so time to make it smaller. Signed-off-by: Adam Gutglick <adam@spiraldb.com>

Avoid re-parsing flatbuffers when you re-access the same children Signed-off-by: Mikhail Kot <mikhail@spiraldb.com>

## Summary Optimizes the hot path for constructing binary views when the entire decoded heap fits within a single buffer. ### Changes **Optimized view construction** (`vortex-array/src/arrays/varbinview/build_views.rs`): - Fast path for the common case where the entire decoded heap fits in a single buffer - Eliminates per-element rollover checks and out-of-line `BinaryView::make_view` calls in the hot loop - Constructs reference views inline for long strings (>4 bytes) and inlined views for short strings - Reduces branch mispredictions and improves cache locality during view construction --------- Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk> Signed-off-by: Claude <noreply@anthropic.com> Signed-off-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Claude <noreply@anthropic.com>

## Summary Adds CUDA Arrow Device export for Vortex `ListViewArray` by exporting it as standard Arrow `List` for cuDF. - GPU export path for contiguous list-views using device-built `i32` Arrow offsets - GPU rebuild path for supported non-contiguous primitive list-views - Host fallback via CPU `ListViewArray` → `ListArray` rebuild when GPU export cannot handle the shape - CUB exclusive-scan wrapper used by the rebuild path - New CUDA kernels, tests, e2e coverage, and benchmarks for list-view export --------- Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

Re-creating interner IDs is an issue for random access Signed-off-by: Mikhail Kot <mikhail@spiraldb.com>

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

Wipe the API lock file, as we're not using them anymore. Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

## Summary Removes the assertion in the `nulls` function so that `from_arrow` errors instead of panicking if `nullable` does not match the actual nullability of the Arrow array. Also documents the `FromArrowArray` trait and method. ## Testing 2 basic unit tests. Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>

Fixes vortex-data#8189 Signed-off-by: "Nicholas Gates" <nick@nickgates.com>

…g masks pass them by reference (vortex-data#8221) At the same time the set methods can initialize the validity buffer from the passed mask instead of allocating and copying --------- Signed-off-by: Robert Kruszewski <github@robertk.io>

…ata#8213) ## Summary split registration only cared about the immediate accessed fields of the projection and filter expression. We do handle these nested expressions correctly at layout readers, but we did still register splits as if we are referencing all nested fields. so if we had a `select a.x, a.y`, before, we were registering splits for `Prefix(a)`, meaning all fields under a would have their splits registered, ending up with many more tasks than needed. Now we do pass `Prefix(a.y) | Prefix(a.x)` correctly to splits --------- Signed-off-by: Onur Satici <onur@spiraldb.com>

…ortex-data#8238) ## Summary Adds `bitpack_compare_sweep`, a benchmark exercising the **public** `array.binary(rhs, op)` compare-against-constant path over **all eight integer types** and **every valid bit width** (64Ki in-range elements per case, no patches, no out-of-range fast path). It isolates the `<BitPacked as CompareKernel>` unpack + per-element compare kernel.

## Summary zip kernels are generally slower than fill_null kernels, and `CASE WHEN is_null(x) THEN c ELSE x END` needs to resolve x twice, whereas `fill_null(x, c)` resolves x only once --------- Signed-off-by: Onur Satici <onur@spiraldb.com>

## Summary This PR introduces a new JSON extension type, with more work building on top of it planned in the future. This is a building block towards the variant-based compressor and similar work. ## API Changes Includes a minor change, unifying the two `EmptyMetadata` types we currently have into one. --------- Signed-off-by: Adam Gutglick <adam@spiraldb.com>

## Summary This PR builds on top of vortex-data#8125 to add support for ParquetVariant arrays, as a step towards variant support for Iceberg. --------- Signed-off-by: Adam Gutglick <adam@spiraldb.com>

This PR contains the following updates:

| Package | Change |
|---|---|
| [starlette](https://redirect.github.com/Kludex/starlette) ([changelog](https://starlette.dev/release-notes/)) | `0.52.1` → `1.0.1` |

---

> [!WARNING]
> Some dependencies could not be looked up. Check the [Dependency Dashboard](..vortex-data/issues/357) for more information.

---

### Starlette has missing Host header validation that poisons request.url.path, bypassing path-based security checks

[CVE-2026-48710](https://nvd.nist.gov/vuln/detail/CVE-2026-48710) / [GHSA-86qp-5c8j-p5mr](https://redirect.github.com/advisories/GHSA-86qp-5c8j-p5mr)

<details>
<summary>More information</summary>

#### Details

##### Summary

In affected versions, the HTTP `Host` request header was not validated before being used to reconstruct `request.url`. Because the routing algorithm relies on the raw HTTP path while `request.url` is rebuilt from the `Host` header, a malformed header could make `request.url.path` differ from the path that was actually requested. Middleware and endpoints that apply security restrictions based on `request.url` (rather than the raw `scope` path) could therefore be bypassed.

##### Details

When a client requests `http://example.com/foo`, it sends:

```http
GET /foo HTTP/1.1
Host: example.com
```

Affected versions reconstructed the URL by concatenating `http://{host}{path}` and re-parsing the result. The `Host` value is only valid as a `uri-host [ ":" port ]` per [RFC 9112 §3.2](https://www.rfc-editor.org/rfc/rfc9112.html#section-3.2-6), where `uri-host` follows the restricted `host` grammar of [RFC 3986 §3.2.2](https://www.rfc-editor.org/rfc/rfc3986.html#section-3.2.2). When it contains characters outside that grammar - notably `/`, `?`, or `#` - those characters move the path/query/fragment boundaries during re-parsing, so the parsed `request.url.path` no longer matches the path the server actually received. For example:

```http
GET /foo HTTP/1.1
Host: example.com/abc?bar=
```

reconstructs to `http://example.com/abc?bar=/foo`, whose parsed `path` is `/abc` - even though routing used the real path `/foo`. The router still dispatches to `/foo` and the endpoint executes, but any middleware or code that reads `request.url.path` sees `/abc`, so path-based authorization checks can be bypassed.

##### Impact

Any application running an affected version that relies on `request.url` (or `request.url.path`) for security-sensitive decisions is affected. The most common case is middleware that gates access to certain path prefixes based on `request.url.path`. Deployments fronted by a proxy or load balancer are mitigated only if that proxy rejects or normalizes the malformed `Host` header before forwarding and the application does not trust attacker-controlled host headers (e.g. `X-Forwarded-Host`) elsewhere.

##### Mitigation

Upgrade to a patched version, which validates the `Host` header against the grammar of [RFC 9112 §3.2](https://www.rfc-editor.org/rfc/rfc9112.html#section-3.2-6) / [RFC 3986 §3.2.2](https://www.rfc-editor.org/rfc/rfc3986.html#section-3.2.2) when constructing `request.url` and falls back to `scope["server"]` for malformed values.

#### Severity

- CVSS Score: 6.5 / 10 (Medium)
- Vector String: `CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:L/I:L/A:N`

#### References

- [https://github.com/Kludex/starlette/security/advisories/GHSA-86qp-5c8j-p5mr](https://redirect.github.com/Kludex/starlette/security/advisories/GHSA-86qp-5c8j-p5mr)
- [https://nvd.nist.gov/vuln/detail/CVE-2026-48710](https://nvd.nist.gov/vuln/detail/CVE-2026-48710)
- [https://github.com/Kludex/starlette/commit/764dab0dcfb9033d75442d7a359645c9f94648c6](https://redirect.github.com/Kludex/starlette/commit/764dab0dcfb9033d75442d7a359645c9f94648c6)
- [https://badhost.org](https://badhost.org)
- [https://github.com/pypa/advisory-database/tree/main/vulns/starlette/PYSEC-2026-161.yaml](https://redirect.github.com/pypa/advisory-database/tree/main/vulns/starlette/PYSEC-2026-161.yaml)
- [https://ostif.org/disclosing-the-badhost-vulnerability-in-starlette](https://ostif.org/disclosing-the-badhost-vulnerability-in-starlette)
- [https://www.secwest.net/starlette](https://www.secwest.net/starlette)
- [https://www.x41-dsec.de/lab/advisories/x41-2026-002-starlette](https://www.x41-dsec.de/lab/advisories/x41-2026-002-starlette)
- [https://github.com/advisories/GHSA-86qp-5c8j-p5mr](https://redirect.github.com/advisories/GHSA-86qp-5c8j-p5mr)

This data is provided by the [GitHub Advisory Database](https://redirect.github.com/advisories/GHSA-86qp-5c8j-p5mr) ([CC-BY 4.0](https://redirect.github.com/github/advisory-database/blob/main/LICENSE.md)).

</details>

---

### Release Notes

<details>
<summary>Kludex/starlette (starlette)</summary>

### [`v1.0.1`](https://redirect.github.com/Kludex/starlette/releases/tag/1.0.1): Version 1.0.1

[Compare Source](https://redirect.github.com/Kludex/starlette/compare/1.0.0...1.0.1)

#### What's Changed

- Ignore malformed `Host` header when constructing `request.url` by [@Kludex](https://redirect.github.com/Kludex) in [#3279](https://redirect.github.com/Kludex/starlette/pull/3279)

**Full Changelog**: <Kludex/starlette@1.0.0...1.0.1>

### [`v1.0.0`](https://redirect.github.com/Kludex/starlette/releases/tag/1.0.0): Version 1.0.0

[Compare Source](https://redirect.github.com/Kludex/starlette/compare/0.52.1...1.0.0)

Starlette 1.0 is here! 🎉

After nearly eight years since its creation, Starlette has reached its first stable release.

A special thank you to [@lovelydinosaur](https://redirect.github.com/lovelydinosaur), the creator of Starlette, Uvicorn, HTTPX and MkDocs, whose work helped to lay the foundation for the modern async Python ecosystem. 🙏

Thank you to [@adriangb](https://redirect.github.com/adriangb), [@graingert](https://redirect.github.com/graingert), [@agronholm](https://redirect.github.com/agronholm), [@florimondmanca](https://redirect.github.com/florimondmanca), [@aminalaee](https://redirect.github.com/aminalaee), [@tiangolo](https://redirect.github.com/tiangolo), [@alex-oleshkevich](https://redirect.github.com/alex-oleshkevich), [@abersheeran](https://redirect.github.com/abersheeran), and [@uSpike](https://redirect.github.com/uSpike) for helping make Starlette what it is today.

And to all my sponsors - especially [@tiangolo](https://redirect.github.com/tiangolo), [@huggingface](https://redirect.github.com/huggingface), and [@elevenlabs](https://redirect.github.com/elevenlabs) - thank you for your support!

Thank you to all [290+ contributors](https://redirect.github.com/encode/starlette/graphs/contributors) who have shaped Starlette over the years! ❤️

Read more on the [blog post](https://marcelotryle.com/blog/2026/03/22/starlette-10-is-here/).

Check out the full release notes at <https://www.starlette.io/release-notes/#100-march-22-2026>

**Full Changelog**: <Kludex/starlette@1.0.0rc1...1.0.0>

</details>

---

This PR was generated by [Mend Renovate](https://mend.io/renovate/).

Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>

## Summary grouped sum fallback should index the groups correctly when summing

…ranspose (vortex-data#8239) ## Summary Replaces the unpack-then-compare streaming kernel for compare-against-constant with the FastLanes fused `unpack_cmp`: - compare each value **as it is unpacked**, accumulating results straight into a transposed 1024-bit mask (`[u64; 16]`, one register-resident word per lane — no `[bool; 1024]`/`[T; 1024]` scratch), - a single SIMD `untranspose_bits` per block rotates the mask into logical row order, copied directly into the output bit buffer, - inline patches are spliced in afterwards; sliced (`offset != 0`) arrays fall back to the scalar streaming predicate. --------- Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk> Co-authored-by: Claude <noreply@anthropic.com>

## Summary This PR adds `#[inline]` hints to a collection of small, frequently-called functions across the codebase to improve performance. These are primarily simple wrapper methods, trait implementations, and utility functions that benefit from inlining to reduce function call overhead. These are all candidates for inlining because they are: 1. Small wrapper functions with minimal logic 2. Called frequently in hot paths (e.g., binary search, array access) 3. Generic or trait methods where inlining enables better monomorphization 4. Simple accessors and type checks

## Summary Adds import/export from `vortex-json` to Arrow's JSON canonical extension type. --------- Signed-off-by: Adam Gutglick <adam@spiraldb.com>

Convert the streaming list_view kernels (offsets check, rebuild init scan, offsets validation) and decimal_cast from per-thread-contiguous element ranges to block-stride loops so warp accesses stay coalesced. On GH200 the contiguous-offsets check on 10M lists drops from 718us to 80us (~1.4 TB/s, 9x) and the take-based rebuild path improves by 35%. The rebuild gather kernel keeps its per-list layout since its access pattern is data-dependent. Also enqueue the status and total-bytes device-to-host copies in the Arrow Binary export before awaiting either, so both readbacks complete in one stream round-trip instead of two. Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

## Summary Some minor touchups to the `FileSource` based DataFusion integration. The two substantial changes here are: 1. When reading files, if we didn't have the footer in the cache, make sure to insert it. That can happen when using `ListingTable` without stats inference, or when using `FileScanConfig` directly in another table provider. 2. On write - move the schema-to-dtype logic outside of the loop. It only needs to happen once and the `dtype` is cloned per write task. Signed-off-by: Adam Gutglick <adam@spiraldb.com>

If duckdb-vortex is supplied, don't attempt to download duckdb Signed-off-by: Mikhail Kot <mikhail@spiraldb.com>

…tex-data#8326) Flagged by the fuzzer. This likely doesn't happen in practice, but it is better to be defensive here in case we add complex partials. This is an alternative version of vortex-data#8302.

vortex-data#8369) ## Summary Closes: vortex-data#8366 `VortexFile::can_prune` stopped pruning for every `and`/`or` and `eq` predicate after vortex-data#7575 removed bottom-up constant-folding. This change makes `can_prune` mirror that read-out (execute to `Canonical`, take the row-0 scalar) and adds an end-to-end regression test covering bare, `and`/`or`, and `eq` predicates with non-falsifiable controls. --------- Signed-off-by: Thomas Santerre <thomas@santerre.xyz>

Adds `f64` cases to the `aggregate_sum` and `aggregate_grouped` benchmarks in `vortex-array`, to establish a baseline before changing float summation to Kahan (Neumaier) compensated summation. Signed-off-by: Dimitar Dimitrov <dimitar@spiraldb.com>
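For readers unfamiliar with the technique named above, here is a generic textbook sketch of Neumaier's variant of Kahan compensated summation. It is not the Vortex implementation; `neumaier_sum` is a hypothetical name.

```rust
/// Neumaier compensated summation: tracks the low-order bits lost by each
/// floating-point addition in a separate compensation term.
pub fn neumaier_sum(values: &[f64]) -> f64 {
    let mut sum = 0.0f64;
    let mut compensation = 0.0f64;
    for &v in values {
        let t = sum + v;
        if sum.abs() >= v.abs() {
            // Low-order digits of `v` were lost in `t`.
            compensation += (sum - t) + v;
        } else {
            // Low-order digits of `sum` were lost in `t`.
            compensation += (v - t) + sum;
        }
        sum = t;
    }
    // Apply the accumulated correction once at the end.
    sum + compensation
}
```

On the input `[1.0, 1e100, 1.0, -1e100]`, plain left-to-right summation returns `0.0` while this routine returns `2.0`. The extra work per element is why a baseline benchmark is useful before switching.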

…ortex-data#8365) `Mean::finalize_scalar` returned null when the count was zero, while the array `finalize` path computes `sum/count = 0/0 = NaN` for the same input. A mean over an all-null group therefore gave different results depending on which accumulator we're using. https://github.com/vortex-data/vortex/blob/90d743356722a8d5ca7e39053229654778bacf46/vortex-array/src/aggregate_fn/fns/mean/mod.rs#L85-L93 Since nulls are skipped during accumulation (as in standard SQL aggregation), an all-null input is an empty mean. Both paths now let the division produce NaN. Note this only matters for the count = 0 case: sum overflow still returns null. Signed-off-by: Dimitar Dimitrov <dimitar@spiraldb.com>
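The aligned behavior can be illustrated with a small sketch. `finalize_mean` is a hypothetical helper rather than the Vortex function, and `None` is used here to model a null sum after overflow.

```rust
/// Hypothetical finalize step for a mean accumulator.
/// `sum == None` models a sum that overflowed and became null.
pub fn finalize_mean(sum: Option<f64>, count: u64) -> Option<f64> {
    // No special case for `count == 0`: IEEE 754 division yields 0/0 = NaN,
    // matching the array `finalize` path described above.
    sum.map(|s| s / count as f64)
}
```

With this shape, an all-null group (sum `0.0`, count `0`) finalizes to NaN on both paths, while an overflowed sum still finalizes to null.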

Route export_device_array_with_schema through the exporter so schema derivation sees the same rebuilt host layout that gets exported. This fixes host ListView fallback cases where the Arrow C schema could describe a different child layout than the emitted array. Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

## Summary Closes: vortex-data#7830 (it was already closed but now it will be extra closed). See vortex-data#7830 (comment) for why we are removing this. To summarize: lossy compression / quantization doesn't really make sense in Vortex at all. Also removes the vector search benchmark that benchmarked this implementation of TurboQuant. ## API Changes Removes TurboQuant completely from the codebase. ## Testing N/A Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>

## Summary This didn't need to take length because it already gets it from the array children. ## API Changes Removes `len` parameter. ## Testing N/A Signed-off-by: Connor Tsui <connor.tsui20@gmail.com> Co-authored-by: Claude <noreply@anthropic.com>

## Summary We currently have a build step before sanitizer runs, but we don't actually build the tests in that step, so this PR adds that. This changes these tests' build times from 1:37 + 2:44 to 2:49, so about 90 seconds of savings for flows that seem to take around 7-8 minutes. Signed-off-by: Adam Gutglick <adam@spiraldb.com>

## Summary Install libc's debug symbols for codspeed runs, to get better symbolication for some of the code we end up calling. For example, for the noisy `chunked_bool_canonical_into` we currently get (note the unknown spans on the left): <img width="1275" height="231" alt="Screenshot 2026-06-12 at 09 03 39" src="https://github.com/user-attachments/assets/e97f57e0-467a-4d41-8438-63ebebe6bd27" /> With this change we get (note all the new blue spans, other colors are just noise): <img width="1282" height="228" alt="Screenshot 2026-06-12 at 09 04 22" src="https://github.com/user-attachments/assets/402c9be7-822c-442e-a825-327d0972d878" /> They include both memory allocation functions and, often more useful to us, SIMD instructions! On this PR's runs, with a quick look, I've found: - `_int_malloc` - `malloc_consolidate` - `__memset_avx2_unaligned_erms` - `__memcpy_avx_unaligned_erms` - `__memcmp_avx2_movbe` - `tcache_get_n` Signed-off-by: Adam Gutglick <adam@spiraldb.com>

## Summary Make sure to actually build rustdoc for all features and crates in the workspace, including private docs. This PR also includes fixes for every failure I've found. In some cases we had references to dead types/interfaces, I've tried to keep the intent as close as possible to original intent. --------- Signed-off-by: Adam Gutglick <adam@spiraldb.com>

## Summary Instead of running the whole workflow with a docker one-liner, run a containerized job. I think this is much easier to read and maintain. --------- Signed-off-by: Adam Gutglick <adam@spiraldb.com>

…rtex-data#8314) ## Summary aggregate functions to be able to do grouped aggregations before the fallback that slices group by group. Implements count for all arrays and sum for primitives. ## API Changes Grouped aggregate kernels now receive a single `GroupedArray` enum, covering `ListViewArray` and `FixedSizeListArray`, instead of exposing separate methods for each list representation --------- Signed-off-by: Onur Satici <onur@spiraldb.com>

## Summary It seems like the rate of tests hanging has gone up significantly recently, which is very wasteful. This config marks tests as slow at 30s, and times them out after 3 periods (AKA 90 seconds). As far as I'm aware there's only 1 test that approaches 1 minute (also raised its priority), so this change should not time out any normal run. --------- Signed-off-by: Adam Gutglick <adam@spiraldb.com> Co-authored-by: Joe Isaacs <joe.isaacs@live.co.uk>

## Summary Adds a dedicated primitive zip kernel that selects values branchlessly per row. The generic zip path copies runs of `if_true`/`if_false` between mask boundaries — fast for clustered masks but degrading to per-element work on fragmented masks. This kernel walks the mask as 64-bit chunks and blends both sides per row with no data-dependent branch, so the inner loop stays branch-free and auto-vectorizable regardless of mask shape. Result validity reuses the shared `zip_validity` helper, which expresses validity selection as a (lazy) zip over the two validity bitmaps. > The branchless boolean zip kernel (vortex-data#8275) this builds on has now merged into `develop`; this branch has been rebased on top of it, so the diff here is primitive-only.
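The per-row blend described above can be sketched as follows. This is a simplified illustration under assumptions: `zip_branchless` is a hypothetical name, the mask is a plain `&[u64]` with LSB-first bits, values are `i32`, and validity handling is omitted entirely.

```rust
/// Selects `if_true[i]` where mask bit `i` is set, otherwise `if_false[i]`.
/// The selection is done with bitwise arithmetic, so there is no
/// data-dependent branch per row.
pub fn zip_branchless(mask: &[u64], if_true: &[i32], if_false: &[i32]) -> Vec<i32> {
    assert_eq!(if_true.len(), if_false.len(), "inputs must have equal length");
    let len = if_true.len();
    assert!(mask.len() * 64 >= len, "mask too short");

    let mut out = Vec::with_capacity(len);
    let chunks = if_true.chunks(64).zip(if_false.chunks(64));
    for (chunk_idx, (t_chunk, f_chunk)) in chunks.enumerate() {
        // One 64-bit mask word covers up to 64 rows.
        let bits = mask[chunk_idx];
        for (bit, (&t, &f)) in t_chunk.iter().zip(f_chunk).enumerate() {
            // All ones when the bit is set, all zeros otherwise.
            let select = (((bits >> bit) & 1) as i32).wrapping_neg();
            out.push((t & select) | (f & !select));
        }
    }
    out
}
```

Because every row does the same work regardless of the mask value, the cost does not depend on how fragmented the mask is, which is the property the commit message emphasizes.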

) ## Summary Remove the new and unused benchmark from the repo. Also took the opportunity to clean up some other dependencies. --------- Signed-off-by: Adam Gutglick <adam@spiraldb.com>


…ata#8383) ## Summary Inspired by @myrrc's #8634, this change makes a few subtle changes to our `dev` and `ci` build profiles. The main change here is removing debug symbols from `vortex-fastlanes`, which for a full build of the main package (`cargo build -p vortex --all-features --all-targets`) reduces build time by about 33% (from 90 seconds to under 60). The others are: 1. For the `ci` profile, enable `debug_assertions` and remove the explicit `strip`, which seems redundant and doesn't seem to make a measurable difference in build times. 2. Remove `vortex-bench` as a dev dependency for `vortex`, which is just weird and unused. Signed-off-by: Adam Gutglick <adam@spiraldb.com>

Add `vx_cuda_session_new` so C FFI callers can create a CUDA-enabled Vortex session once and reuse it across Arrow Device exports. Document the CUDA export `sync_event` lifetime and guard the Arrow C Device definitions in `vortex_cuda.h` with `ARROW_C_DEVICE_DATA_INTERFACE`, preserving `USE_OWN_ARROW_DEVICE` as an opt-out. Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

Cleanups from a review of the Arrow device export path, no behavior change: --------- Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

…function (vortex-data#8372) ## Summary This PR adds a native point type to `vortex-geo`. Points are by far the most common geometry in analytical datasets, and a columnar representation makes their coordinates directly accessible without parsing WKB. It also adds the scalar function: point-to-point distance with PostGIS `ST_Distance` semantics (planar/Euclidean, results in CRS units). ## API Changes Adds to `vortex-geo`, all registered through `vortex_geo::initialize`: - Extension type `Point` (`vortex.geo.point`): a location stored as `Struct<x, y, z?, m?>` of non-nullable `f64`, where `z?` is an optional elevation and `m?` an optional measure. - `Coordinate`: the internal value a point scalar unpacks to. - Scalar function `GeoDistance` (`vortex.geo.distance`): per-row distance between two equal-length point columns; either or both operands may be constant, in which case the query point is decoded once and broadcast. ## Testing Unit tests cover dtype validation for every GeoArrow dimension (and rejection of invalid storage), round-tripping a point column through scalar execution back to the original coordinates, WKT display for all four dimensions, and distance over all operand shapes: column-to-constant (either side), column-to-column, and constant-to-constant. --- Supersedes vortex-data#8342 (same change, moved from my fork to an in-repo branch). --------- Signed-off-by: Nemo Yu <zyu379@wisc.edu> Signed-off-by: Nemo Yu <zhenghong@spiraldb.com> Signed-off-by: Nemo Yu <83347615+HarukiMoriarty@users.noreply.github.com> Signed-off-by: "Nemo Yu" <zhenghong@spiraldb.com> Co-authored-by: Joe Isaacs <joe.isaacs@live.co.uk>
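A sketch of the distance semantics described above, under stated assumptions: `Coordinate`, `PointOperand`, and `geo_distance` here are hypothetical stand-ins, only `x` and `y` are modeled, and the constant case is handled by simply reusing one decoded point for every row.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Coordinate {
    pub x: f64,
    pub y: f64,
}

/// Either a full column of points or a single constant point to broadcast.
pub enum PointOperand<'a> {
    Column(&'a [Coordinate]),
    Constant(Coordinate),
}

impl PointOperand<'_> {
    fn at(&self, row: usize) -> Coordinate {
        match self {
            PointOperand::Column(points) => points[row],
            PointOperand::Constant(point) => *point,
        }
    }
}

/// Planar (Euclidean) distance, in the units of the coordinate system.
pub fn planar_distance(a: Coordinate, b: Coordinate) -> f64 {
    (a.x - b.x).hypot(a.y - b.y)
}

/// Per-row distance between two operands of logical length `len`.
pub fn geo_distance(lhs: &PointOperand, rhs: &PointOperand, len: usize) -> Vec<f64> {
    (0..len)
        .map(|row| planar_distance(lhs.at(row), rhs.at(row)))
        .collect()
}
```

Column-to-constant, column-to-column, and constant-to-constant all go through the same loop; only the operand variant changes.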

Upgrade the spiceai-53 fork to upstream Vortex 0.75.0 (DataFusion 53 -> 54).

Conflicts resolved (vortex-datafusion):
- convert/exprs.rs: adopt upstream 0.75.0 get_field pushdown semantics. can_scalar_fn_be_pushed_down now checks only that the scalar fn is a GetFieldFunc, matching upstream, and the pushdown predicate gains the new is_dynamic_physical_expr guard with the direct downcast_ref API. The fork's stricter "all args must be pushable" variant was dropped: under DF54's Utf8View migration it forced nested get_field over struct columns into the DataFusion-side leftover projection, where evaluating get_field over the Utf8View scan output mismatched the planned Utf8 type and failed upstream's nested-projection / schema-evolution pushdown tests.
- persistent/sink.rs: keep the fork's direct multi-file VortexSink writer (tasks return Vec<(Path, WriteSummary)>, honoring target_file_size and the numbered-path extension), using the intact writer_dtype helper which already wraps the 0.75.0 from_arrow_schema API upstream inlined.

Post-merge build fix:
- vortex-array arrow ArrowMapArray conversion: propagate the now-fallible nulls() result with `?` (0.75.0 changed nulls() to return VortexResult<Validity>); aligns the fork's arrow-map support with the other call sites.

Verified locally:
- cargo build --workspace: clean (DataFusion 54).
- cargo nextest run --workspace --no-fail-fast: 6135 passed, 0 failed, 522 skipped.

Signed-off-by: Luke Kim <80174+lukekim@users.noreply.github.com>

Copilot

Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

joseph-isaacs and others added 30 commits June 2, 2026 13:50

Add ArcSwapMap and use it throughout our session registries (vortex-d…

81184e7

…ata#8072) ArcSwap is faster than a lock for reads. These sessions are mutable, but mutations are rare and retrievals are common. --------- Signed-off-by: Robert Kruszewski <github@robertk.io>

Implement ZipKernel for ListViewArray (vortex-data#8218)

81046d7

This can be very cheap with ListViews and ends up being very sad because we use a builder by default --------- Signed-off-by: Robert Kruszewski <github@robertk.io>

Add RoaringBitmap support to vortex-jni bindings (vortex-data#8220)

667e1d7

I think this is a bit messy, but I don't know how to make it better in the C API. This is being added to support positional deletes in Iceberg. Signed-off-by: Robert Kruszewski <github@robertk.io>

Don't recalculate chunk offsets for ChunkedReader (vortex-data#8231)

7a53ad5

Reuse them from parent layout. Signed-off-by: Mikhail Kot <mikhail@spiraldb.com>

Remove the full analysis from benchmark comments (vortex-data#8233)

66335d4

## Summary Turns out we ran into the maximum GitHub comment size, so time to make it smaller. Signed-off-by: Adam Gutglick <adam@spiraldb.com>

ViewedLayoutChildren child layout cache (vortex-data#8234)

9daf90f

Avoid re-parsing flatbuffers when you re-access the same children Signed-off-by: Mikhail Kot <mikhail@spiraldb.com>

CachedIds for functions (vortex-data#8240)

1552135

Re-creating interner IDs is an issue for random access Signed-off-by: Mikhail Kot <mikhail@spiraldb.com>

feat[gpu]: arrow device array decimal export (vortex-data#8155)

a958108

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

clean: drop unused API lock-file (vortex-data#8241)

a26c0d6

Wipe the API lock file, as we're not using them anymore. Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

Fix zoned min/max over fixed-size-list stats (vortex-data#8243)

e724fe7

Fixes vortex-data#8189 Signed-off-by: "Nicholas Gates" <nick@nickgates.com>

Support ParquetVariant through JNI (vortex-data#8129)

340d7be

## Summary This PR builds on top of vortex-data#8125 to add support for ParquetVariant arrays, as a step towards variant support for Iceberg. --------- Signed-off-by: Adam Gutglick <adam@spiraldb.com>

fix grouped sum index for listview array (vortex-data#8255)

583b003

## Summary grouped sum fallback should index the groups correctly when summing

fix: [profile.bench] codegen-units = 16 (vortex-data#8257)

6bd4a4c

AdamGS and others added 24 commits June 11, 2026 14:17

Add arrow import/export for vortex-json (vortex-data#8339)

729e17c

## Summary Adds import/export from `vortex-json` to Arrow's JSON canonical extension type. --------- Signed-off-by: Adam Gutglick <adam@spiraldb.com>

Use DUCKDB_SOURCE_DIR if supplied (vortex-data#8363)

82b24cd

If duckdb-vortex is supplied, don't attempt to download duckdb Signed-off-by: Mikhail Kot <mikhail@spiraldb.com>

Fix generation of stat pruning expression for unsupported dtypes (vor…

0dd63f0

…tex-data#8326) Flagged by the fuzzer. This likely doesn't happen in practice, but it is better to be defensive here in case we add complex partials. This is an alternative version of vortex-data#8302.

Containerize the musl tests (vortex-data#8387)

a3c5f8c

## Summary Instead of running the whole workflow with a docker one-liner, run a containerized job. I think this is much easier to read and maintain. --------- Signed-off-by: Adam Gutglick <adam@spiraldb.com>

Remove the unused website and clean other dependencies (vortex-data#8362

ab0e23e

) ## Summary Remove the new and unused benchmark from the repo. Also took the opportunity to clean up some other dependencies. --------- Signed-off-by: Adam Gutglick <adam@spiraldb.com>

chore: clean up Arrow device export (vortex-data#8359)

f67b594

Cleanups from a review of the Arrow device export path, no behavior change: --------- Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>

Copilot AI review requested due to automatic review settings June 12, 2026 22:51

lukekim added the changelog/chore label Jun 12, 2026

Copilot AI reviewed Jun 12, 2026

lukekim changed the base branch from spiceai-53 to spiceai-54 June 12, 2026 23:10

lukekim changed the title ~~Merge upstream Vortex 0.75.0 into spiceai-53 (DataFusion 53 → 54)~~ Merge upstream Vortex 0.75.0 into spiceai-54 (DataFusion 53 → 54) Jun 12, 2026

sgrebnov approved these changes Jun 13, 2026

Merge upstream Vortex 0.75.0 into spiceai-54 (DataFusion 53 → 54)#65
lukekim wants to merge 116 commits into
spiceai-54from
lukim/spiceai-53-vortex-0.75.0

16 participants
