
Merge upstream Vortex 0.75.0 into spiceai-54 (DataFusion 53 → 54)#65

Open
lukekim wants to merge 116 commits into spiceai-54 from lukim/spiceai-53-vortex-0.75.0

@lukekim lukekim commented Jun 12, 2026


What

Establishes spiceai-54 as the DataFusion 54 base by bringing the upstream Vortex 0.75.0 upgrade onto it. This is a 2-parent merge of the 0.75.0 tag (115 upstream commits, 658 files) onto the fork. The headline upstream change is DataFusion 53 → 54 (incl. the Arrow Utf8View/BinaryView migration).

spiceai-54 was forked off spiceai-53, which in the meantime gained the merged perf PRs #61 (per-ExecutionCtx ArrayKernels snapshot) and #62 (intra-file split sub-division) — see the reconciliation note below.

⚠️ Status — reconciliation required before merge

GitHub shows this PR as MERGEABLE (no textual conflicts), but the auto-merge will not compile as-is. spiceai-54 includes #61/#62 in their DF53 forms, which clash semantically with 0.75.0's refactor:

The fix is already prepared: the DF54-compatible forms of #61/#62 are the upstream develop ports (#8401 added load_full to ArcSwapMap + adapted snapshot(); #8400 is the 0.75.0-compatible subdivision). Reconciliation = merge spiceai-54 into this branch and apply those forms, then rebuild + retest.

Conflicts resolved (the 0.75.0 merge itself)

  • vortex-datafusion/convert/exprs.rs — adopted upstream 0.75.0's get_field pushdown semantics (can_scalar_fn_be_pushed_down checks only for GetFieldFunc) plus the new is_dynamic_physical_expr guard and direct downcast_ref API. The fork's stricter "all args must be pushable" variant was dropped — see Behavior note.
  • vortex-datafusion/persistent/sink.rs — kept the fork's direct multi-file VortexSink writer (tasks return Vec<(Path, WriteSummary)>, honoring target_file_size + numbered-path extension), via the intact writer_dtype helper.

Behavior note

Under DF54's Utf8View migration, the fork's stricter can_scalar_fn_be_pushed_down forced nested get_field over struct columns into the DataFusion-side leftover projection, where evaluating it over the Utf8View scan output mismatched the planned Utf8 type — failing 5 upstream nested-projection / schema-evolution pushdown tests. Realigning to upstream's design (which pushes get_field to Vortex and reconciles types via calculate_physical_schema) fixes all 5.

Post-merge build fix

  • vortex-array arrow ArrowMapArray — propagate the now-fallible nulls() result with ? (0.75.0 changed nulls() to return VortexResult<Validity>).
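The shape of that build fix can be sketched as follows. This is a minimal illustration, not the fork's code: `Validity`, `VortexError`, `nulls` and `convert_map_validity` below are simplified, hypothetical stand-ins, and only the `?` propagation mirrors the change described above.

```rust
// Hypothetical, simplified stand-ins for the real Vortex types.
#[derive(Debug, PartialEq)]
enum Validity {
    NonNullable,
    AllValid,
    SomeInvalid(usize),
}

#[derive(Debug, PartialEq)]
struct VortexError(String);
type VortexResult<T> = Result<T, VortexError>;

// A fallible nulls(): a mismatch between the declared nullability and the
// actual null count is reported as an error instead of a panic.
fn nulls(null_count: usize, nullable: bool) -> VortexResult<Validity> {
    match (nullable, null_count) {
        (false, 0) => Ok(Validity::NonNullable),
        (false, n) => Err(VortexError(format!("{n} nulls in a non-nullable array"))),
        (true, 0) => Ok(Validity::AllValid),
        (true, n) => Ok(Validity::SomeInvalid(n)),
    }
}

// The call site now propagates the error with `?` rather than assuming success.
fn convert_map_validity(null_count: usize, nullable: bool) -> VortexResult<Validity> {
    let validity = nulls(null_count, nullable)?;
    Ok(validity)
}
```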

Testing

Validated the 0.75.0 merge itself (this branch's head, based on spiceai-53 before #61/#62):

  • cargo build --workspace — clean against DataFusion 54.
  • cargo nextest run --workspace --no-fail-fast: 6135 passed, 0 failed, 522 skipped.

⚠️ The spiceai-54 reconciliation (folding in #61/#62) is not yet built or tested — that's the outstanding work above.

Reviewer note

The dependency tree currently resolves both datafusion 53.1.0 and 54.0.0; it compiles and tests clean, but worth confirming it's intentional vs. a stray transitive pin.

joseph-isaacs and others added 30 commits June 2, 2026 13:50
…a#8208)

## Summary

This PR introduces `BitBufferView<'a>` and `BitBufferMutView<'a>` as
borrowing analogues to the owned `BitBuffer` and `BitBufferMut` types.
These new types enable zero-copy reading and in-place modification of
packed bitsets without cloning the underlying `ByteBuffer`.

This View will replace BitBuffer after the migration.

---------

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
## Summary

Introduce `LayoutReaderContext`, a typed-data registry threaded through
reader construction so an ancestor layout can publish values that
descendant layouts look up by type at construction time.

Breaking: `Layout::new_reader` and `VTable::new_reader` gain a
`&LayoutReaderContext` parameter. Out-of-tree impls will get a compile
error and must add it.
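
A type-keyed registry of this kind might look roughly like the sketch below. It is an assumption-laden illustration using only the standard library: `ReaderContext`, `with`, `get` and `RowCount` are hypothetical names, not the actual `LayoutReaderContext` API.

```rust
use std::any::{Any, TypeId};
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical sketch of a type-keyed registry; not the real LayoutReaderContext.
#[derive(Default, Clone)]
struct ReaderContext {
    values: HashMap<TypeId, Arc<dyn Any + Send + Sync>>,
}

impl ReaderContext {
    // An ancestor publishes a value. A new context is returned so that the
    // parent context (and any sibling readers) are unaffected.
    fn with<T: Any + Send + Sync>(&self, value: T) -> Self {
        let mut next = self.clone();
        next.values.insert(TypeId::of::<T>(), Arc::new(value));
        next
    }

    // A descendant looks the value up by its type at construction time.
    fn get<T: Any + Send + Sync>(&self) -> Option<Arc<T>> {
        self.values
            .get(&TypeId::of::<T>())
            .cloned()
            .and_then(|v| v.downcast::<T>().ok())
    }
}

// Example payload an ancestor layout could publish (hypothetical).
struct RowCount(u64);
```

Returning a fresh context from `with` is one possible design; it keeps publication scoped to the subtree below the publishing layout.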

---------

Signed-off-by: Onur Satici <onur@spiraldb.com>
…ata#8072)

ArcSwap is faster than a lock for reads. These sessions are mutable, but
mutations are rare and retrievals are common.

---------

Signed-off-by: Robert Kruszewski <github@robertk.io>
## Summary

`MemorySession` is registered lazily. Because it implements `Default`, we
don't need to register it with the session, and the first call to
`ctx.allocator()` takes the session dashmap shard lock and inserts
the allocator. The problem is that inside an aggregate function we are
already potentially holding the same dashmap shard's lock, because the
aggregate function registry is also read from the session.

Because session vars are keyed by type id and type ids change on every
build, if the aggregate function registry and the memory session end up
on the same dashmap shard, we deadlock when running an aggregate
function.

This fixes it with a band-aid: I am eagerly registering the memory session
because that is the only one we forgot to register, so we never
write-lock the session after initialisation. I think the right
solution is to not hold the read lock of the dashmap shard while
executing kernels, which I will do in a follow-up.

Signed-off-by: Onur Satici <onur@spiraldb.com>
…ortex-data#8222)

## Summary

This PR has two changes:
1. Bumps the FSST dependency to include 0.5.11 which has faster
`Compressor::rebuild_from`.
2. When executing an FSST array through slice/filter/take, we keep the
compressor instead of creating new lazy instances, as part of a
`FSSTSymbolTable` type.

---------

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
This can be very cheap with ListViews and ends up being very sad because
we use a builder by default.

---------

Signed-off-by: Robert Kruszewski <github@robertk.io>
I think this is a bit messy but I don't know how to make it better in
the C API

This is being added to support positional deletes in iceberg

Signed-off-by: Robert Kruszewski <github@robertk.io>
## Summary

Seems to be the main difference between the impl we have and std, docs -
https://doc.rust-lang.org/std/hint/fn.select_unpredictable.html

Doesn't seem to make a difference in our benchmarks, probably worth
further investigation.

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Reuse them from parent layout.

Signed-off-by: Mikhail Kot <mikhail@spiraldb.com>
## Summary

I/O tasks are likely to cause a lot of tracing to be produced. With an
explicit span on them, we can control it more precisely.

## API Changes

No API changes.

## Testing

No testing, this just allows consumers to configure tracing at a more
granular level.

@AdamGS

Signed-off-by: Frederic Branczyk <fbranczyk@gmail.com>
## Summary

Turns out we ran into the maximum github comment size, so time to make
it smaller.

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Avoid re-parsing flatbuffers when you re-access the same children

Signed-off-by: Mikhail Kot <mikhail@spiraldb.com>
## Summary

Optimizes the hot path for constructing binary views when the entire
decoded heap fits within a single buffer.

### Changes

**Optimized view construction**
(`vortex-array/src/arrays/varbinview/build_views.rs`):
- Fast path for the common case where the entire decoded heap fits in a
single buffer
- Eliminates per-element rollover checks and out-of-line
`BinaryView::make_view` calls in the hot loop
- Constructs reference views inline for long strings (>4 bytes) and
inlined views for short strings
- Reduces branch mispredictions and improves cache locality during view
construction

---------

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Signed-off-by: Claude <noreply@anthropic.com>
Signed-off-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Claude <noreply@anthropic.com>
## Summary

Adds CUDA Arrow Device export for Vortex `ListViewArray` by exporting it
as standard Arrow `List` for cuDF.

- GPU export path for contiguous list-views using device-built `i32`
Arrow offsets
- GPU rebuild path for supported non-contiguous primitive list-views
- Host fallback via CPU `ListViewArray` → `ListArray` rebuild when GPU
export cannot handle the shape
- CUB exclusive-scan wrapper used by the rebuild path
- New CUDA kernels, tests, e2e coverage, and benchmarks for list-view
export

---------

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Re-creating interner IDs is an issue for random access

Signed-off-by: Mikhail Kot <mikhail@spiraldb.com>
Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Wipe the API lock file, as we're not using them anymore.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
## Summary

Removes the assertion in the `nulls` function so that `from_arrow`
errors instead of panicking if `nullable` does not match the
actual nullability of the arrow array.

Also documents the `FromArrowArray` trait and method.

## Testing

2 basic unit tests.

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Fixes vortex-data#8189

Signed-off-by: "Nicholas Gates" <nick@nickgates.com>
…g masks pass them by reference (vortex-data#8221)

At the same time the set methods can initialize the validity buffer from
the passed mask instead of allocating and copying.

---------

Signed-off-by: Robert Kruszewski <github@robertk.io>
…ata#8213)

## Summary

Split registration only cared about the immediately accessed fields of the
projection and filter expressions. We do handle these nested expressions
correctly in the layout readers, but we still registered splits as if we
were referencing all nested fields.

So for `select a.x, a.y`, we previously registered splits for
`Prefix(a)`, meaning all fields under `a` had their splits
registered, producing many more tasks than needed. Now we correctly pass
`Prefix(a.y) | Prefix(a.x)` to splits.

---------

Signed-off-by: Onur Satici <onur@spiraldb.com>
…ortex-data#8238)

## Summary

Adds `bitpack_compare_sweep`, a benchmark exercising the **public**
`array.binary(rhs, op)` compare-against-constant path over **all eight
integer types** and **every valid bit width** (64Ki in-range elements
per case, no patches, no out-of-range fast path). It isolates the
`<BitPacked as CompareKernel>` unpack + per-element compare kernel.
## Summary

Zip kernels are generally slower than fill_null kernels, and `CASE WHEN
is_null(x) THEN c ELSE x END` needs to resolve x twice, whereas
`fill_null(x, c)` resolves x only once.
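
The equivalence behind this rewrite can be illustrated with a small sketch over `Option<i64>` slices. The function names are hypothetical and the real kernels operate on Vortex arrays; this only shows that both forms produce the same values while the `CASE WHEN` form reads `x` twice.

```rust
// fill_null(x, c): a single pass over x.
fn fill_null(x: &[Option<i64>], c: i64) -> Vec<i64> {
    x.iter().map(|v| v.unwrap_or(c)).collect()
}

// CASE WHEN is_null(x) THEN c ELSE x END, modelled as mask + zip:
// x is read once to build the mask and a second time to pick values.
fn case_when_zip(x: &[Option<i64>], c: i64) -> Vec<i64> {
    let mask: Vec<bool> = x.iter().map(|v| v.is_none()).collect(); // first read of x
    mask.iter()
        .zip(x) // second read of x
        .map(|(&is_null, v)| if is_null { c } else { (*v).unwrap_or(c) })
        .collect()
}
```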

---------

Signed-off-by: Onur Satici <onur@spiraldb.com>
## Summary

This PR introduces a new JSON extension type, with more work building on
top of it planned in the future. This is a building block towards the
variant-based compressor and similar work.

## API Changes

Includes a minor change, unifying the two `EmptyMetadata` types we
currently have into one.

---------

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
## Summary

This PR builds on top of vortex-data#8125 to add support for ParquetVariant arrays,
as a step towards variant support for Iceberg.

---------

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
This PR contains the following updates:

| Package | Change | [Age](https://docs.renovatebot.com/merge-confidence/) | [Confidence](https://docs.renovatebot.com/merge-confidence/) |
|---|---|---|---|
| [starlette](https://redirect.github.com/Kludex/starlette) ([changelog](https://starlette.dev/release-notes/)) | `0.52.1` → `1.0.1` | ![age](https://developer.mend.io/api/mc/badges/age/pypi/starlette/1.0.1?slim=true) | ![confidence](https://developer.mend.io/api/mc/badges/confidence/pypi/starlette/0.52.1/1.0.1?slim=true) |

---

> [!WARNING]
> Some dependencies could not be looked up. Check the [Dependency
Dashboard](..vortex-data/issues/357) for more information.

---

### Starlette has missing Host header validation that poisons
request.url.path, bypassing path-based security checks
[CVE-2026-48710](https://nvd.nist.gov/vuln/detail/CVE-2026-48710) /
[GHSA-86qp-5c8j-p5mr](https://redirect.github.com/advisories/GHSA-86qp-5c8j-p5mr)

<details>
<summary>More information</summary>

#### Details
##### Summary
In affected versions, the HTTP `Host` request header was not validated
before being used to reconstruct `request.url`. Because the routing
algorithm relies on the raw HTTP path while `request.url` is rebuilt
from the `Host` header, a malformed header could make `request.url.path`
differ from the path that was actually requested. Middleware and
endpoints that apply security restrictions based on `request.url`
(rather than the raw `scope` path) could therefore be bypassed.

##### Details
When a client requests `http://example.com/foo`, it sends:

```http
GET /foo HTTP/1.1
Host: example.com
```

Affected versions reconstructed the URL by concatenating
`http://{host}{path}` and re-parsing the result. The `Host` value is
only valid as a `uri-host [ ":" port ]` per [RFC 9112
§3.2](https://www.rfc-editor.org/rfc/rfc9112.html#section-3.2-6), where
`uri-host` follows the restricted `host` grammar of [RFC 3986
§3.2.2](https://www.rfc-editor.org/rfc/rfc3986.html#section-3.2.2). When
it contains characters outside that grammar - notably `/`, `?`, or `#` -
those characters move the path/query/fragment boundaries during
re-parsing, so the parsed `request.url.path` no longer matches the path
the server actually received. For example:

```http
GET /foo HTTP/1.1
Host: example.com/abc?bar=
```

reconstructs to `http://example.com/abc?bar=/foo`, whose parsed `path`
is `/abc` - even though routing used the real path `/foo`. The router
still dispatches to `/foo` and the endpoint executes, but any middleware
or code that reads `request.url.path` sees `/abc`, so path-based
authorization checks can be bypassed.
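
The mismatch can be reproduced in miniature with the sketch below. It is a deliberately simplified model written in Rust, not Starlette's (Python) code: `reparsed_path` is a hypothetical helper that rebuilds `http://{host}{path}` and re-parses it with a naive splitter.

```rust
// Simplified model of the bug: rebuild the URL as http://{host}{path},
// then re-parse it. Only the idea is shown, not Starlette's implementation.
fn reparsed_path(host: &str, raw_path: &str) -> String {
    let url = format!("http://{host}{raw_path}");
    let rest = &url["http://".len()..];
    // The authority ends at the first '/', '?' or '#'.
    let auth_end = rest
        .find(|c: char| matches!(c, '/' | '?' | '#'))
        .unwrap_or(rest.len());
    let after = &rest[auth_end..];
    // The path ends at the first '?' or '#'.
    let path_end = after
        .find(|c: char| matches!(c, '?' | '#'))
        .unwrap_or(after.len());
    after[..path_end].to_string()
}
```

With a well-formed `Host` the re-parsed path equals the raw path; with `Host: example.com/abc?bar=` the re-parsed path becomes `/abc` while the raw path is still `/foo`.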

##### Impact
Any application running an affected version that relies on `request.url`
(or `request.url.path`) for security-sensitive decisions is affected.
The most common case is middleware that gates access to certain path
prefixes based on `request.url.path`. Deployments fronted by a proxy or
load balancer are mitigated only if that proxy rejects or normalizes the
malformed `Host` header before forwarding and the application does not
trust attacker-controlled host headers (e.g. `X-Forwarded-Host`)
elsewhere.

##### Mitigation
Upgrade to a patched version, which validates the `Host` header against
the grammar of [RFC 9112
§3.2](https://www.rfc-editor.org/rfc/rfc9112.html#section-3.2-6) / [RFC
3986 §3.2.2](https://www.rfc-editor.org/rfc/rfc3986.html#section-3.2.2)
when constructing `request.url` and falls back to `scope["server"]` for
malformed values.

#### Severity
- CVSS Score: 6.5 / 10 (Medium)
- Vector String: `CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:L/I:L/A:N`

#### References
-
[https://github.com/Kludex/starlette/security/advisories/GHSA-86qp-5c8j-p5mr](https://redirect.github.com/Kludex/starlette/security/advisories/GHSA-86qp-5c8j-p5mr)
-
[https://nvd.nist.gov/vuln/detail/CVE-2026-48710](https://nvd.nist.gov/vuln/detail/CVE-2026-48710)
-
[https://github.com/Kludex/starlette/commit/764dab0dcfb9033d75442d7a359645c9f94648c6](https://redirect.github.com/Kludex/starlette/commit/764dab0dcfb9033d75442d7a359645c9f94648c6)
- [https://badhost.org](https://badhost.org)
-
[https://github.com/pypa/advisory-database/tree/main/vulns/starlette/PYSEC-2026-161.yaml](https://redirect.github.com/pypa/advisory-database/tree/main/vulns/starlette/PYSEC-2026-161.yaml)
-
[https://ostif.org/disclosing-the-badhost-vulnerability-in-starlette](https://ostif.org/disclosing-the-badhost-vulnerability-in-starlette)
- [https://www.secwest.net/starlette](https://www.secwest.net/starlette)
-
[https://www.x41-dsec.de/lab/advisories/x41-2026-002-starlette](https://www.x41-dsec.de/lab/advisories/x41-2026-002-starlette)
-
[https://github.com/advisories/GHSA-86qp-5c8j-p5mr](https://redirect.github.com/advisories/GHSA-86qp-5c8j-p5mr)

This data is provided by the [GitHub Advisory
Database](https://redirect.github.com/advisories/GHSA-86qp-5c8j-p5mr)
([CC-BY
4.0](https://redirect.github.com/github/advisory-database/blob/main/LICENSE.md)).
</details>

---

### Release Notes

<details>
<summary>Kludex/starlette (starlette)</summary>

###
[`v1.0.1`](https://redirect.github.com/Kludex/starlette/releases/tag/1.0.1):
Version 1.0.1

[Compare
Source](https://redirect.github.com/Kludex/starlette/compare/1.0.0...1.0.1)

#### What's Changed

- Ignore malformed `Host` header when constructing `request.url` by
[@Kludex](https://redirect.github.com/Kludex) in
[#3279](https://redirect.github.com/Kludex/starlette/pull/3279)

**Full Changelog**:
<Kludex/starlette@1.0.0...1.0.1>

###
[`v1.0.0`](https://redirect.github.com/Kludex/starlette/releases/tag/1.0.0):
Version 1.0.0

[Compare
Source](https://redirect.github.com/Kludex/starlette/compare/0.52.1...1.0.0)

Starlette 1.0 is here! 🎉

After nearly eight years since its creation, Starlette has reached its
first stable release.

A special thank you to
[@lovelydinosaur](https://redirect.github.com/lovelydinosaur),
the creator of Starlette, Uvicorn, HTTPX and MkDocs, whose work helped
to lay the foundation for the modern async Python ecosystem. 🙏

Thank you to [@adriangb](https://redirect.github.com/adriangb),
[@graingert](https://redirect.github.com/graingert),
[@agronholm](https://redirect.github.com/agronholm),
[@florimondmanca](https://redirect.github.com/florimondmanca),
[@aminalaee](https://redirect.github.com/aminalaee),
[@tiangolo](https://redirect.github.com/tiangolo),
[@alex-oleshkevich](https://redirect.github.com/alex-oleshkevich),
[@abersheeran](https://redirect.github.com/abersheeran), and
[@uSpike](https://redirect.github.com/uSpike) for helping make
Starlette what it is today. And to all my sponsors - especially
[@tiangolo](https://redirect.github.com/tiangolo),
[@huggingface](https://redirect.github.com/huggingface), and
[@elevenlabs](https://redirect.github.com/elevenlabs) - thank you
for your support!

Thank you to all [290+
contributors](https://redirect.github.com/encode/starlette/graphs/contributors)
who have shaped Starlette over the years! ❤️

Read more on the [blog
post](https://marcelotryle.com/blog/2026/03/22/starlette-10-is-here/).

Check out the full release notes at
<https://www.starlette.io/release-notes/#&#8203;100-march-22-2026>

***

**Full Changelog**:
<Kludex/starlette@1.0.0rc1...1.0.0>

</details>

---

### Configuration

📅 **Schedule**: (UTC)

- Branch creation
  - At any time (no schedule defined)
- Automerge
  - At any time (no schedule defined)

🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---

- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box

---

This PR was generated by [Mend Renovate](https://mend.io/renovate/).
View the [repository job
log](https://developer.mend.io/github/vortex-data/vortex).


Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
## Summary

grouped sum fallback should index the groups correctly when summing
…ranspose (vortex-data#8239)

## Summary


Replaces the unpack-then-compare streaming kernel for
compare-against-constant with the FastLanes fused `unpack_cmp`:

- compare each value **as it is unpacked**, accumulating results
straight into a transposed 1024-bit mask (`[u64; 16]`, one
register-resident word per lane — no `[bool; 1024]`/`[T; 1024]`
scratch),
- a single SIMD `untranspose_bits` per block rotates the mask into
logical row order, copied directly into the output bit buffer,
- inline patches are spliced in afterwards; sliced (`offset != 0`)
arrays fall back to the scalar streaming predicate.

---------

Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
Co-authored-by: Claude <noreply@anthropic.com>
## Summary

This PR adds `#[inline]` hints to a collection of small,
frequently-called functions across the codebase to improve performance.
These are primarily simple wrapper methods, trait implementations, and
utility functions that benefit from inlining to reduce function call
overhead.

These are all candidates for inlining because they are:
1. Small wrapper functions with minimal logic
2. Called frequently in hot paths (e.g., binary search, array access)
3. Generic or trait methods where inlining enables better
monomorphization
4. Simple accessors and type checks
AdamGS and others added 24 commits June 11, 2026 14:17
## Summary

Adds import/export from `vortex-json` to Arrow's JSON canonical
extension type.

---------

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Convert the streaming list_view kernels (offsets check, rebuild init
scan, offsets validation) and decimal_cast from per-thread-contiguous
element ranges to block-stride loops so warp accesses stay coalesced. On
GH200 the contiguous-offsets check on 10M lists drops from 718us to 80us
(~1.4 TB/s, 9x) and the take-based rebuild path improves by 35%. The
rebuild gather kernel keeps its per-list layout since its access pattern
is data-dependent.

Also enqueue the status and total-bytes device-to-host copies in the
Arrow Binary export before awaiting either, so both readbacks complete
in one stream round-trip instead of two.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
## Summary

Some minor touchups to the `FileSource` based DataFusion integration.

The two substantial changes here are:
1. When reading files, if we didn't have the footer in the cache, make
sure to insert it. That can happen when using `ListingTable` without
stats inference, or when using `FileScanConfig` directly in another
table provider.
2. On write - move the schema-to-dtype logic outside of the loop. It
only needs to happen once and the `dtype` is cloned per write task.

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
If duckdb-vortex is supplied, don't attempt to download duckdb

Signed-off-by: Mikhail Kot <mikhail@spiraldb.com>
…tex-data#8326)

Flagged by the fuzzer and likely doesn't happen in practice but better
to be defensive here in case we add complex partials

this is alternative version of vortex-data#8302
vortex-data#8369)

## Summary

Closes: vortex-data#8366

`VortexFile::can_prune` stopped pruning for every `and`/`or` and `eq`
predicate after vortex-data#7575 removed bottom-up constant-folding. 

This change makes `can_prune` mirror that
read-out (execute to `Canonical`, take the row-0 scalar) and adds an
end-to-end regression test covering bare, `and`/`or`, and `eq`
predicates with non-falsifiable controls.


---------

Signed-off-by: Thomas Santerre <thomas@santerre.xyz>
Adds `f64` cases to the `aggregate_sum` and `aggregate_grouped`
benchmarks in `vortex-array`, to establish a baseline before changing
float summation to Kahan (Neumaier) compensated summation.
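
For context, Neumaier's variant of Kahan summation can be sketched as below. This is the textbook algorithm, not the planned Vortex implementation (which this commit does not include); `neumaier_sum` and `naive_sum` are illustrative names.

```rust
// Neumaier's variant of Kahan summation: keeps a running compensation term
// for the low-order bits that plain floating-point addition drops.
fn neumaier_sum(values: &[f64]) -> f64 {
    let mut sum = 0.0_f64;
    let mut comp = 0.0_f64; // accumulated rounding error
    for &x in values {
        let t = sum + x;
        if sum.abs() >= x.abs() {
            comp += (sum - t) + x; // low-order bits of x were lost
        } else {
            comp += (x - t) + sum; // low-order bits of sum were lost
        }
        sum = t;
    }
    sum + comp
}

// Plain left-to-right summation, for comparison.
fn naive_sum(values: &[f64]) -> f64 {
    values.iter().sum()
}
```

On the classic input `[1.0, 1e100, 1.0, -1e100]` the naive sum loses both ones and yields `0.0`, while the compensated sum recovers `2.0`.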

Signed-off-by: Dimitar Dimitrov <dimitar@spiraldb.com>
…ortex-data#8365)

`Mean::finalize_scalar` returned null when the count was zero, while the
array `finalize` path computes `sum/count = 0/0 = NaN` for the same
input. A mean over an all-null group therefore gave different results
depending on which accumulator we're using.


https://github.com/vortex-data/vortex/blob/90d743356722a8d5ca7e39053229654778bacf46/vortex-array/src/aggregate_fn/fns/mean/mod.rs#L85-L93

Since nulls are skipped during accumulation (as in standard SQL
aggregation), an all-null input is an empty mean. Both paths now let the
division produce NaN.

Note this only matters for the count = 0 case: sum overflow still
returns null.
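
A minimal sketch of the aligned behaviour, assuming a hypothetical `(sum, count)` accumulator state; the real `Mean` partial lives in `vortex-array`, and the sum-overflow-to-null case mentioned above is not modelled here.

```rust
// Hypothetical finalize step: no special case for count == 0, so the
// division itself yields NaN (0.0 / 0.0 under IEEE 754).
fn finalize_mean(sum: f64, count: u64) -> f64 {
    sum / count as f64
}

// Nulls are skipped during accumulation, so an all-null input ends with
// sum = 0.0 and count = 0.
fn mean(values: &[Option<f64>]) -> f64 {
    let (sum, count) = values
        .iter()
        .flatten()
        .fold((0.0_f64, 0_u64), |(s, n), v| (s + v, n + 1));
    finalize_mean(sum, count)
}
```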

Signed-off-by: Dimitar Dimitrov <dimitar@spiraldb.com>
Route export_device_array_with_schema through the exporter so schema
derivation sees the same rebuilt host layout that gets exported. This
fixes host ListView fallback cases where the Arrow C schema could
describe a different child layout than the emitted array.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
## Summary

Closes: vortex-data#7830 (it was
already closed but now it will be extra closed).

See
vortex-data#7830 (comment)
for why we are removing this. To summarize: lossy compression /
quantization doesn't really make sense in Vortex at all.

Also removes the vector search benchmark that benchmarked this
implementation of TurboQuant.

## API Changes

Removes TurboQuant completely from the codebase.

## Testing

N/A

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
## Summary

This didn't need to take length because it already gets it from the
array children.

## API Changes

Removes `len` parameter.

## Testing

N/A

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
## Summary

We currently have a build step before sanitizer runs, but we don't
actually build the tests in that step, so this PR adds that. This changes
these tests' build times from 1:37 + 2:44 to 2:49, so about 90 seconds of
savings for flows that seem to take around 7-8 minutes.

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
## Summary

Install libc's debug symbols for codspeed runs, to get better
symbolication for some of the code we end up calling.

For example, for the noisy `chunked_bool_canonical_into` we currently
get (note the unknown spans on the left):
<img width="1275" height="231" alt="Screenshot 2026-06-12 at 09 03 39"
src="https://github.com/user-attachments/assets/e97f57e0-467a-4d41-8438-63ebebe6bd27"
/>

With this change we get (note all the new blue spans, other colors are
just noise):
<img width="1282" height="228" alt="Screenshot 2026-06-12 at 09 04 22"
src="https://github.com/user-attachments/assets/402c9be7-822c-442e-a825-327d0972d878"
/>

They include both memory allocation functions and, often more useful to
us, SIMD instructions! On this PR's runs, with a quick look I've found:
- `_int_malloc`
- `malloc_consolidate`
- `__memset_avx2_unaligned_erms`
- `__memcpy_avx_unaligned_erms`
- `__memcmp_avx2_movbe`
- `tcache_get_n`

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
## Summary

Make sure to actually build rustdoc for all features and crates in the
workspace, including private docs.

This PR also includes fixes for every failure I've found. In some cases
we had references to dead types/interfaces, I've tried to keep the
intent as close as possible to original intent.

---------

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
## Summary

Instead of running the whole workflow with a docker one-liner, run a
containerized job. I think this is much easier to read and maintain.

---------

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
…rtex-data#8314)

## Summary

Allows aggregate functions to do grouped aggregations before the
fallback that slices group by group.

Implements count for all arrays and sum for primitives.



## API Changes

Grouped aggregate kernels now receive a single `GroupedArray` enum,
covering `ListViewArray` and `FixedSizeListArray`, instead of exposing
separate methods for each list representation

---------

Signed-off-by: Onur Satici <onur@spiraldb.com>
## Summary

It seems like the rate of tests hanging has gone up significantly
recently which is very wasteful. This config marks tests as slow at 30s,
and times them out after 3 periods (AKA 90 seconds).

As far as I'm aware there's only 1 test that approaches 1 minute (I also
raised its priority), so this change shouldn't time out any normal run.

---------

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Co-authored-by: Joe Isaacs <joe.isaacs@live.co.uk>
## Summary

Adds a dedicated primitive zip kernel that selects values branchlessly
per row.

The generic zip path copies runs of `if_true`/`if_false` between mask
boundaries — fast for clustered masks but degrading to per-element work
on fragmented masks. This kernel walks the mask as 64-bit chunks and
blends both sides per row with no data-dependent branch, so the inner
loop stays branch-free and auto-vectorizable regardless of mask shape.
Result validity reuses the shared `zip_validity` helper, which expresses
validity selection as a (lazy) zip over the two validity bitmaps.
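
The per-row blend described above can be sketched as follows. This is an illustration under assumptions, not the Vortex kernel: the mask is taken to be packed LSB-first into `u64` words, values are `u32`, and `zip_branchless` is a hypothetical name. Validity handling is omitted.

```rust
// Branchless zip over u32 values. A set mask bit selects `if_true`,
// a cleared bit selects `if_false`. There is no branch on the mask value.
fn zip_branchless(mask: &[u64], if_true: &[u32], if_false: &[u32]) -> Vec<u32> {
    assert_eq!(if_true.len(), if_false.len());
    let len = if_true.len();
    assert!(mask.len() * 64 >= len, "mask too short for the inputs");

    let mut out = Vec::with_capacity(len);
    for (chunk_idx, &word) in mask.iter().enumerate() {
        let start = chunk_idx * 64;
        let end = (start + 64).min(len);
        // If the mask has spare trailing words, start >= end and the loop is empty.
        for i in start..end {
            let bit = ((word >> (i - start)) & 1) as u32;
            // 0 becomes 0x0000_0000, 1 becomes 0xFFFF_FFFF.
            let select = bit.wrapping_neg();
            out.push((if_true[i] & select) | (if_false[i] & !select));
        }
    }
    out
}
```

Because the inner loop body is the same arithmetic for every row, its cost does not depend on how fragmented the mask is.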

> The branchless boolean zip kernel (vortex-data#8275) this builds on has now
merged into `develop`; this branch has been rebased on top of it, so the
diff here is primitive-only.
)

## Summary

Remove the new and unused benchmark from the repo. Also took the
opportunity to clean up some other dependencies.

---------

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
…ata#8383)

## Summary

Inspired by @myrrc's #8634, this change makes a few subtle changes to
our `dev` and `ci` build profiles. The main change here is removing
debug symbols from `vortex-fastlanes`, which for a full build of the
main package (`cargo build -p vortex --all-features --all-targets`)
reduces build time by about 33% (from 90 seconds to under 60).

The others are:
1. For the `ci` profile, enables `debug_assertions` and removes the
explicit `strip`, which seems redundant and doesn't seem to make a
measurable difference in build times.
2. Remove `vortex-bench` as a dev dependency for `vortex`, which is just
weird and unused.

Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Add `vx_cuda_session_new` so C FFI callers can create a CUDA-enabled
Vortex session once and reuse it across Arrow Device exports.

Document the CUDA export `sync_event` lifetime and guard the Arrow C
Device definitions in `vortex_cuda.h` with
`ARROW_C_DEVICE_DATA_INTERFACE`, preserving `USE_OWN_ARROW_DEVICE` as an
opt-out.

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Cleanups from a review of the Arrow device export path, no behavior
change:

---------

Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
…function (vortex-data#8372)

## Summary

This PR adds a native point type to `vortex-geo`. Points are by far the
most common geometry in analytical datasets, and a columnar
representation makes their coordinates directly accessible without
parsing WKB.

It also adds the scalar function: point-to-point distance with PostGIS
`ST_Distance` semantics (planar/Euclidean, results in CRS units).

## API Changes

Adds to `vortex-geo`, all registered through `vortex_geo::initialize`:

- Extension type `Point` (`vortex.geo.point`): a location stored as
`Struct<x, y, z?, m?>` of non-nullable `f64`, where `z?` is an optional
elevation and `m?` an optional measure.
- `Coordinate`: the internal value a point scalar unpacks to.
- Scalar function `GeoDistance` (`vortex.geo.distance`): per-row
distance between two equal-length point columns; either or both operands
may be constant, in which case the query point is decoded once and
broadcast.
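
The column-to-constant case can be sketched as below. This is a simplified 2D illustration with hypothetical names (`Point`, `distance`, `distance_to_constant`); the actual `vortex-geo` types and the optional `z`/`m` fields are not modelled.

```rust
// Planar (Euclidean) point-to-point distance, in the units of the CRS.
#[derive(Clone, Copy)]
struct Point {
    x: f64,
    y: f64,
}

fn distance(a: Point, b: Point) -> f64 {
    let (dx, dy) = (a.x - b.x, a.y - b.y);
    (dx * dx + dy * dy).sqrt()
}

// Column-to-constant: the constant query point is decoded once by the
// caller and reused for every row of the column.
fn distance_to_constant(column: &[Point], query: Point) -> Vec<f64> {
    column.iter().map(|&p| distance(p, query)).collect()
}
```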

## Testing

Unit tests cover dtype validation for every GeoArrow dimension (and
rejection of invalid storage), round-tripping a point column through
scalar execution back to the original coordinates, WKT display for all
four dimensions, and distance over all operand shapes:
column-to-constant (either side), column-to-column, and
constant-to-constant.

---
Supersedes vortex-data#8342 (same change, moved from my fork to an in-repo branch).

---------

Signed-off-by: Nemo Yu <zyu379@wisc.edu>
Signed-off-by: Nemo Yu <zhenghong@spiraldb.com>
Signed-off-by: Nemo Yu <83347615+HarukiMoriarty@users.noreply.github.com>
Signed-off-by: "Nemo Yu" <zhenghong@spiraldb.com>
Co-authored-by: Joe Isaacs <joe.isaacs@live.co.uk>
Upgrade the spiceai-53 fork to upstream Vortex 0.75.0 (DataFusion 53 -> 54).

Conflicts resolved (vortex-datafusion):
- convert/exprs.rs: adopt upstream 0.75.0 get_field pushdown semantics.
  can_scalar_fn_be_pushed_down now checks only that the scalar fn is a
  GetFieldFunc, matching upstream, and the pushdown predicate gains the new
  is_dynamic_physical_expr guard with the direct downcast_ref API. The fork's
  stricter "all args must be pushable" variant was dropped: under DF54's
  Utf8View migration it forced nested get_field over struct columns into the
  DataFusion-side leftover projection, where evaluating get_field over the
  Utf8View scan output mismatched the planned Utf8 type and failed upstream's
  nested-projection / schema-evolution pushdown tests.
- persistent/sink.rs: keep the fork's direct multi-file VortexSink writer
  (tasks return Vec<(Path, WriteSummary)>, honoring target_file_size and the
  numbered-path extension), using the intact writer_dtype helper which already
  wraps the 0.75.0 from_arrow_schema API upstream inlined.

Post-merge build fix:
- vortex-array arrow ArrowMapArray conversion: propagate the now-fallible
  nulls() result with `?` (0.75.0 changed nulls() to return
  VortexResult<Validity>); aligns the fork's arrow-map support with the other
  call sites.

Verified locally:
- cargo build --workspace: clean (DataFusion 54).
- cargo nextest run --workspace --no-fail-fast: 6135 passed, 0 failed,
  522 skipped.

Signed-off-by: Luke Kim <80174+lukekim@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 12, 2026 22:51

Copilot AI left a comment


Copilot wasn't able to review this pull request because it exceeds the maximum number of files (300). Try reducing the number of changed files and requesting a review from Copilot again.

@lukekim lukekim changed the base branch from spiceai-53 to spiceai-54 June 12, 2026 23:10
@lukekim lukekim changed the title Merge upstream Vortex 0.75.0 into spiceai-53 (DataFusion 53 → 54) Merge upstream Vortex 0.75.0 into spiceai-54 (DataFusion 53 → 54) Jun 12, 2026