Support reporting statistics in spark datasource by robert3005 · Pull Request #8057 · vortex-data/vortex

robert3005 · 2026-05-22T09:55:26Z

Spark mostly focuses on sizeInBytes, which we populate from file sizes, scaled to
account for column pruning. We also report numRows since that is available in our datasource.

codspeed-hq · 2026-05-22T10:05:34Z

Merging this PR will not alter performance

⚠️

Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 3 improved benchmarks
❌ 1 regressed benchmark
✅ 1528 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| | Mode | Benchmark | `BASE` | `HEAD` | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | `bitwise_not_vortex_buffer_mut[128]` | 215.3 ns | 244.4 ns | -11.93% |
| ⚡ | WallTime | `cuda/bitpacked_u8/unpack/3bw[100M]` | 352.4 µs | 301.6 µs | +16.84% |
| ⚡ | Simulation | `chunked_varbinview_canonical_into[(100, 100)]` | 308.5 µs | 273.5 µs | +12.78% |
| ⚡ | Simulation | `chunked_varbinview_into_canonical[(100, 100)]` | 361.4 µs | 326.2 µs | +10.81% |

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

Comparing rk/sparkstats (b7c05bd) with develop (0dd6db7)

Signed-off-by: Robert Kruszewski <github@robertk.io>

a10y · 2026-06-12T15:53:03Z

+        double scaled = SQLConf.get().fileCompressionFactor()
+                * fileBytes.getAsLong()
+                / tableDefaultSize
+                * readSchema.defaultSize();
+        return OptionalLong.of((long) scaled);


What is this doing?

This mirrors the logic Spark uses to estimate the size of a relation: it scales the file size by the ratio of the read schema's default size to the file schema's default size. Basically, it tries to take column pruning into account.
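To make the arithmetic concrete, here is a minimal standalone sketch of the same scaling, not the code in this PR. The class name, method name, and parameters are hypothetical: `compressionFactor` stands in for `SQLConf.get().fileCompressionFactor()`, and the two default sizes stand in for the `defaultSize()` of the file schema and of the pruned read schema.

```java
import java.util.OptionalLong;

final class ScaledSizeSketch {
    // Hypothetical helper: estimates the bytes a scan reads after column pruning.
    // Returns empty when the file size is unknown or the file schema size is not positive.
    static OptionalLong estimateSizeInBytes(
            OptionalLong fileBytes,
            double compressionFactor,
            int tableDefaultSize,
            int readDefaultSize) {
        if (!fileBytes.isPresent() || tableDefaultSize <= 0) {
            return OptionalLong.empty();
        }
        // Same order of operations as the snippet under review:
        // factor * bytes / fileSchemaSize * readSchemaSize
        double scaled = compressionFactor
                * fileBytes.getAsLong()
                / tableDefaultSize
                * readDefaultSize;
        return OptionalLong.of((long) scaled);
    }

    public static void main(String[] args) {
        // 1,000,000 bytes on disk, file schema default size 40, read schema default size 12.
        OptionalLong estimate =
                estimateSizeInBytes(OptionalLong.of(1_000_000L), 1.0, 40, 12);
        System.out.println(estimate.getAsLong()); // prints 300000
    }
}
```

With a factor of 1.0, reading columns that account for 12 of the 40 default-size bytes per row reports 30% of the on-disk size. The guard against a non-positive file schema size is an addition in this sketch; whether the PR needs one depends on code not shown here.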

a10y · 2026-06-12T15:53:25Z

+        public Map<NamedReference, ColumnStatistics> columnStats() {
+            return Map.of();
+        }


In the future, this is where we'd report the file stats, right?

Yes, but this doesn't happen anywhere in Spark right now, and you'd have to read all the footers here to produce this value, which might be expensive.
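The trade-off described in this reply can be sketched as follows: table-level totals are summed from metadata that is already at hand, and per-column statistics stay empty so that no footers are read. This is an illustration only. `FileInfo` and `ScanStats` are hypothetical stand-ins, not Spark or Vortex types, and the sketch assumes per-file sizes and row counts are available without opening any file.

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalLong;

final class StatsSketch {
    // Hypothetical per-file metadata, assumed to be known without reading footers.
    static final class FileInfo {
        final long sizeInBytes;
        final long rowCount;

        FileInfo(long sizeInBytes, long rowCount) {
            this.sizeInBytes = sizeInBytes;
            this.rowCount = rowCount;
        }
    }

    // Hypothetical result holder with the three values discussed in this thread.
    static final class ScanStats {
        final OptionalLong sizeInBytes;
        final OptionalLong numRows;
        final Map<String, Object> columnStats;

        ScanStats(OptionalLong sizeInBytes, OptionalLong numRows, Map<String, Object> columnStats) {
            this.sizeInBytes = sizeInBytes;
            this.numRows = numRows;
            this.columnStats = columnStats;
        }
    }

    // Sums cheap table-level totals and leaves column statistics empty.
    static ScanStats estimate(List<FileInfo> files) {
        long bytes = 0;
        long rows = 0;
        for (FileInfo file : files) {
            bytes += file.sizeInBytes;
            rows += file.rowCount;
        }
        return new ScanStats(OptionalLong.of(bytes), OptionalLong.of(rows), Map.of());
    }
}
```

Filling `columnStats` would need per-column values from every file, which is the footer-reading cost mentioned above.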

robert3005 force-pushed the rk/sparkstats branch 3 times, most recently from e97797d to 4d9b080 Compare May 28, 2026 01:02

robert3005 added the changelog/feature A new feature label May 28, 2026

robert3005 force-pushed the rk/sparkstats branch from 4d9b080 to d29ccf5 Compare June 2, 2026 10:11

robert3005 force-pushed the rk/sparkstats branch from d29ccf5 to deb1dc0 Compare June 9, 2026 21:56

robert3005 requested a review from a team June 9, 2026 21:56

Let spark report statistics

b7c05bd

Signed-off-by: Robert Kruszewski <github@robertk.io>

robert3005 force-pushed the rk/sparkstats branch from deb1dc0 to b7c05bd Compare June 11, 2026 14:15

robert3005 requested review from AdamGS and a10y June 12, 2026 13:39

a10y reviewed Jun 12, 2026

robert3005 wants to merge 1 commit into develop from rk/sparkstats

2 participants
