Support reporting statistics in spark datasource #8057

Open
robert3005 wants to merge 1 commit into develop from rk/sparkstats
@robert3005

Contributor

Spark mostly focuses on sizeInBytes, which we populate from file sizes with
scaling. We also report numRows, since that is already available in our datasource.

@codspeed-hq

codspeed-hq Bot commented May 22, 2026


Merging this PR will not alter performance

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 3 improved benchmarks
❌ 1 regressed benchmark
✅ 1528 untouched benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | bitwise_not_vortex_buffer_mut[128] | 215.3 ns | 244.4 ns | -11.93% |
| WallTime | cuda/bitpacked_u8/unpack/3bw[100M] | 352.4 µs | 301.6 µs | +16.84% |
| Simulation | chunked_varbinview_canonical_into[(100, 100)] | 308.5 µs | 273.5 µs | +12.78% |
| Simulation | chunked_varbinview_into_canonical[(100, 100)] | 361.4 µs | 326.2 µs | +10.81% |

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing rk/sparkstats (b7c05bd) with develop (0dd6db7)

robert3005 force-pushed the rk/sparkstats branch 3 times, most recently from e97797d to 4d9b080 (May 28, 2026 01:02)
robert3005 added the changelog/feature (A new feature) label (May 28, 2026)
robert3005 requested a review from a team (June 9, 2026 21:56)
Signed-off-by: Robert Kruszewski <github@robertk.io>
Comment on lines +162 to +166
double scaled = SQLConf.get().fileCompressionFactor()
        * fileBytes.getAsLong()
        / tableDefaultSize
        * readSchema.defaultSize();
return OptionalLong.of((long) scaled);

Contributor

what is this doing?

Contributor Author

This follows the logic Spark uses to size a relation: it scales the file size by the ratio of the read schema's default size to the file schema's default size. Basically, it tries to take column pruning into consideration.
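
As a rough illustration of that scaling, here is a minimal, Spark-free sketch. The class name `ScaledSizeEstimate`, the method `estimate`, its parameter names, and the sample numbers are all hypothetical; in the PR the inputs come from `SQLConf.get().fileCompressionFactor()`, the file sizes, and the schemas' `defaultSize()`. The guard for a missing file size or a zero table width is an assumption added for the sketch, not something shown in the diff.

```java
import java.util.OptionalLong;

public final class ScaledSizeEstimate {
    private ScaledSizeEstimate() {}

    /**
     * Scales on-disk bytes by a compression factor and by the fraction of the
     * table's default row width that the read schema covers.
     * All names here are illustrative, not Spark or datasource API.
     */
    public static OptionalLong estimate(
            double compressionFactor,
            OptionalLong fileBytes,
            int tableDefaultSize,
            int readDefaultSize) {
        // Assumption: with no known file size, or a zero-width table schema,
        // report no estimate instead of dividing by zero.
        if (!fileBytes.isPresent() || tableDefaultSize <= 0) {
            return OptionalLong.empty();
        }
        // Same evaluation order as the snippet under review; everything is
        // computed in double because the compression factor is a double.
        double scaled = compressionFactor
                * fileBytes.getAsLong()
                / tableDefaultSize
                * readDefaultSize;
        return OptionalLong.of((long) scaled);
    }

    public static void main(String[] args) {
        // Hypothetical numbers: 1000 bytes on disk, a 40-byte default row
        // width for the full table, and a 10-byte width after column pruning.
        System.out.println(estimate(1.0, OptionalLong.of(1_000L), 40, 10));
    }
}
```

With these made-up inputs the read schema covers a quarter of the row width, so the estimate is a quarter of the file size (250 bytes).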

Comment on lines +172 to +174
public Map<NamedReference, ColumnStatistics> columnStats() {
    return Map.of();
}

Contributor

In the future, this is where we'd report the file stats, right?

Contributor Author

Yes, but this doesn't happen anywhere in Spark right now, and you'd have to read all the footers here to produce this value, which might be expensive.
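
To make the cost concern concrete, here is a hedged sketch of what table-level column statistics would involve: one footer read per file, followed by a merge. `ColumnStatsMerge`, `FooterStats`, and the min/max/null-count fields are hypothetical stand-ins, not the datasource's footer format or Spark's `ColumnStatistics` API.

```java
import java.util.Arrays;
import java.util.List;

public final class ColumnStatsMerge {
    private ColumnStatsMerge() {}

    /** Hypothetical per-file stats for one column, as a footer might record them. */
    public static final class FooterStats {
        public final long min;
        public final long max;
        public final long nullCount;

        public FooterStats(long min, long max, long nullCount) {
            this.min = min;
            this.max = max;
            this.nullCount = nullCount;
        }
    }

    /**
     * Merges per-file stats into table-level stats. The merge itself is cheap;
     * the expensive part is that the caller needs one footer per file first.
     */
    public static FooterStats merge(List<FooterStats> perFile) {
        if (perFile.isEmpty()) {
            throw new IllegalArgumentException("need at least one file");
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        long nulls = 0L;
        for (FooterStats s : perFile) {
            min = Math.min(min, s.min);
            max = Math.max(max, s.max);
            nulls += s.nullCount;
        }
        return new FooterStats(min, max, nulls);
    }

    public static void main(String[] args) {
        // Two made-up files; in practice each entry costs a footer read.
        FooterStats merged = merge(Arrays.asList(
                new FooterStats(1, 5, 0),
                new FooterStats(-3, 4, 2)));
        System.out.println(merged.min + " " + merged.max + " " + merged.nullCount);
    }
}
```

The work grows linearly with the number of files, which is why returning an empty map is a reasonable default until planning can afford those reads.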
