Support reporting statistics in spark datasource #8057
Merging this PR will not alter performance
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ❌ | Simulation | bitwise_not_vortex_buffer_mut[128] | 215.3 ns | 244.4 ns | -11.93% |
| ⚡ | WallTime | cuda/bitpacked_u8/unpack/3bw[100M] | 352.4 µs | 301.6 µs | +16.84% |
| ⚡ | Simulation | chunked_varbinview_canonical_into[(100, 100)] | 308.5 µs | 273.5 µs | +12.78% |
| ⚡ | Simulation | chunked_varbinview_into_canonical[(100, 100)] | 361.4 µs | 326.2 µs | +10.81% |
Comparing rk/sparkstats (b7c05bd) with develop (0dd6db7)
e97797d to 4d9b080
Signed-off-by: Robert Kruszewski <github@robertk.io>
deb1dc0 to b7c05bd
double scaled = SQLConf.get().fileCompressionFactor()
        * fileBytes.getAsLong()
        / tableDefaultSize
        * readSchema.defaultSize();
return OptionalLong.of((long) scaled);
This takes the logic Spark uses for the size of a relation: it scales the file size by the ratio of the read schema's default size to the file schema's default size. Basically, it tries to take column pruning into consideration.
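As a rough illustration of the scaling described above, here is a minimal, self-contained sketch. The class name, the method name, and the plain `double`/`long` parameters are hypothetical stand-ins for `SQLConf.get().fileCompressionFactor()`, `tableDefaultSize`, and `readSchema.defaultSize()`; the guard for a missing file size or a non-positive table size is an assumption and may not match what the PR does.

```java
import java.util.OptionalLong;

public class ScaledSizeSketch {
    /**
     * Scales the on-disk byte count by the ratio of the read schema's default
     * size to the full table schema's default size, so that a pruned read
     * reports a proportionally smaller estimate.
     */
    public static OptionalLong estimateSizeInBytes(
            OptionalLong fileBytes,
            double compressionFactor,
            long tableDefaultSize,
            long readDefaultSize) {
        // Assumption: with no known file size, or a degenerate table schema,
        // report no estimate rather than dividing by zero.
        if (!fileBytes.isPresent() || tableDefaultSize <= 0) {
            return OptionalLong.empty();
        }
        double scaled = compressionFactor
                * fileBytes.getAsLong()
                / tableDefaultSize
                * readDefaultSize;
        return OptionalLong.of((long) scaled);
    }

    public static void main(String[] args) {
        // 1000 bytes on disk, factor 1.0, full schema 40 bytes per row,
        // read schema 8 bytes per row: 1000 / 40 * 8 = 200.
        System.out.println(
                estimateSizeInBytes(OptionalLong.of(1_000), 1.0, 40, 8).getAsLong()); // prints 200
    }
}
```

Reading one fifth of the schema by default size yields one fifth of the file size, which is the column-pruning effect the comment describes.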
public Map<NamedReference, ColumnStatistics> columnStats() {
    return Map.of();
}
In the future, this is where we'd report the file stats, right?
Yes, but this doesn't happen anywhere in Spark right now, and you'd have to read all the footers here to produce this value, which might be bad.
Spark mostly focuses on sizeInBytes, which we populate from file sizes with
scaling. We also report numRows, since that exists in our datasource.
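To make the shape of the reported statistics concrete, here is a hedged sketch that does not depend on Spark. `FileInfo` and `Stats` are hypothetical local types, not classes from this PR, and summing per-file byte and row counts is an assumption about how the totals are produced. Column statistics stay empty, as in the diff, so no footers need to be read; the `sizeInBytes` scaling is left out to keep the example short.

```java
import java.util.List;
import java.util.Map;
import java.util.OptionalLong;

public class StatsSketch {
    /** Hypothetical per-file metadata: size on disk and row count. */
    public static final class FileInfo {
        final long bytes;
        final long rows;

        public FileInfo(long bytes, long rows) {
            this.bytes = bytes;
            this.rows = rows;
        }
    }

    /** Local stand-in exposing the three values discussed in the thread. */
    public static final class Stats {
        private final List<FileInfo> files;

        public Stats(List<FileInfo> files) {
            this.files = files;
        }

        // Assumption: total size is the sum of the individual file sizes.
        public OptionalLong sizeInBytes() {
            return OptionalLong.of(files.stream().mapToLong(f -> f.bytes).sum());
        }

        // Assumption: total row count is the sum of per-file row counts.
        public OptionalLong numRows() {
            return OptionalLong.of(files.stream().mapToLong(f -> f.rows).sum());
        }

        // No per-column statistics are reported.
        public Map<String, Object> columnStats() {
            return Map.of();
        }
    }

    public static void main(String[] args) {
        Stats stats = new Stats(List.of(new FileInfo(100, 10), new FileInfo(250, 20)));
        System.out.println(stats.sizeInBytes().getAsLong()); // prints 350
        System.out.println(stats.numRows().getAsLong());     // prints 30
        System.out.println(stats.columnStats().isEmpty());   // prints true
    }
}
```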