
Fix size in bytes calculation for ORC DictionaryBuilder #25615

Closed
mcarmonagonzalez wants to merge 6 commits into
trinodb:master from
mcarmonagonzalez:orc-dictionary-size-fix

Conversation

@mcarmonagonzalez
Contributor

Description

Commit a1fcb0d refactored VariableWidthBlockBuilder out of the ORC DictionaryBuilder class. Before that change, DictionaryBuilder reported its size in bytes by calling elementBlock.getSizeInBytes() on the underlying block, which calculated the size from the output slice size and the bytes stored per entry (position).

After the refactor, DictionaryBuilder reports its size as the sum of the slice output size and the size of the offsets array. This over-reports the size, because not every position of the offsets array may be in use. Over-reporting the size in bytes of the DictionaryBuilder causes issues downstream when doing file size optimizations.

This PR changes the size in bytes calculation to be similar to what it was before the refactor, so that it accurately reports how many bytes are actually in use.
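To illustrate the difference between the two ways of counting, here is a minimal, hypothetical sketch. It is not the Trino DictionaryBuilder code: the class name `DictionarySizeSketch`, its fields, and the choice of `Integer.BYTES` as the per-entry overhead are all assumptions made for illustration. It only shows why counting the full length of a preallocated offsets array reports more bytes than counting one offset per stored entry.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch only; names and layout are assumptions, not Trino's implementation.
final class DictionarySizeSketch
{
    private byte[] data = new byte[16];      // stand-in for the slice output
    private int dataSize;                    // bytes actually written to data
    private int[] offsets = new int[8 + 1];  // preallocated; offsets[i + 1] is the end of entry i
    private int positions;                   // number of entries stored so far

    void add(byte[] value)
    {
        // Grow the data buffer if the new value does not fit
        if (dataSize + value.length > data.length) {
            data = Arrays.copyOf(data, Math.max(data.length * 2, dataSize + value.length));
        }
        System.arraycopy(value, 0, data, dataSize, value.length);
        dataSize += value.length;

        // Grow the offsets array if the next end offset does not fit
        if (positions + 1 >= offsets.length) {
            offsets = Arrays.copyOf(offsets, offsets.length * 2);
        }
        positions++;
        offsets[positions] = dataSize;
    }

    // Counts the whole offsets array, including slots no entry uses yet
    long allocatedSizeInBytes()
    {
        return dataSize + (long) offsets.length * Integer.BYTES;
    }

    // Counts only the bytes written plus one offset per stored entry
    long inUseSizeInBytes()
    {
        return dataSize + (long) positions * Integer.BYTES;
    }

    public static void main(String[] args)
    {
        DictionarySizeSketch builder = new DictionarySizeSketch();
        builder.add("a".getBytes(StandardCharsets.UTF_8));
        builder.add("bc".getBytes(StandardCharsets.UTF_8));
        builder.add("def".getBytes(StandardCharsets.UTF_8));
        System.out.println("allocated: " + builder.allocatedSizeInBytes());
        System.out.println("in use:    " + builder.inUseSizeInBytes());
    }
}
```

With three short entries in this sketch, the allocated figure includes all nine preallocated offset slots, while the in-use figure includes only three. A writer that compares the reported size against a threshold would see the first figure cross that threshold earlier than the data warrants.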

Additional context and related issues

Release notes

(x) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot Bot added the cla-signed label Apr 17, 2025
@mcarmonagonzalez mcarmonagonzalez force-pushed the orc-dictionary-size-fix branch from 8791106 to 61ac830 Compare April 17, 2025 15:22
@github-actions github-actions Bot added docs iceberg Iceberg connector labels Apr 17, 2025

3 participants