
Frequency partitioning and FOR encoding rebased & synced with nimble/main#636

Open
David-C-L wants to merge 30 commits into facebookincubator:main from David-C-L:freq-part-and-for-synced

Conversation


@David-C-L David-C-L commented Apr 3, 2026

TPCH SF10 Compression Rates per Column

freq-part-and-FOR-compression-rates-TPCH-SF10

- Bold bars indicate encodings that allow value-granularity random access.
- Frequency Partitioning's efficacy depends on the value-type size in bytes and the number of unique values (along with the distribution of values, which is roughly uniform for TPC-H). These plots show frequency partition encoding is effective on the same columns as dictionary encoding (because num_unique_values < 2^(num_bits_for_type), allowing keys to be small). However, frequency partitioning improves on dictionary encoding when many frequent values can be given a smaller key size than dictionary encoding would apply, seen most prominently with **l_suppkey** and **l_linenumber**.
- FOR encodings (PFORDelta and TurboPFOR here) are particularly effective when the domain of values covers a small range within frames (e.g. monotonically increasing or clustered values). Their efficacy is most prominent with **l_orderkey**, where values are large but fall within ranges for different order batches (clustering).
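The key-size advantage over dictionary encoding described above can be illustrated with a small sketch. This is a hypothetical simplification, not the PR's actual implementation: distinct values are ranked by frequency and greedily placed into partitions of growing fixed bit-width (1, 2, 3, ... bits, where a width-w partition can distinguish 2^w values), so a heavily skewed column pays far fewer bits per row than dictionary encoding's uniform ceil(log2(#unique)).

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical sketch: assign each distinct value to a partition whose
// fixed key width grows as 1, 2, 3, ... bits; a partition of width w can
// distinguish up to 2^w values. Frequent values land in narrow partitions.
double avgBitsFrequencyPartitioned(const std::vector<int32_t>& column) {
  std::map<int32_t, uint64_t> freq;
  for (int32_t v : column) {
    ++freq[v];
  }
  std::vector<uint64_t> counts;
  for (const auto& entry : freq) {
    counts.push_back(entry.second);
  }
  std::sort(counts.rbegin(), counts.rend()); // most frequent first
  uint64_t totalBits = 0;
  uint64_t i = 0;
  uint64_t width = 1;
  while (i < counts.size()) {
    const uint64_t capacity = 1ull << width; // values this partition holds
    for (uint64_t k = 0; k < capacity && i < counts.size(); ++k, ++i) {
      totalBits += counts[i] * width;
    }
    ++width;
  }
  return double(totalBits) / double(column.size());
}

// Plain dictionary: every row pays ceil(log2(#unique)) bits for its key.
double avgBitsDictionary(const std::vector<int32_t>& column) {
  std::map<int32_t, uint64_t> freq;
  for (int32_t v : column) {
    ++freq[v];
  }
  return std::ceil(std::log2(double(freq.size())));
}
```

For a column where one value dominates, the skewed rows mostly pay 1 bit instead of the flat dictionary width, matching the gains seen on l_suppkey and l_linenumber.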

Moderate number of unique int32 values following Zipfian distribution with varying alpha

freq-part-order-preserving-index-overhead-synthetic-zipfian-int32

- The table at the bottom indicates the properties of each encoding: the top row shows whether the encoding supports value-granularity random access, and the bottom row shows whether it preserves the initial order of the value stream.
- Frequency Partitioning is as effective as dictionary encoding for datasets with little bias in their distribution (i.e. Zipfian with alpha = 0 or 0.5), since there is little difference in value frequencies to exploit by assigning smaller keys. For more biased datasets (alpha >= 1.0) it is highly effective, remaining competitive with Zstd and OpenZL (encoding level 2). This is because very frequent values, which cover large swathes of the dataset, can be assigned small keys. As the initial value-stream order may be important for some rec-sys training, each frequency partitioning bar includes the overhead of the most compressed order-preserving index.
- FOR encodings (PFORDelta and TurboPFOR here) perform poorly on this dataset when little bias is present: the order of values is random, so there is no common reference point to exploit within a given frame. As more bias is introduced, however, fewer values comprise larger proportions of the dataset, frames contain many repeats of the same values, and the probability of small residuals (especially 0) rises, so the encoding compresses better -- albeit not as well as the other schemes, since the dataset has little ordering.
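The frame-of-reference behaviour described above can be sketched in a few lines. This is a simplified FOR without the patched-exception machinery of PFORDelta/TurboPFOR: each frame stores its minimum as the reference plus non-negative residuals, so clustered frames yield small residuals while randomly ordered values do not.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Simplified frame-of-reference sketch (no patched exceptions as in
// PFORDelta/TurboPFOR): each frame stores its minimum as the reference
// and per-value residuals, which stay small when values are clustered.
struct Frame {
  int64_t reference;               // frame minimum
  std::vector<uint64_t> residuals; // value - reference, all non-negative
};

std::vector<Frame> forEncode(const std::vector<int64_t>& values,
                             size_t frameSize) {
  std::vector<Frame> frames;
  for (size_t i = 0; i < values.size(); i += frameSize) {
    const size_t end = std::min(i + frameSize, values.size());
    const int64_t ref =
        *std::min_element(values.begin() + i, values.begin() + end);
    Frame f{ref, {}};
    for (size_t j = i; j < end; ++j) {
      f.residuals.push_back(uint64_t(values[j] - ref));
    }
    frames.push_back(std::move(f));
  }
  return frames;
}

std::vector<int64_t> forDecode(const std::vector<Frame>& frames) {
  std::vector<int64_t> out;
  for (const auto& f : frames) {
    for (uint64_t r : f.residuals) {
      out.push_back(f.reference + int64_t(r));
    }
  }
  return out;
}
```

In a real encoding the residuals would be bit-packed at the frame's maximum residual width, which is where the compression comes from.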

Moderate number of unique int32 values -- Order-preserving index overhead

freq-part-and-for-compression-rates-synthetic-zipfian-int32

- The plot shows 4 different strategies for preserving the order of tuples (5 columns, so 5 values per tuple) that have been shuffled across 5 different partitions. Given an original index **x** for the tuple **t**, each order-preserving index stores, or allows the computation of, the partition **p** in which **t** is stored and the index within **p** where **t** can be found.
- The full global index simply stores an int32_t value for the partition index and the index within the partition for each tuple, along with a roaring bitmap for fast lookups.
- The RLE full global index performs RLE compression on each component of the full global index.
- The optimised index drops the int32_t value for the index within the partition, and instead calculates it using the popcount of the roaring bitmap.
- The RLE optimised index performs RLE compression on each component of the optimised index.
- The plot shows each index's total size, its size relative to the full table, and the lookup time complexity.

Comprehensive Encode, Decode, Compression, Memory use for Frequency Partition and FOR on simple Zipfian data

Plots: compression_rate, encode_time, decode_bulk_time, decode_random_access_time, encode_peak_memory, decode_bulk_peak_memory

This PR also contains some changes for compatibility with Clang++-16.


meta-cla Bot commented Apr 3, 2026

Hi @David-C-L!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@David-C-L David-C-L marked this pull request as ready for review April 3, 2026 01:40

meta-cla Bot commented Apr 3, 2026

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 3, 2026
Contributor

srsuryadev commented Apr 3, 2026

Thank you, @David-C-L! Will review and update cc: @zzhao0


meta-codesync Bot commented Apr 3, 2026

@srsuryadev has imported this pull request. If you are a Meta employee, you can view this in D99487883.

Contributor

@srsuryadev srsuryadev left a comment


Thanks for the initial version @David-C-L, added some initial comments

Comment thread dwio/nimble/encodings/FrequencyPartitionEncoding.h Outdated
Comment thread dwio/nimble/encodings/FrequencyPartitionEncoding.h Outdated
Comment thread dwio/nimble/encodings/FrequencyPartitionEncoding.h Outdated
Comment thread dwio/nimble/encodings/FrequencyPartitionEncoding.h Outdated
Comment thread dwio/nimble/encodings/FrequencyPartitionEncoding.h Outdated
Comment thread dwio/nimble/encodings/ForEncoding.h Outdated
Comment thread dwio/nimble/encodings/ForEncoding.h Outdated
Comment on lines +276 to +279
for (uint32_t i = 0; i < rowCount; ++i) {
output[i] = decodeValue(currentRow_ + i);
}

Contributor


We can explore bulk processing here, but it need not be in this PR. For now, it is okay.

Author


I think the bulk processing shouldn't be too hard to implement. I'll give it a quick go, and if it's not too much work I'll add it to the PR.

Author


I've created a decodeRange function to implement bulk processing in 825c731

Contributor


Thank you! Can we update the description with the decode time comparison as well?

@srsuryadev srsuryadev requested a review from zzhao0 April 8, 2026 07:54
// operations for efficient random access.
Prefix = 11,
// Partitions data by value frequency. Frequent values get shorter bit-width
// codes. Rows are reordered to group values with same code length.
Contributor


@David-C-L Can we see if we can achieve better performance without re-ordering the rows?

Author


The row reordering is to enable efficient value-granularity random access. Without reordering, the encoding would be limited to O(n) bulk decoding (similar to Huffman encoding) due to the variable-sized keys. We did explore some indexes (they should be explained in the initial PR description) that could be used as a view for interfacing with the original order, so you get the benefit of reordering for random access while allowing access through the original ordering.

Do you think it's worth implementing these indexes as an option for the encoding?
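The trade-off described in the comment above can be sketched as follows (an illustrative toy, not the PR's code): once rows are grouped into partitions of fixed key width, the bit offset of a key inside its partition is a single multiplication, whereas variable-width keys in original order require summing all preceding widths.

```cpp
#include <cstdint>
#include <vector>

// With reordering into fixed-width partitions, random access is O(1):
// the i-th key in a partition of width w starts at bit i * w.
uint64_t keyBitOffset(uint32_t indexInPartition, uint32_t partitionBitWidth) {
  return uint64_t(indexInPartition) * partitionBitWidth;
}

// Without reordering, each key may have a different width (as in Huffman
// coding), so locating row i means summing the widths of all earlier
// keys: an O(n) scan (or extra offset-index storage).
uint64_t variableWidthBitOffset(const std::vector<uint8_t>& widths,
                                uint32_t rowIndex) {
  uint64_t offset = 0;
  for (uint32_t i = 0; i < rowIndex; ++i) {
    offset += widths[i]; // must visit every preceding key
  }
  return offset;
}
```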

@srsuryadev srsuryadev requested a review from xiaoxmeng April 8, 2026 18:17
@srsuryadev
Contributor

Hi @David-C-L, thank you! The PR is getting into better shape! Can we update the decode time and memory overhead as well, along with the compression ratio, in the same benchmark you used in the description? Thank you

void FrequencyPartitionEncoding<T>::readWithVisitor(
V& visitor,
ReadWithVisitorParams& params) {
detail::readWithVisitorSlow(visitor, params, nullptr, [&] {
Contributor


This is fine for the initial implementation, but we can try readWithVisitorFast, or follow up if needed.

}

template <typename T>
uint32_t FrequencyPartitionEncoding<T>::getTierForRow(uint32_t rowIndex) const {
Contributor


We can try some search optimizations here if possible.
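One natural search optimisation for a per-row tier lookup like getTierForRow (sketched here as a hypothetical free function, not the PR's implementation) is to precompute the exclusive prefix sum of per-tier row counts once, then binary-search it, replacing any linear scan over tiers with O(log #tiers) per lookup.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical sketch: tierRowOffsets is the exclusive prefix sum of row
// counts per tier (tierRowOffsets[0] == 0, sorted ascending). The tier
// containing rowIndex is found by binary search instead of a linear scan.
uint32_t tierForRow(const std::vector<uint32_t>& tierRowOffsets,
                    uint32_t rowIndex) {
  auto it = std::upper_bound(
      tierRowOffsets.begin(), tierRowOffsets.end(), rowIndex);
  return uint32_t(it - tierRowOffsets.begin()) - 1;
}
```

Since tier counts are fixed after encoding, the offsets vector can be built once at load time and shared by all lookups.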

@srsuryadev srsuryadev requested a review from pedroerp May 6, 2026 16:55