feat: Add iceberg data file statistics#17388
Conversation
✅ Deploy Preview for meta-velox canceled.
|
Build Impact AnalysisSelective Build Targets (building these covers all 511 affected)Total affected: 511/575 targets
Affected targets (511)Directly changed (18)
Transitively affected (493)
Slow path • Graph generated from PR branch |
25e85e8 to
434217a
Compare
CI Failure Analysis
🔴 Expression Fuzzer with Presto SOT — FUZZER Failure View logsFailed instances: All 4 instances failed (seeds: 59238549, 932524077, 400533294, 254235930) All instances failed with The mismatch data involves map columns with key-value pairs (BIGINT keys, DOUBLE values), indicating a result comparison failure between Velox evaluation and the Presto reference server. 🔴 Join Fuzzer — FUZZER Failure View logsFailed instances: 2 of 4 instances failed (seeds: 226857074, 292938254); instances 2 and 3 passed. Instance 1 (seed=226857074): Row mismatch in map column — a single key value differs by a small amount ( Instance 4 (seed=292938254): Similar 🔴 Window Fuzzer with Presto as source of truth — FUZZER Failure View logsFailed instances: 2 of 4 instances failed (seeds: 233822625, 432741263); instances 1 and 3 passed. Instance 2 (seed=233822625): Instance 4 (seed=432741263): Both involve Correlation with PR changes: The PR (#17388) modifies only:
None of these files are related to expression evaluation, join execution, window functions, or TIMESTAMP WITH TIME ZONE handling. The fuzzer failures are in the expression evaluation layer ( Known issues:
These are pre-existing/flaky failures unrelated to this PR. Reproduce locally: # Expression Fuzzer (any seed from: 59238549, 932524077, 400533294, 254235930)
./_build/debug/velox/expression/fuzzer/velox_expression_fuzzer_test \
--seed 59238549 --duration_sec 60 --logtostderr --minloglevel=0
# Join Fuzzer (seeds: 226857074, 292938254)
./_build/debug/velox/exec/fuzzer/velox_join_fuzzer_test \
--seed 226857074 --duration_sec 60 --logtostderr --minloglevel=0
# Window Fuzzer (seeds: 233822625, 432741263)
./_build/debug/velox/functions/prestosql/fuzzer/velox_window_fuzzer_test \
--seed 233822625 --duration_sec 60 --logtostderr --minloglevel=0Note: Reproducing these failures locally requires a running Presto server as the reference source of truth (the fuzzers compare Velox results against Presto query results). Recommended fix: No action needed from this PR. These are pre-existing fuzzer failures on |
fd6409c to
5a2a746
Compare
There was a problem hiding this comment.
Pull request overview
This PR restores and extends Iceberg Parquet data-file statistics collection and fixes failures caused by applying UTF-8 “round up” logic to non-UTF8 VARBINARY values. It also introduces a cross-format FileMetadata return from dwio::common::Writer::close() so Iceberg can aggregate Parquet row-group metadata into Iceberg commit-task metrics.
Changes:
- Add
roundUpBinary()and use it for Parquet BYTE_ARRAY upper-bound computation when the column is logical BINARY/VARBINARY (avoids UTF-8 validation/infinite-loop behavior). - Introduce
dwio::common::FileMetadataand change writerclose()APIs to return format-specific metadata (implemented for Parquet/DWRF/Text and plumbed throughSortingWriter). - Add Iceberg stats plumbing (
IcebergDataFileStatistics,IcebergParquetStatsCollector) and comprehensive Parquet stats tests; update Iceberg commit message metrics to include full Iceberg metrics JSON.
Reviewed changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| velox/functions/lib/string/StringImpl.h | Adds roundUpBinary() for Iceberg-style binary upper bounds. |
| velox/dwio/text/writer/TextWriter.h | Updates writer close() to return FileMetadata; adds placeholder metadata type. |
| velox/dwio/text/writer/TextWriter.cpp | Returns TextFileMetadata from close(). |
| velox/dwio/parquet/writer/WriterConfig.h | Adds Parquet writer config constants split out from Arrow-heavy headers. |
| velox/dwio/parquet/writer/Writer.h | Adds ParquetFileMetadata, makes close() return metadata, and wires WriterOptions to WriterConfig. |
| velox/dwio/parquet/writer/Writer.cpp | Returns ParquetFileMetadata from close() after flushing/closing Arrow writer. |
| velox/dwio/parquet/writer/CMakeLists.txt | Exposes WriterConfig.h as interface headers. |
| velox/dwio/parquet/writer/arrow/Statistics.cpp | Chooses UTF-8 vs binary rounding for BYTE_ARRAY upper bounds based on logical type. |
| velox/dwio/dwrf/writer/Writer.h | Updates close() signature and adds placeholder metadata type. |
| velox/dwio/dwrf/writer/Writer.cpp | Returns DwrfFileMetadata from close(). |
| velox/dwio/common/Writer.h | Changes Writer::close() to return std::unique_ptr<FileMetadata>. |
| velox/dwio/common/tests/WriterTest.cpp | Updates mock writer to match new close() signature. |
| velox/dwio/common/tests/SortingWriterTest.cpp | Updates mock writer to match new close() signature. |
| velox/dwio/common/SortingWriter.h | Updates close() signature and documents metadata return. |
| velox/dwio/common/SortingWriter.cpp | Propagates metadata return from wrapped writer. |
| velox/dwio/common/FileMetadata.h | Adds new base class for format-specific file metadata. |
| velox/dwio/common/CMakeLists.txt | Adds FileMetadata.h to the common library headers. |
| velox/connectors/hive/iceberg/tests/IcebergParquetStatsTest.cpp | Adds end-to-end tests for Parquet Iceberg stats (bounds/counts/nulls/nans/nested). |
| velox/connectors/hive/iceberg/tests/CMakeLists.txt | Registers the new Parquet stats test. |
| velox/connectors/hive/iceberg/IcebergParquetStatsCollector.h | Declares aggregator from Parquet metadata to Iceberg metrics. |
| velox/connectors/hive/iceberg/IcebergParquetStatsCollector.cpp | Implements stats aggregation across row groups and bounds handling rules. |
| velox/connectors/hive/iceberg/IcebergDataSink.h | Adds stats storage, Parquet stats collector, and overrides rotation/close to capture metadata. |
| velox/connectors/hive/iceberg/IcebergDataSink.cpp | Populates commit-task metrics from collected file stats and captures writer metadata on rotate/close. |
| velox/connectors/hive/iceberg/IcebergDataFileStatistics.h | Introduces an Iceberg metrics struct with JSON serialization. |
| velox/connectors/hive/iceberg/IcebergDataFileStatistics.cpp | Implements toJson() for commit-task metrics. |
| velox/connectors/hive/iceberg/CMakeLists.txt | Builds/links the new Iceberg stats sources (conditionally for Parquet). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
e0e8e95 to
cd8ed81
Compare
mbasmanova
left a comment
There was a problem hiding this comment.
@mohsaka Thank you for the updates!
Re roundUpBinary: StringImpl.h is a general-purpose string utilities header. roundUpBinary is Iceberg-specific (the docstring even says "used in Apache Iceberg"). Placing it there pollutes a general header with domain-specific logic. The only caller is Statistics.cpp — please move roundUpBinary there as a free function in an anonymous namespace. The concern about Iceberg code in the Parquet area is moot since the dependency already exists. You can still have unit tests — just place both the functions and the tests next to Statistics.cpp, not in a different module. Same applies to roundUpUtf8 — its only caller is also Statistics.cpp. Since you're already touching this code, please move both functions there.
Re WriterConfig: I checked D93449945 (PR #16062) — WriterConfig exists because Gluten's WholeStageResultIterator.cc needs the config constants but can't include Writer.h due to Arrow header conflicts. That's a legitimate reason to keep it. However, WriterOptions should not inherit from WriterConfig — please drop the multiple inheritance. All callers should reference the constants as WriterConfig::kFoo (there are ~31 usages, all in ParquetWriterTest.cpp — a straightforward find-and-replace). Please also add a comment in WriterConfig.h explaining why these constants must stay in a separate header.
Landing risk: This PR changes the Writer::close() signature from virtual void close() to virtual std::unique_ptr<FileMetadata> close() in dwio/common/Writer.h. PR #16062 attempted the same change and it broke Gluten builds and caused vtable mismatches (heap-buffer-overflow in TextWriter::close()), blocked a release, and required VELOX_ENABLE_BACKWARD_COMPATIBILITY macro workarounds across dozens of internal BUCK targets. PR #16062 was merged and then reverted (PR #16999), so none of those workarounds are in place.
Please split out the API changes into a small prep PR: the Writer::close() signature change and WriterConfig.h. We'll land it internally and work through any build issues (Gluten, Axiom, etc.) before adding the Iceberg stats on top.
|
FYI, Folks I am working on fixing the netlify errors - Hoping to have a PR soon ( within a day). |
|
Prep-PR opened here: |
…Config constants (#17509) Summary: This PR makes two improvements to the Velox writer infrastructure: **1. Return FileMetadata from Writer::close()** - Modified `Writer::close()` to return `std::unique_ptr<FileMetadata>` instead of void - Added `FileMetadata` base class and format-specific implementations (`ParquetFileMetadata`, `TextFileMetadata`) - Enables callers to access file-level statistics and metadata after writing - Returns nullptr for empty files **2. Add WriterConfig constants** - Created new `WriterConfig.h` header with Parquet writer configuration constants - Allows external projects (e.g., Gluten) to access config constants without Arrow dependencies - Updated all test references to use new constants All existing tests updated and passing. Prep-PR for #17388. Pull Request resolved: #17509 Reviewed By: apurva-meta Differential Revision: D105173381 Pulled By: mbasmanova fbshipit-source-id: 39bfbbd2445ea73291c8551e142e79120b8cefde
@mbasmanova Thank you for merging in the other two comments. I've addressed the last one in the newest commit of this PR. I couldn't quite wrap my head around putting the functions into Statistics.cpp and as well as the test cases? That being said, I've moved out the Iceberg statistics functions from StringImpl.h and moved them into a dedicated header under I did attempt to create an anonymous namespace with these functions in Statistics.cpp but that made them only accessible from within the cpp file itself. Therefore I was not able to access them from Could you let me know if this is okay? Or you had something else in mind. Thank you! |
95e10de to
1f5df19
Compare
|
@mohsaka Thank you for the update! A few issues with the current placement:
|
|
@mbasmanova Thank you for the feedback again! I couldn't find Addressed all of the other comments. Please take another look when you have the chance! |
mbasmanova
left a comment
There was a problem hiding this comment.
Thank you for addressing all the feedback!
|
Hi @mohsaka, I'm landing fixes on top of your PR (D104861208 / #17388) to clear the import-side CI failures and address a few inline review items. Heads-up so you're not surprised when the merged commit shows additional OSS-visible changes.
Let me know if any of these don't sit right. |
|
@mbasmanova Everything looks fine to me. Thank you! |
|
@mbasmanova merged this pull request in fd130f4. |
Adds Iceberg data file statistics collection (column sizes, value counts, null/NaN counts, min/max bounds) to the Parquet writer path. Statistics are aggregated from Parquet row group metadata after each file is closed and included in the Iceberg commit message.
Re-lands #16062 and #16867, which were reverted in #16999 due to
roundUpUtf8entering an infinite loop on non-UTF8 varbinary data. The fix addsroundUpBinaryfor computing upper bounds on raw binary data (follows Iceberg'sBinaryUtil.truncateBinaryMax()).Statistics.cppnow dispatches toroundUpUtf8for STRING orroundUpBinaryfor BINARY based on the Parquet logical type.Additional fixes:
dynamic_pointer_castinstead ofcheckedPointerCastfor writer options.toManifestFormatString()instead of hardcoded"PARQUET".Added a roundUpBinary function based off of iceberg's truncateBinaryMax taken from:
https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/util/BinaryUtil.java
Prestissimo CI/CD:
prestodb/presto#27702
Some additional comments from the original backed out PRs:
Writer close function now returns an optional
virtual std::unique_ptr<FileMetadata>. When the code got reverted we were already at FileMetadata:#16801
However, when it was reverted, we switched back to void. We now switch back once again to FileMetadata.