feat(parquet): Add NaN statistics to Parquet writer #14725

Closed
PingLiuPing wants to merge 1 commit into facebookincubator:main from PingLiuPing:lp_parquet_metadata_nan

@PingLiuPing
Collaborator

Add NaN statistic to Parquet writer

This change introduces a NaN statistic in the Parquet writer. Unlike other statistics (e.g., null_count, distinct_count), the NaN statistic is not written to the Parquet file footer. It is only reported to the Parquet writer caller when needed, such as when writing Iceberg Parquet data files.
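
As a rough illustration of the idea in the description, the sketch below counts NaN values while a floating-point column is written and keeps the count next to, but separate from, the statistics that go into the footer. `ColumnWriteStats` and `updateNanCount` are hypothetical names chosen for this example; they are not taken from the PR or from the Velox or Arrow code.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical writer-side statistics holder; not a Velox or Arrow type.
struct ColumnWriteStats {
  int64_t nullCount{0}; // Would also be serialized to the file footer.
  int64_t nanCount{0};  // Kept in memory only and reported to the caller.
};

// Counts the NaN values in one batch of a floating-point column and
// accumulates the result into 'stats'.
template <typename T>
void updateNanCount(const std::vector<T>& values, ColumnWriteStats& stats) {
  for (const T value : values) {
    if (std::isnan(value)) {
      ++stats.nanCount;
    }
  }
}
```

A caller such as an Iceberg stats collector could read `nanCount` once the file is closed, while the footer serialization would only ever look at `nullCount`.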

@meta-cla meta-cla Bot added the CLA Signed label Sep 4, 2025
@netlify

netlify Bot commented Sep 4, 2025

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 34b4af9
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/697e69665c2c4b0008e9e8ff

@PingLiuPing
Collaborator Author

@majetideepak Can you help review this PR? Thank you very much.

max_value(false),
min_value(false) {}
bool max : 1;
bool min : 1;
bool null_count : 1;
bool distinct_count : 1;
bool nan_count : 1;
Contributor

Just FYI: The Arrow Parquet community has been discussing NaN and there is a recent PR: https://github.com/apache/parquet-format/pull/514/files#top. The nan_count will be added as the 9th member of the Statistics class. Ideally these .cpp and .h files should be auto-generated, because editing them by hand is brittle and easy to get wrong. But I saw there were already some changes made to these generated cpp/h files, so I regard this as a temporary solution. Would you please add a comment here and reference https://github.com/apache/parquet-format/pull/514/files#top?

Collaborator Author

@yingsu00 Thank you for the comments. Do you know what the routine is in Velox for updating parquet.thrift?
I will add a comment here referencing https://github.com/apache/parquet-format/pull/514/files#top.

Collaborator Author

Checking the latest parquet.thrift, I see we are missing fields 7 and 8.

Collaborator Author

I added a comment at line 525, where this data member is introduced.

Contributor

@PingLiuPing It was an old version I downloaded years ago. We may want to update it and regenerate the cpp/h at some time in the near future.

@@ -677,6 +677,11 @@ void Statistics::__set_distinct_count(const int64_t val) {
__isset.distinct_count = true;
}

void Statistics::__set_nan_count(const int64_t val) {
Contributor

This comment is for Statistics::read() on line 699 and Statistics::write() on line 776. If you add a new field to the Statistics class, you'll have to update these two function implementations as well. Otherwise the NaN count won't be read or written correctly.
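
To make the point concrete, here is a toy model of an optional field in a Thrift-style struct. It is only a sketch: `ToyStatistics`, the two `write...` functions, and the field ids are invented for this example and are much simpler than the real generated Statistics class. It shows that adding a setter alone is not enough, because a serializer that does not know about the field silently drops it.

```cpp
#include <cstdint>
#include <map>

// Toy stand-in for a Thrift-generated struct with optional fields.
struct ToyStatistics {
  int64_t nullCount{0};
  int64_t nanCount{0};
  // These flags play the role of the generated __isset members.
  bool hasNullCount{false};
  bool hasNanCount{false};

  void setNullCount(int64_t value) {
    nullCount = value;
    hasNullCount = true;
  }
  void setNanCount(int64_t value) {
    nanCount = value;
    hasNanCount = true;
  }
};

// Field ids used by this toy only.
constexpr int16_t kNullCountId = 3;
constexpr int16_t kNanCountId = 9;

// Encoded form: field id -> value, for fields that are set.
using Encoded = std::map<int16_t, int64_t>;

// Serializer that predates the new field: nanCount is never emitted.
Encoded writeWithoutNanSupport(const ToyStatistics& stats) {
  Encoded out;
  if (stats.hasNullCount) {
    out[kNullCountId] = stats.nullCount;
  }
  return out;
}

// Serializer updated to emit the new optional field when it is set.
Encoded writeWithNanSupport(const ToyStatistics& stats) {
  Encoded out = writeWithoutNanSupport(stats);
  if (stats.hasNanCount) {
    out[kNanCountId] = stats.nanCount;
  }
  return out;
}
```

The same reasoning applies to the read path: a reader that does not handle the field id leaves the member at its default value.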

Collaborator Author

@yingsu00 Thanks, I get your point. As I mentioned in the PR description, this NaN statistic will not be written to the Parquet file footer; it’s only used to count NaNs during writing. Do you think it’s mandatory to include this field in the Parquet file?

Contributor

it’s only used to count NaNs during writing.

It's good to keep the cpp/h files conforming to the .thrift file. If we update .thrift but do not update the .cpp/.h, the code will soon become hard to maintain. The best way is to update .thrift and regenerate the cpp and h. If we are not doing that now, then at least keep the files in agreement with each other.

Collaborator Author

@yingsu00 Thanks.
Velox currently does not support auto-generating these two files, though I found issue #1622. Do you agree to manually change the parquet.thrift file first to match the changes in ParquetThriftTypes.h?
Regarding upgrading the parquet.thrift file, I think we need to consider:

  1. There are two commits each for ParquetThriftTypes.h and ParquetThriftTypes.cpp. When generating new versions of the files, we must make sure to pick those changes back.
  2. We need to check the Arrow Parquet reader and writer; there are lots of changes to the copied Arrow writer and reader code, which might make the upgrade difficult.

@yingsu00 yingsu00 self-requested a review September 6, 2025 03:53
@PingLiuPing PingLiuPing force-pushed the lp_parquet_metadata_nan branch from 709d1ae to 3f392dd Compare September 8, 2025 10:31
@majetideepak
Collaborator

majetideepak commented Sep 11, 2025

@PingLiuPing As Ying mentioned, we should not edit the ParquetThrift files. We should wait for https://github.com/apache/parquet-format/pull/514/files#top to land.

this NaN statistic will not be written to the Parquet file footer; it’s only used to count NaNs during writing

Can you clarify this statement? What is the purpose of this during writing?

Iceberg Parquet data files.

Are these different from regular Parquet files?

@PingLiuPing
Collaborator Author

Thanks @majetideepak for the comment.

Regarding https://github.com/apache/parquet-format/pull/514/files#top, the catch is this: suppose it gets merged. From Velox's point of view, how do we pick up the changes from parquet-format? Changes have already been made to ParquetThriftTypes.h and ParquetThriftTypes.cpp. And should we also update the Arrow Parquet writers?

this NaN statistic will not be written to the Parquet file footer; it’s only used to count NaNs during writing

In the Velox version of Parquet, NaN is not a supported stats field, so I just count it on the fly and report it to the Iceberg stats collector. It will not be written to the actual Parquet metadata, so the Parquet data file is identical to other data files.

@PingLiuPing
Collaborator Author

@majetideepak Regarding upgrading the Parquet version in Velox, I see https://github.com/apache/parquet-format/pull/514/files#top is still not merged, and I think Arrow will also take some time to support this.
Could you please advise what the next step is? I can draft a PR to upgrade the Parquet version if needed.

@majetideepak
Collaborator

report it to Iceberg stats collector.

How does this work?

@PingLiuPing
Collaborator Author

report it to Iceberg stats collector.

How does this work?

Thanks.
The NaN stats are retrieved through the Iceberg stats collector.
You can refer to this code: https://github.com/facebookincubator/velox/pull/14272/files#diff-a1e71e9d7b5dc2bccc4fbd79fdb26bd2b572398a59ce6dd1662b2d2a18eedc8dR90

@PingLiuPing
Collaborator Author

@majetideepak Regarding

@PingLiuPing As Ying mentioned, we should not edit the ParquetThrift files. We should wait for https://github.com/apache/parquet-format/pull/514/files#top to land.

apache/parquet-format#514 has still not been merged.
Just to check, what is our strategy on this? Should we keep waiting?
And suppose it gets merged, what is the next step for us? Should we also wait for Arrow to adopt those changes?

And then what are the steps in Velox to upgrade Arrow?
Could you please share some thoughts on this? I can do some preparation work if required.

@PingLiuPing
Collaborator Author

PingLiuPing commented Nov 15, 2025

@mbasmanova Could you share some thoughts on Velox's strategy of upgrading parquet writer code? Thank you very much.

@mbasmanova
Contributor

@mbasmanova Could you share some thoughts on Velox's strategy of upgrading parquet writer code? Thank you very much.

@PingLiuPing What's the context for this question? Is there a summary I can read somewhere?

@PingLiuPing
Collaborator Author

@mbasmanova Thanks for looking at this.

Iceberg requires NaN statistics to be collected when writing Parquet data files, but the current Velox Parquet writer does not write such stats. The purpose of this PR is to add that support so that NaN stats can be collected when writing Parquet data files.

During review, we found that NaN stats are being discussed and will be introduced by apache/parquet-format#514.
But the open questions for us are:

  1. The Parquet community has still not merged PR 514.
  2. Suppose Parquet merges that PR: since Velox currently copies the Parquet writer code from Arrow, we still need to wait for Arrow to support this.
  3. Velox has made a few changes to the Parquet writer code, so it does not seem to be a straightforward cherry-pick from Arrow.

@mbasmanova
Contributor

@PingLiuPing Thank you for the context. Velox has its own copy of Parquet writer code, right? If so, there is no "upgrading" to discuss. Am I missing something? Would you ask the question again?

CC: @majetideepak Deepak, you used to be the owner of anything Parquet related? Are you still the owner? If so, what is your take?

@PingLiuPing
Collaborator Author

@mbasmanova Thanks.

Velox has its own copy of Parquet writer code, right? If so, there is no "upgrading" to discuss. Am I missing something? Would you ask the question again?

Sorry for the earlier confusion. Yes, Velox maintains its own copy of the Parquet writer code, but that code was originally imported from Arrow. When I mentioned “upgrading,” I was referring to what Ying and Deepak suggested: the Parquet file format specification has evolved, and we may need to keep Velox’s Parquet writer aligned with the latest format definition.

This PR currently collects nan stats on the fly without writing them into the Parquet file metadata, so no upgrade is required. But since there hasn’t been feedback from other reviewers for quite a while, I want to clarify the direction:

  1. Should we update Velox’s Parquet writer to match the latest Parquet format? If so, what is the general process for doing that?
  2. If we decide not to upgrade, is this PR acceptable as-is?

@majetideepak
Collaborator

majetideepak commented Nov 17, 2025

@PingLiuPing As you mentioned, Velox forked the Arrow Writer code and adapted it to Velox APIs and coding conventions. We can copy new writer improvements if the Arrow Parquet community made any.
We can also update the parquet.thrift format as well if there is a need.

Should we update Velox’s Parquet writer to match the latest Parquet format? If so, what is the general process for doing that?

Can you clarify this further? What are the improvements made compared to the current state?

Regarding this change, you mentioned here #14725 (comment) that another PR will be consuming this. Can we land that PR first?
We should not be modifying the generated ParquetThrift files. Can we keep the NaN stats in the writer separately for now?

@PingLiuPing
Collaborator Author

PingLiuPing commented Nov 17, 2025

Thanks @majetideepak

Can you clarify this further? What are the improvements made compared to the current state?

In terms of statistics, the latest parquet.thrift has these two fields.

   /** If true, max_value is the actual maximum value for a column */
   7: optional bool is_max_value_exact;
   /** If true, min_value is the actual minimum value for a column */
   8: optional bool is_min_value_exact;

Regarding this change, you mentioned here #14725 (comment) that another PR will be consuming this. Can we land that PR first?

Thanks, but it would be great if we support this in the lowest layer first.

We should not be modifying the generated ParquetThrift files. Can we keep the NaN stats in the writer separately for now?

Thanks, I can investigate this direction. But this still does not seem to be the best way, as ideally the change should be made in ParquetThriftTypes.h and .cpp.

@PingLiuPing
Collaborator Author

Can we keep the NaN stats in the writer separately for now?

@majetideepak Would you please elaborate more on this? Thank you.

@PingLiuPing PingLiuPing force-pushed the lp_parquet_metadata_nan branch from 3f392dd to d89f8fe Compare December 30, 2025 14:38
@PingLiuPing
Collaborator Author

@majetideepak I added nan_count to parquet.thrift and then generated the .h and .cpp files from that parquet.thrift file. I then compared these two files with Velox's ParquetThriftTypes.h and ParquetThriftTypes.cpp. What I had changed is similar to the files generated by the thrift compiler; to make it identical, I copied the nan_count-related changes into ParquetThriftTypes.h and ParquetThriftTypes.cpp. The other differences are irrelevant.
Could you have a look? Thanks.

@majetideepak
Collaborator

@PingLiuPing We should not be modifying the parquet.thrift file directly. We risk breaking parquet compatibility.

Collaborator

@majetideepak majetideepak left a comment

@PingLiuPing How do other tools like DuckDB write iceberg files without this NaN statistics present in the parquet.thrift file?

@majetideepak
Collaborator

majetideepak commented Jan 5, 2026

Can we keep the NaN stats in the writer separately for now?
@majetideepak Would you please elaborate more on this? Thank you.

The idea is to have a separate field in the (non-thrift) metadata object to store the NaN stats. But the usage of these NaN stats is not clear from this PR.
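
One way to picture this suggestion is a writer-side metadata wrapper that holds the footer statistics and, next to them, a NaN count that never enters the Thrift-backed struct. This is only a sketch under that assumption; `FooterStatistics` and `ColumnChunkWriteMetadata` are hypothetical names, not classes from Velox or Arrow.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the statistics that mirror parquet.thrift
// and are serialized to the file footer.
struct FooterStatistics {
  int64_t nullCount{0};
};

// Hypothetical (non-Thrift) metadata object owned by the writer. The NaN
// count lives beside the footer statistics, not inside them.
class ColumnChunkWriteMetadata {
 public:
  void setNullCount(int64_t count) {
    footer_.nullCount = count;
  }

  void setNanCount(int64_t count) {
    nanCount_ = count;
  }

  // Only this part would be handed to the footer serializer.
  const FooterStatistics& footerStatistics() const {
    return footer_;
  }

  // Empty for columns where NaN counting does not apply or was not done.
  std::optional<int64_t> nanCount() const {
    return nanCount_;
  }

 private:
  FooterStatistics footer_;
  std::optional<int64_t> nanCount_;
};
```

Using `std::optional` here is a design choice of the sketch: it lets a caller distinguish "no NaN values" from "not collected", for example for integer columns.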

@PingLiuPing
Collaborator Author

PingLiuPing commented Jan 6, 2026

@PingLiuPing How do other tools like DuckDB write iceberg files without this NaN statistics present in the parquet.thrift file?

@majetideepak Thanks. For DuckDB-iceberg, they do not write this stat yet. They also do not support other stats at the moment. They have a PR to collect some of the stats (not NaN) duckdb/duckdb-iceberg#640.

@PingLiuPing
Collaborator Author

The idea is to have a separate field in the (non-thrift) metadata object to store the NaN stats.

Ok, I will try to see if this is possible.

But the usage of these NaN stats is not clear from this PR.

The stats are not used inside Velox, but they should be collected and saved to the Iceberg manifest file. They are used by the upstream engine when planning.

@majetideepak
Collaborator

majetideepak commented Jan 6, 2026

The stats are not used inside Velox, but they should be collected and saved to the Iceberg manifest file. They are used by the upstream engine when planning.

I am reading this as the coordinator requires these stats for better planning. The NaN stats are not written to Parquet but to an Iceberg manifest file. We should aim to store these stats separately as they are not related to Parquet anyway.
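
For context, Iceberg data-file metadata carries a `nan_value_counts` map keyed by column ID, so the writer only needs to hand back per-column counts for each file. The sketch below merges per-row-group counts into one file-level map; `NanCounts` and `mergeNanCounts` are hypothetical names for this example and do not describe how the PR or the Iceberg stats collector is implemented.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical per-row-group result: Iceberg field id -> NaN count.
using NanCounts = std::map<int32_t, int64_t>;

// Sums the per-row-group NaN counts into one file-level map, the shape
// a manifest entry would need.
NanCounts mergeNanCounts(const std::vector<NanCounts>& rowGroups) {
  NanCounts fileLevel;
  for (const auto& rowGroup : rowGroups) {
    for (const auto& entry : rowGroup) {
      fileLevel[entry.first] += entry.second;
    }
  }
  return fileLevel;
}
```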

@PingLiuPing
Collaborator Author

I am reading this as the coordinator requires these stats for better planning. The NaN stats are not written to Parquet but to an Iceberg manifest file. We should aim to store these stats separately as they are not related to Parquet anyway.

Yes, with the current parquet.thrift, NaN is not written to the Parquet file. Once https://github.com/apache/parquet-format/pull/514/files#top is merged and the Velox Parquet writer is upgraded, NaN will be written to the Parquet file.

My initial design is to just collect the NaN stats and report them back to the coordinator node, without writing NaN to the Parquet file.
My first version of the code follows this design, but it changed ParquetThriftTypes.h and ParquetThriftTypes.cpp, and per the review comments this does not seem to be the most correct way to do it. Nevertheless, I noticed that ParquetThriftTypes.h and ParquetThriftTypes.cpp have been changed many times.

@majetideepak
Collaborator

Nevertheless, I noticed that ParquetThriftTypes.h and ParquetThriftTypes.cpp have been changed many times.

These files will anyway go away with this PR #14942

@PingLiuPing
Collaborator Author

Nevertheless, I noticed that ParquetThriftTypes.h and ParquetThriftTypes.cpp have been changed many times.

These files will anyway go away with this PR #14942

Oh, thanks. Let me investigate if there are ways to do this without code change in ParquetThriftTypes.h and ParquetThriftTypes.cpp

@PingLiuPing PingLiuPing force-pushed the lp_parquet_metadata_nan branch from d89f8fe to eecbff4 Compare January 7, 2026 11:59
@PingLiuPing
Collaborator Author

@majetideepak I refactored the code and removed all the changes related to parquet.thrift. Would you have another look? Thank you.

@PingLiuPing
Collaborator Author

@majetideepak Could you please have a look? Thank you.

@PingLiuPing
Collaborator Author

@rui-mo @mbasmanova Wondering if you could take a look at this PR when convenient.

@PingLiuPing PingLiuPing force-pushed the lp_parquet_metadata_nan branch from f876b2b to e6bba87 Compare January 30, 2026 16:17
@PingLiuPing PingLiuPing force-pushed the lp_parquet_metadata_nan branch from e6bba87 to 34b4af9 Compare January 31, 2026 20:43
@mbasmanova mbasmanova changed the title feat: Add NaN statistics to parquet writer feat: Add NaN statistics to Parquet writer Feb 4, 2026
Contributor

@mbasmanova mbasmanova left a comment

@PingLiuPing Thanks.

It is hard to read this code because coding style is inconsistent and so different from Velox. If you plan to continue to develop this code, please, consider updating Parquet writer wholesale to follow Velox coding style?

CC: @pedroerp

@mbasmanova mbasmanova added the ready-to-merge label Feb 4, 2026
@mbasmanova mbasmanova changed the title feat: Add NaN statistics to Parquet writer feat(parquet): Add NaN statistics to Parquet writer Feb 4, 2026
@PingLiuPing
Collaborator Author

@PingLiuPing Thanks.

It is hard to read this code because coding style is inconsistent and so different from Velox. If you plan to continue to develop this code, please, consider updating Parquet writer wholesale to follow Velox coding style?

CC: @pedroerp

@mbasmanova Thank you so much for reviewing.

That makes sense. I’ll open an issue to track this and draft a plan to better align the Parquet writer with Velox coding style.

@mbasmanova
Contributor

I’ll open an issue to track this and draft a plan to better align the Parquet writer with Velox coding style.

@PingLiuPing Thank you. I expect this can be done using an LLM relatively quickly.

@meta-codesync

meta-codesync Bot commented Feb 6, 2026

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this in D92527330.

@meta-codesync

meta-codesync Bot commented Feb 7, 2026

@xiaoxmeng merged this pull request in 8c1a8aa.
