feat(parquet): Add NaN statistics to Parquet writer by PingLiuPing · Pull Request #14725 · facebookincubator/velox

PingLiuPing · 2025-09-04T14:46:46Z

Add NaN statistic to Parquet writer

This change introduces a NaN statistic in the Parquet writer. Unlike other statistics (e.g., null_count, distinct_count), the NaN statistic is not written to the Parquet file footer. It is only reported to the Parquet writer caller when needed, such as when writing Iceberg Parquet data files.
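
As a rough illustration of the idea in the description (not the PR's actual code), a writer could count NaNs per column while it processes values and keep the total in memory only. The names `ColumnWriteStats` and `updateNaNCount` are hypothetical and do not come from Velox.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical per-column statistics kept by the writer. In this sketch
// nanCount is tracked in memory only and is never serialized to the footer.
struct ColumnWriteStats {
  int64_t nullCount = 0;
  int64_t nanCount = 0;
};

// Adds the NaNs found in one batch of floating-point values to the
// running statistics for the column.
template <typename T>
void updateNaNCount(const std::vector<T>& values, ColumnWriteStats& stats) {
  for (const T value : values) {
    if (std::isnan(value)) {
      ++stats.nanCount;
    }
  }
}
```

A caller such as an Iceberg stats collector would read `nanCount` after the column is written; nothing in this sketch touches the Parquet footer.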

PingLiuPing · 2025-09-05T17:28:46Z

@majetideepak Can you help review this PR? Thank you very much.

yingsu00 · 2025-09-06T00:36:50Z

        max_value(false),
        min_value(false) {}
  bool max : 1;
  bool min : 1;
  bool null_count : 1;
  bool distinct_count : 1;
+  bool nan_count : 1;


Just FYI: The Arrow Parquet community has been discussing NaN and there is a recent PR: https://github.com/apache/parquet-format/pull/514/files#top. The nan_count will be added as the 9th member of the Statistics class. Ideally these .cpp and .h files should be auto-generated, because editing them by hand is brittle and easy to get wrong. But I saw there were already some changes made to these generated cpp/h files, so I regard this as a temporary solution. Would you please add a comment here and reference https://github.com/apache/parquet-format/pull/514/files#top?

@yingsu00 Thank you for the comments. Do you know what the process is in Velox for updating parquet.thrift?
I will add a comment here referencing https://github.com/apache/parquet-format/pull/514/files#top.

Checking the latest parquet.thrift, we are missing fields 7 and 8.

I added a comment at line 525, where this data member is introduced.

@PingLiuPing It was an old version I downloaded years ago. We may want to update it and regenerate the cpp/h at some time in the near future.

yingsu00 · 2025-09-06T00:51:11Z

@@ -677,6 +677,11 @@ void Statistics::__set_distinct_count(const int64_t val) {
  __isset.distinct_count = true;
 }

+void Statistics::__set_nan_count(const int64_t val) {


This comment is for Statistics::read() on line 699 and Statistics::write() on line 776. If you add a new field to the Statistics class, you'll have to update these two function implementations as well. Otherwise the NaN count won't be read or written correctly.

@yingsu00 Thanks, I get your point. As I mentioned in the PR description, this NaN statistic will not be written to the Parquet file footer; it’s only used to count NaNs during writing. Do you think it’s mandatory to include this field in the Parquet file?

it’s only used to count NaNs during writing.

It's good to keep the cpp/h files consistent with the .thrift file. If we update the .thrift but do not update the .cpp/.h, the code will soon be hard to maintain. The best way is to update the .thrift and regenerate the cpp and h. If we are not doing that now, then at least keep the files in agreement with each other.

@yingsu00 Thanks.
Currently Velox does not support auto-generating these two files, although I found issue #1622. Do you agree to manually change the parquet.thrift file first to match the changes in ParquetThriftTypes.h?
Regarding upgrading the parquet.thrift file, I think we need to consider:

There are two commits each for ParquetThriftTypes.h and ParquetThriftTypes.cpp. When generating new versions of these files, we need to make sure to port those changes back.

We need to check the Arrow Parquet reader and writer. There are lots of changes to the copied Arrow writer and reader code, which might make an upgrade difficult.

majetideepak · 2025-09-11T15:10:37Z

@PingLiuPing As Ying mentioned, we should not edit the ParquetThrift files. We should wait for https://github.com/apache/parquet-format/pull/514/files#top to land.

this NaN statistic will not be written to the Parquet file footer; it’s only used to count NaNs during writing

Can you clarify this statement? What is the purpose of this during writing?

Iceberg Parquet data files.

Are these different from regular Parquet files?

PingLiuPing · 2025-09-11T15:40:01Z

Thanks @majetideepak for the comment.

Regarding https://github.com/apache/parquet-format/pull/514/files#top, the catch is this: suppose it is merged. From the Velox point of view, how do we pick up the changes from parquet-format? Changes have already been made to ParquetThriftTypes.h and ParquetThriftTypes.cpp. And should we also update the Arrow Parquet writers?

this NaN statistic will not be written to the Parquet file footer; it’s only used to count NaNs during writing

Since NaN is not a supported stats field in the Velox version of Parquet, I just count it on the fly and report it to Iceberg stats collector. It will not be written to the actual Parquet metadata, so the Parquet data file is identical to other data files.

PingLiuPing · 2025-09-17T13:20:07Z

@majetideepak Regarding upgrading the Parquet version in Velox, I see that https://github.com/apache/parquet-format/pull/514/files#top is still not merged. And I think Arrow will also take some time to support this.
Could you please advise what the next step is? I can draft a PR to upgrade the Parquet version if needed.

majetideepak · 2025-09-17T14:38:39Z

report it to Iceberg stats collector.

How does this work?

PingLiuPing · 2025-09-17T15:22:23Z

report it to Iceberg stats collector.

How does this work?

Thanks,
The NaN stats will be retrieved through the Iceberg stats collector.
You can refer to this code https://github.com/facebookincubator/velox/pull/14272/files#diff-a1e71e9d7b5dc2bccc4fbd79fdb26bd2b572398a59ce6dd1662b2d2a18eedc8dR90

PingLiuPing · 2025-10-09T15:43:25Z

@majetideepak Regarding

@PingLiuPing As Ying mentioned, we should not edit the ParquetThrift files. We should wait for https://github.com/apache/parquet-format/pull/514/files#top to land.

apache/parquet-format#514 has still not been merged.
Just to check, what is our strategy on this? Should we keep waiting?
And suppose it gets merged, what is the next step for us? Should we also wait for Arrow to adopt those changes?

And then what are the steps in Velox to upgrade Arrow?
Could you please share some thoughts on this? I can do some preparation work if required.

PingLiuPing · 2025-11-15T21:39:22Z

@mbasmanova Could you share some thoughts on Velox's strategy of upgrading parquet writer code? Thank you very much.

mbasmanova · 2025-11-17T15:23:58Z

@mbasmanova Could you share some thoughts on Velox's strategy of upgrading parquet writer code? Thank you very much.

@PingLiuPing What's the context for this question? Is there a summary I can read somewhere?

PingLiuPing · 2025-11-17T15:34:33Z

@mbasmanova Thanks for looking at this.

Iceberg requires NaN statistics to be collected while writing Parquet data files, but the current Velox Parquet writer does not provide such stats. The purpose of this PR is to add that support so that NaN stats can be collected when writing Parquet data files.

During review, we found that NaN stats are being discussed and will be introduced by apache/parquet-format#514.
But the open questions for us are:

The Parquet community has still not merged PR 514.
Even if Parquet merges that PR, Velox currently copies the Parquet writer code from Arrow, so we would still need to wait for Arrow to support it.
Velox has made a few changes to the copied Parquet writer code, so this would not be a straightforward cherry-pick from Arrow.

mbasmanova · 2025-11-17T15:40:23Z

@PingLiuPing Thank you for the context. Velox has its own copy of Parquet writer code, right? If so, there is no "upgrading" to discuss. Am I missing something? Would you ask the question again?

CC: @majetideepak Deepak, you used to be the owner of anything Parquet related? Are you still the owner? If so, what is your take?

PingLiuPing · 2025-11-17T16:05:17Z

@mbasmanova Thanks.

Velox has its own copy of Parquet writer code, right? If so, there is no "upgrading" to discuss. Am I missing something? Would you ask the question again?

Sorry for the earlier confusion. Yes, Velox maintains its own copy of the Parquet writer code, but that code was originally imported from Arrow. When I mentioned “upgrading,” I was referring to what Ying and Deepak suggested: the Parquet file format specification has evolved, and we may need to keep Velox’s Parquet writer aligned with the latest format definition.

This PR currently collects nan stats on the fly without writing them into the Parquet file metadata, so no upgrade is required. But since there hasn’t been feedback from other reviewers for quite a while, I want to clarify the direction:

Should we update Velox’s Parquet writer to match the latest Parquet format? If so, what is the general process for doing that?
If we decide not to upgrade, is this PR acceptable as-is?

majetideepak · 2025-11-17T16:17:37Z

@PingLiuPing As you mentioned, Velox forked the Arrow Writer code and adapted it to Velox APIs and coding conventions. We can copy new writer improvements if the Arrow Parquet community has made any.
We can also update the parquet.thrift format if there is a need.

Should we update Velox’s Parquet writer to match the latest Parquet format? If so, what is the general process for doing that?

Can you clarify this further? What are the improvements made compared to the current state?

Regarding this change, you mentioned here #14725 (comment) that another PR will be consuming this. Can we land that PR first?
We should not be modifying the generated ParquetThrift files. Can we keep the NaN stats in the writer separately for now?

PingLiuPing · 2025-11-17T16:49:17Z

Thanks @majetideepak

Can you clarify this further? What are the improvements made compared to the current state?

In terms of statistics, the latest parquet.thrift has these two fields.

   /** If true, max_value is the actual maximum value for a column */
   7: optional bool is_max_value_exact;
   /** If true, min_value is the actual minimum value for a column */
   8: optional bool is_min_value_exact;

Regarding this change, you mentioned here #14725 (comment) that another PR will be consuming this. Can we land that PR first?

Thanks, but it would be great if we support this in the lowest layer first.

We should not be modifying the generated ParquetThrift files. Can we keep the NaN stats in the writer separately for now?

Thanks, I can investigate this direction. But it still does not seem like the best way, as ideally the change should be made in ParquetThriftTypes.h and .cpp.

PingLiuPing · 2025-11-24T12:50:05Z

Can we keep the NaN stats in the writer separately for now?

@majetideepak Would you please elaborate more on this? Thank you.

PingLiuPing · 2025-12-30T14:43:16Z

@majetideepak I added nan_count to parquet.thrift and then generated the .h and .cpp files from that parquet.thrift file. Then I compared these two files with Velox's ParquetThriftTypes.h and ParquetThriftTypes.cpp. I think what I had changed is similar to the files generated by the thrift compiler, and to make it identical I copied the nan_count related changes to ParquetThriftTypes.h and ParquetThriftTypes.cpp. The other changes are irrelevant.
Could you help take a look? Thanks.

majetideepak · 2026-01-05T20:54:33Z

@PingLiuPing We should not be modifying the parquet.thrift file directly. We risk breaking parquet compatibility.

majetideepak

@PingLiuPing How do other tools like DuckDB write Iceberg files without this NaN statistic present in the parquet.thrift file?

majetideepak · 2026-01-05T22:05:28Z

Can we keep the NaN stats in the writer separately for now?
@majetideepak Would you please elaborate more on this? Thank you.

The idea is to have a separate field in the (non-thrift) metadata object to store the NaN stats. But the usage of these NaN stats is not clear from this PR.
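
A minimal sketch of that suggestion, assuming a hypothetical writer-side wrapper: the Thrift-generated statistics struct is left untouched, and the NaN count sits next to it in a separate, optional field. `ThriftStatistics` is only a stand-in here, and `ColumnChunkStats` and its methods are invented for illustration; they are not Velox or Arrow names.

```cpp
#include <cstdint>
#include <optional>

// Stand-in for the Thrift-generated statistics struct. It gains no new
// fields, so the serialized footer layout stays as it is.
struct ThriftStatistics {
  int64_t nullCount = 0;
  int64_t distinctCount = 0;
};

// Hypothetical non-Thrift metadata object owned by the writer.
class ColumnChunkStats {
 public:
  void setNaNCount(int64_t count) {
    nanCount_ = count;
  }

  // True only if the writer recorded a NaN count for this column chunk.
  bool hasNaNCount() const {
    return nanCount_.has_value();
  }

  int64_t nanCount() const {
    return nanCount_.value_or(0);
  }

  // Only this part would be handed to the footer serializer.
  const ThriftStatistics& thrift() const {
    return thrift_;
  }

 private:
  ThriftStatistics thrift_;
  std::optional<int64_t> nanCount_;
};
```

Using `std::optional` keeps "not collected" distinct from "collected, zero NaNs", which matters for columns where NaN is not meaningful (for example integer columns).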

PingLiuPing · 2026-01-06T14:47:25Z

@PingLiuPing How do other tools like DuckDB write Iceberg files without this NaN statistic present in the parquet.thrift file?

@majetideepak Thanks. For DuckDB-iceberg, they do not write this stat yet. They also do not support other stats at the moment. They have a PR to collect some of the stats (not NaN) duckdb/duckdb-iceberg#640.

PingLiuPing · 2026-01-06T14:49:33Z

The idea is to have a separate field in the (non-thrift) metadata object to store the NaN stats.

Ok, I will try to see if this is possible.

But the usage of these NaN stats is not clear from this PR.

The stats are not used inside Velox, but they should be collected and saved to the Iceberg manifest file. They are used by the upstream engine when planning.

majetideepak · 2026-01-06T14:53:08Z

The stats are not used inside Velox, but they should be collected and saved to the Iceberg manifest file. They are used by the upstream engine when planning.

I am reading this as the coordinator requires these stats for better planning. The NaN stats are not written to Parquet but to an Iceberg manifest file. We should aim to store these stats separately as they are not related to Parquet anyway.
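
To picture stats that travel outside the Parquet file, here is a small sketch of a per-file summary a writer could hand back to its caller. The Iceberg table spec describes a `nan_value_counts` map from column id to count in its data file metadata; the `DataFileStats` struct and `addNaNCount` function below are hypothetical and are not taken from this PR.

```cpp
#include <cstdint>
#include <map>

// Hypothetical summary returned to the caller once a data file is closed.
// It is independent of anything stored in the Parquet footer.
struct DataFileStats {
  // Column (field) id -> number of NaN values written for that column.
  std::map<int32_t, int64_t> nanValueCounts;
};

// Accumulates the NaN count of one column chunk (for example one row group)
// into the per-file total for that column.
void addNaNCount(DataFileStats& stats, int32_t fieldId, int64_t nanCount) {
  stats.nanValueCounts[fieldId] += nanCount;
}
```

The caller, not the Parquet writer, would then be responsible for turning this map into manifest entries.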

PingLiuPing · 2026-01-06T15:10:56Z

I am reading this as the coordinator requires these stats for better planning. The NaN stats are not written to Parquet but to an Iceberg manifest file. We should aim to store these stats separately as they are not related to Parquet anyway.

Yes, with the current parquet.thrift, NaN is not written to the Parquet file. Once https://github.com/apache/parquet-format/pull/514/files#top is merged and the Velox Parquet writer is upgraded, NaN will be written to the Parquet file.

My initial design was just to collect the NaN stats and report them back to the coordinator node, without writing NaN to the Parquet file.
My first version of the code followed this design, but it changed ParquetThriftTypes.h and ParquetThriftTypes.cpp, and based on the review comments that does not seem to be the right way to do this. Nevertheless I noticed that ParquetThriftTypes.h and ParquetThriftTypes.cpp have been changed many times.

majetideepak · 2026-01-06T16:45:05Z

Nevertheless I noticed that ParquetThriftTypes.h and ParquetThriftTypes.cpp have been changed many times.

These files will anyway go away with this PR #14942

PingLiuPing · 2026-01-06T17:01:17Z

Nevertheless I noticed that ParquetThriftTypes.h and ParquetThriftTypes.cpp have been changed many times.

These files will anyway go away with this PR #14942

Oh, thanks. Let me investigate if there are ways to do this without code change in ParquetThriftTypes.h and ParquetThriftTypes.cpp

PingLiuPing · 2026-01-19T11:07:48Z

@majetideepak I refactored the code and removed all the changes related to parquet.thrift. Would you have another look? Thank you.

PingLiuPing · 2026-01-21T12:30:35Z

@majetideepak Could you please have a look? Thank you.

PingLiuPing · 2026-01-27T20:53:56Z

@majetideepak Could you please have a look? Thank you.

PingLiuPing · 2026-01-29T22:38:01Z

@rui-mo @mbasmanova Wondering if you could take a look at this PR when convenient.

mbasmanova

@PingLiuPing Thanks.

It is hard to read this code because coding style is inconsistent and so different from Velox. If you plan to continue to develop this code, please, consider updating Parquet writer wholesale to follow Velox coding style?

CC: @pedroerp

PingLiuPing · 2026-02-04T09:24:23Z

@PingLiuPing Thanks.

It is hard to read this code because coding style is inconsistent and so different from Velox. If you plan to continue to develop this code, please, consider updating Parquet writer wholesale to follow Velox coding style?

CC: @pedroerp

@mbasmanova Thank you so much for reviewing.

That makes sense. I’ll open an issue to track this and draft a plan to refactor the Parquet writer to better align with Velox coding style.

mbasmanova · 2026-02-04T09:45:21Z

I’ll open an issue to track this and draft a plan to refactor the Parquet writer to better align with Velox coding style.

@PingLiuPing Thank you. I expect this can be done using an LLM relatively quickly.

meta-codesync · 2026-02-06T16:50:19Z

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this in D92527330.

meta-codesync · 2026-02-07T05:33:08Z

@xiaoxmeng merged this pull request in 8c1a8aa.

PingLiuPing requested review from jinchengchenghh and mbasmanova September 4, 2025 14:46

PingLiuPing requested a review from majetideepak as a code owner September 4, 2025 14:46

meta-cla Bot added the CLA Signed label Sep 4, 2025

jinchengchenghh approved these changes Sep 5, 2025

yingsu00 suggested changes Sep 6, 2025

yingsu00 self-requested a review September 6, 2025 03:53

PingLiuPing force-pushed the lp_parquet_metadata_nan branch from 709d1ae to 3f392dd Compare September 8, 2025 10:31

PingLiuPing force-pushed the lp_parquet_metadata_nan branch from 3f392dd to d89f8fe Compare December 30, 2025 14:38

majetideepak requested changes Jan 5, 2026

PingLiuPing force-pushed the lp_parquet_metadata_nan branch from d89f8fe to eecbff4 Compare January 7, 2026 11:59

PingLiuPing force-pushed the lp_parquet_metadata_nan branch from eecbff4 to f876b2b Compare January 29, 2026 22:32

PingLiuPing requested a review from majetideepak January 29, 2026 22:36

PingLiuPing force-pushed the lp_parquet_metadata_nan branch from f876b2b to e6bba87 Compare January 30, 2026 16:17

Add NaN statistics to parquet writer.

34b4af9

PingLiuPing force-pushed the lp_parquet_metadata_nan branch from e6bba87 to 34b4af9 Compare January 31, 2026 20:43

mbasmanova changed the title ~~feat: Add NaN statistics to parquet writer~~ feat: Add NaN statistics to Parquet writer Feb 4, 2026

mbasmanova approved these changes Feb 4, 2026

mbasmanova added the ready-to-merge label Feb 4, 2026

mbasmanova changed the title ~~feat: Add NaN statistics to Parquet writer~~ feat(parquet): Add NaN statistics to Parquet writer Feb 4, 2026

PingLiuPing mentioned this pull request Feb 4, 2026

[parquet] refactor parquet writer code to align with Velox coding standard #16238

Closed

meta-codesync Bot closed this in 8c1a8aa Feb 7, 2026

facebook-github-bot added the Merged label Feb 7, 2026

PingLiuPing mentioned this pull request Feb 10, 2026

feat: Collect Iceberg stats #16062

Closed
