feat(parquet): Add NaN statistics to Parquet writer#14725
Conversation
✅ Deploy Preview for meta-velox canceled.
|
|
@majetideepak Can you help review this PR? Thank you very much. |
| max_value(false), | ||
| min_value(false) {} | ||
| bool max : 1; | ||
| bool min : 1; | ||
| bool null_count : 1; | ||
| bool distinct_count : 1; | ||
| bool nan_count : 1; |
There was a problem hiding this comment.
Just FYI: The Arrow Parquet community has been discussing NAN and there is a recent PR: https://github.com/apache/parquet-format/pull/514/files#top. The nan_count will be added as the 9th member of the Statistics class. Ideally this .cpp and .h file shall be auto generated, because editting this by hand is brittle and easy to get wrong. But I saw there were already some changes made to these generated cpp/h files, so I regard this as a temporary solution. Would you please add a comment here and reference https://github.com/apache/parquet-format/pull/514/files#top?
There was a problem hiding this comment.
@yingsu00 Thank you for the comments. Do you know the routine in velox that when parquet.thrift will be updated?
I will add a comment here by referencing https://github.com/apache/parquet-format/pull/514/files#top.
There was a problem hiding this comment.
By checking the latest parquet.thrift, we have missed field 7 & 8.
There was a problem hiding this comment.
I added a comment at line 525 where this data member is been introduced.
There was a problem hiding this comment.
@PingLiuPing It was an old version I downloaded years ago. We may want to update it and regenerate the cpp/h at some time in the near future.
| @@ -677,6 +677,11 @@ void Statistics::__set_distinct_count(const int64_t val) { | |||
| __isset.distinct_count = true; | |||
| } | |||
|
|
|||
| void Statistics::__set_nan_count(const int64_t val) { | |||
There was a problem hiding this comment.
This comment is for Statistics::read() on line 699 and Statistics::write() on line 776. If you add a new field to Statistics class, you'll have to update these two function implementation as well. Otherwise the nan count won't be read or written correctly.
There was a problem hiding this comment.
@yingsu00 Thanks, I get your point. As I mentioned in the PR description, this NaN statistic will not be written to the Parquet file footer; it’s only used to count NaNs during writing. Do you think it’s mandatory to include this field in the Parquet file?
There was a problem hiding this comment.
it’s only used to count NaNs during writing.
It's good to have the conforming cpp/h file with the .thrift file. If we update .thrift but do not update the .cpp/.h, the code will be hard to maintain soon. The best way is to update .thrift and regenerate the cpp and h. If not doing it now, then at least keep the files agree with each other.
There was a problem hiding this comment.
@yingsu00 Thanks.
Since currently velox do not support auto generate these two files. Found this issue #1622 though. Do you agree to manually change parquet.thrift file first to match the changes in ParquetThriftTypes.h ?
Regarding upgrade parquet.thrift file, I think we need to consider:
- There are two commits each for
ParquetThriftTypes.handParquetThriftTypes.cpp. When generate new version files, make sure to pick the changes back. - Need to check arrow parquet reader and writer, there are lots of changes on the copied arrow writer and reader code. This might difficult to upgrade.
709d1ae to
3f392dd
Compare
|
@PingLiuPing As Ying mentioned, we should not edit the ParquetThrift files. We should wait for https://github.com/apache/parquet-format/pull/514/files#top to land.
Can you clarify this statement? What is the purpose of this during writing?
Are these different from regular Parquet files? |
|
Thanks @majetideepak for the comment. Regarding https://github.com/apache/parquet-format/pull/514/files#top, the catch here is suppose it is been merged. But from velox point of view how to pickup the changes from parquet-format? There are changes been made to ParquetThriftTypes.h and ParquetThriftTypes.cpp already. And should we also update arrow parquet writers?
Since in the velox version of parquet, the NaN is not a supported stats field. I just count this field on the fly and report it to Iceberg stats collector. It will not be wrote to the actual parquet metadata. So the parquet data file is identical with other data files. |
|
@majetideepak Regarding upgrade parquet version in velox, I see https://github.com/apache/parquet-format/pull/514/files#top, still not merged. And I think arrow also takes some time to support this. |
How does this work? |
Thanks, |
|
@majetideepak Regarding
apache/parquet-format#514 still not been merged. And then what's the steps in velox to upgrade arrow? |
|
@mbasmanova Could you share some thoughts on Velox's strategy of upgrading parquet writer code? Thank you very much. |
@PingLiuPing What's the context for this question? Is there a summary I can read somewhere? |
|
@mbasmanova Thanks for looking at this. Iceberg requires nan statistics to be collected during writing parquet data files. But current Velox parquet writer does not write such stats. The purpose of this PR is adding such support so that when writing parquet data files, nan stats can be collected. During review, we found that the nan stats is been discussed and will be introduced from apache/parquet-format#514.
|
|
@PingLiuPing Thank you for the context. Velox has its own copy of Parquet writer code, right? If so, there is no "upgrading" to discuss. Am I missing something? Would you ask the question again? CC: @majetideepak Deepak, you used to be the owner of anything Parquet related? Are you still the owner? If so, what is your take? |
|
@mbasmanova Thanks.
Sorry for the earlier confusion. Yes, Velox maintains its own copy of the Parquet writer code, but that code was originally imported from Arrow. When I mentioned “upgrading,” I was referring to what Ying and Deepak suggested: the Parquet file format specification has evolved, and we may need to keep Velox’s Parquet writer aligned with the latest format definition. This PR currently collects nan stats on the fly without writing them into the Parquet file metadata, so no upgrade is required. But since there hasn’t been feedback from other reviewers for quite a while, I want to clarify the direction:
|
|
@PingLiuPing As you mentioned, Velox forked the Arrow Writer code and adapted it to Velox APIs and coding conventions. We can copy new writer improvements if the Arrow Parquet community made any.
Can you clarify this further? What are the improvements made compared to the current state? Regarding this change, you mentioned here #14725 (comment) that another PR will be consuming this. Can we land that PR first? |
|
Thanks @majetideepak
Intems of statistics, the latest parquet.thrift has these two fields.
Thanks, but it would be great if we support this in the lowest layer first.
Thanks, I can investigate this direction. But seems this way still seems not the best way as ideally the change should be made in ParquetThriftTypes.h and .cpp |
@majetideepak Would you please elaborate more on this? Thank you. |
3f392dd to
d89f8fe
Compare
|
@majetideepak I added |
|
@PingLiuPing We should not be modifying the |
majetideepak
left a comment
There was a problem hiding this comment.
@PingLiuPing How do other tools like DuckDB write iceberg files without this NaN statistics present in the parquet.thrift file?
The idea is to have a separate field in the |
@majetideepak Thanks. For DuckDB-iceberg, they do not write this stat yet. They also do not support other stats at the moment. They have a PR to collect some of the stats (not NaN) duckdb/duckdb-iceberg#640. |
Ok, I will try to see if this is possible.
The stats are not used inside velox but it should be collected and saved to iceberg manifest file. It is used by upstream engine when planning. |
I am reading this as the coordinator requires these stats for better planning. The NaN stats are not written to Parquet but to an Iceberg manifest file. We should aim to store these stats separately as they are not related to Parquet anyway. |
Yes, with the current paquet.thrift, NaN is not write to parquet file. Once https://github.com/apache/parquet-format/pull/514/files#top is merged and Velox parquet writer is upgraded NaN will be wrote to parquet file. My initial design is just collect the NaN stats and report it back to coordinator node, and do not write NaN to parquet file. |
These files will anyway go away with this PR #14942 |
Oh, thanks. Let me investigate if there are ways to do this without code change in ParquetThriftTypes.h and ParquetThriftTypes.cpp |
d89f8fe to
eecbff4
Compare
|
@majetideepak I re-factored the code and removed all the changes relate to |
|
@majetideepak Could you please have a look? Thank you. |
1 similar comment
|
@majetideepak Could you please have a look? Thank you. |
eecbff4 to
f876b2b
Compare
|
@rui-mo @mbasmanova Wondering if you could take a look at this PR when convenient. |
f876b2b to
e6bba87
Compare
e6bba87 to
34b4af9
Compare
mbasmanova
left a comment
There was a problem hiding this comment.
@PingLiuPing Thanks.
It is hard to read this code because coding style is inconsistent and so different from Velox. If you plan to continue to develop this code, please, consider updating Parquet writer wholesale to follow Velox coding style?
CC: @pedroerp
@mbasmanova Thank you so much for reviewing. That makes sense. I’ll open an issue to track this and draft a plan to the Parquet writer to better align with Velox coding style. |
@PingLiuPing Thank you. I expect this can be done using an LLM relatively quickly. |
|
@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this in D92527330. |
|
@xiaoxmeng merged this pull request in 8c1a8aa. |
Add NaN statistic to Parquet writer
This change introduces a NaN statistic in the Parquet writer. Unlike other statistics (e.g., null_count, distinct_count), the NaN statistic is not written to the Parquet file footer. It is only reported to the Parquet writer caller when needed, such as when writing Iceberg Parquet data files.