PARQUET-2249: Add nan_count to handle NaNs in statistics #196
base: master
Changes from 1 commit
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -163,18 +163,25 @@ following rules: | |
| [Thrift definition](src/main/thrift/parquet.thrift) in the | ||
| `ColumnOrder` union. They are summarized here but the Thrift definition | ||
| is considered authoritative: | ||
| * NaNs should not be written to min or max statistics fields. | ||
| * If the computed max value is zero (whether negative or positive), | ||
| `+0.0` should be written into the max statistics field. | ||
| * If the computed min value is zero (whether negative or positive), | ||
| `-0.0` should be written into the min statistics field. | ||
|
|
||
| For backwards compatibility when reading files: | ||
| * If the min is a NaN, it should be ignored. | ||
| * If the max is a NaN, it should be ignored. | ||
| * If the min is +0, the row group may contain -0 values as well. | ||
| * If the max is -0, the row group may contain +0 values as well. | ||
| * When looking for NaN values, min and max should be ignored. | ||
| * The following compatibility rules should be applied when reading statistics: | ||
|
JFinis marked this conversation as resolved.
Outdated
|
||
| * If the nan_count field is set to > 0 and both min and max are | ||
|
Member
This seems a little strict here. Wouldn't just ignoring min and max be OK?
Contributor
Author
@mapleFU To your general comment (I can't answer there)
The problem is that the ColumnIndex does not have the So I see we have the following options:
Which one would you suggest here?
Contributor
Author
To this suggestion:
Note that the line you mentioned here just tells a reader that they can rely on this information, and therefore could, e.g., skip this page if a predicate like I guess this is related to your general suggestion: How do we detect only-NaN pages? Depending on what we do for that, this line will be adapted accordingly.
Contributor
Author
TBH: I would actually love to have a I just didn't want to suggest adding another list to each column index because of the added space cost. However, given that these indexes are negligibly small in comparison to the data, I think no one would actually mind that extra space. If the consensus is that this is preferable, I'm happy to adapt the commit accordingly.
Member
I get it. I think using both min and max is backward-compatible and can represent "all-data-is-nan". https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L944 Can we import a status like that?
Contributor
Author
@mapleFU Yes, we could also add a My gut feeling is that one day having a But yes, we could drop the testing of Note though that if we then also drop that we write
Member
Yes, maybe you are right. My point is that if we write nan_count or even the record count, the program would work well. However, non-floating-point pages would have some size overhead. Personally, I'd like to use |
||
| NaN, a reader can rely on that all non-NULL values are NaN | ||
| * Otherwise, if the min or the max is a NaN, it should be ignored. | ||
| * When looking for NaN values, min and max should be ignored; | ||
| if the nan_count field is set, it should be used to check whether | ||
| NaNs are present. | ||
| * If the min is +0, the row group may contain -0 values as well. | ||
| * If the max is -0, the row group may contain +0 values as well. | ||
| * When writing statistics the following rules should be followed: | ||
| * The nan_count fields should always be set for FLOAT and DOUBLE columns. | ||
| * NaNs should not be written to min or max statistics fields except | ||
| when all non-NULL values are NaN, in which case min and max should | ||
| both be written as NaN. If the nan_count field is set, this semantics | ||
| is mandated and readers may rely on it. | ||
| * If the computed max value is zero (whether negative or positive), | ||
| `+0.0` should be written into the max statistics field. | ||
| * If the computed min value is zero (whether negative or positive), | ||
| `-0.0` should be written into the min statistics field. | ||
|
|
||
| * BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY - Lexicographic unsigned byte-wise | ||
| comparison. | ||
|
|
||
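For illustration, the writer-side rules proposed in this hunk could be sketched as follows. This is a minimal sketch and not code from the PR or from any Parquet implementation: `FloatStats` and `compute_float_stats` are hypothetical names, the values are assumed to be already decoded Python floats with `None` standing in for NULL, and the real `Statistics` struct stores min and max as binary.

```python
import math
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class FloatStats:
    """Hypothetical in-memory stand-in for the Thrift Statistics struct."""
    min_value: Optional[float]
    max_value: Optional[float]
    null_count: int
    nan_count: int


def compute_float_stats(values: Iterable[Optional[float]]) -> FloatStats:
    """Compute min/max/null_count/nan_count following the proposed writer rules."""
    null_count = 0
    nan_count = 0
    lo: Optional[float] = None
    hi: Optional[float] = None
    for v in values:
        if v is None:
            null_count += 1
        elif math.isnan(v):
            # NaNs are counted but kept out of min/max.
            nan_count += 1
        else:
            lo = v if lo is None or v < lo else lo
            hi = v if hi is None or v > hi else hi

    if lo is None:
        # No regular values were seen. If there was at least one NaN, all
        # non-NULL values are NaN, so min and max are both written as NaN.
        if nan_count > 0:
            lo = hi = math.nan
    else:
        # A zero min is written as -0.0 and a zero max as +0.0, whatever
        # the sign of the zeros that were actually observed.
        if lo == 0.0:
            lo = -0.0
        if hi == 0.0:
            hi = 0.0
    return FloatStats(lo, hi, null_count, nan_count)
```

With this sketch, a chunk holding only NaNs and NULLs yields NaN for both bounds together with a positive `nan_count`, which is the combination the reader-side rule relies on.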
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -211,7 +211,7 @@ struct Statistics { | |
| */ | ||
| 1: optional binary max; | ||
| 2: optional binary min; | ||
| /** count of null value in the column */ | ||
| /** count of null values in the column */ | ||
| 3: optional i64 null_count; | ||
| /** count of distinct values occurring */ | ||
| 4: optional i64 distinct_count; | ||
|
|
@@ -223,6 +223,8 @@ struct Statistics { | |
| */ | ||
| 5: optional binary max_value; | ||
| 6: optional binary min_value; | ||
| /** count of NaN values in the column; only present if type is FLOAT or DOUBLE */ | ||
|
JFinis marked this conversation as resolved.
Outdated
|
||
| 7: optional i64 nan_count; | ||
| } | ||
|
|
||
| /** Empty structs to use as logical type annotations */ | ||
|
|
@@ -886,16 +888,23 @@ union ColumnOrder { | |
| * FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison | ||
| * | ||
| * (*) Because the sorting order is not specified properly for floating | ||
| * point values (relations vs. total ordering) the following | ||
| * compatibility rules should be applied when reading statistics: | ||
| * - If the min is a NaN, it should be ignored. | ||
| * - If the max is a NaN, it should be ignored. | ||
| * point values (relations vs. total ordering), the following compatibility | ||
| * rules should be applied when reading statistics: | ||
| * - If the nan_count field is set to > 0 and both min and max are | ||
| * NaN, a reader can rely on that all non-NULL values are NaN | ||
| * - Otherwise, if the min or the max is a NaN, it should be ignored. | ||
| * - When looking for NaN values, min and max should be ignored; | ||
| * if the nan_count field is set, it can be used to check whether | ||
| * NaNs are present. | ||
| * - If the min is +0, the row group may contain -0 values as well. | ||
| * - If the max is -0, the row group may contain +0 values as well. | ||
| * - When looking for NaN values, min and max should be ignored. | ||
| * | ||
| * When writing statistics the following rules should be followed: | ||
| * - NaNs should not be written to min or max statistics fields. | ||
| * - The nan_count fields should always be set for FLOAT and DOUBLE columns. | ||
|
JFinis marked this conversation as resolved.
Outdated
|
||
| * - NaNs should not be written to min or max statistics fields except | ||
|
Member
I would expect to explicitly state that
Contributor
Author
I'll update this with my next revision once we have decided on this issue. |
||
| * when all non-NULL values are NaN, in which case min and max should | ||
| * both be written as NaN. If the nan_count field is set, this semantics | ||
| * is mandated and readers may rely on it. | ||
| * - If the computed max value is zero (whether negative or positive), | ||
| * `+0.0` should be written into the max statistics field. | ||
| * - If the computed min value is zero (whether negative or positive), | ||
|
|
@@ -952,6 +961,9 @@ struct ColumnIndex { | |
| * Such more compact values must still be valid values within the column's | ||
| * logical type. Readers must make sure that list entries are populated before | ||
| * using them by inspecting null_pages. | ||
| * For columns of type FLOAT and DOUBLE, NaN values are not to be included | ||
|
wgtmac marked this conversation as resolved.
|
||
| * in these bounds unless all non-null values in a page are NaN, in which | ||
| * case min and max are to be set to NaN. | ||
| */ | ||
| 2: required list<binary> min_values | ||
| 3: required list<binary> max_values | ||
|
|
@@ -966,6 +978,10 @@ struct ColumnIndex { | |
|
|
||
| /** A list containing the number of null values for each page **/ | ||
| 5: optional list<i64> null_counts | ||
|
|
||
| /** A list containing the number of NaN values for each page. Only present | ||
| * for columns of type FLOAT and DOUBLE. **/ | ||
| 6: optional list<i64> nan_counts | ||
| } | ||
|
|
||
| struct AesGcmV1 { | ||
|
|
||
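To make the reader-side consequences of the `ColumnIndex` changes concrete, here is a hedged sketch of page pruning using the proposed `nan_counts` list. It is not taken from the PR: the function names are hypothetical, the bounds are assumed to be already decoded into Python floats (the Thrift lists hold binary), and the predicate `x > threshold` is assumed to follow IEEE 754 semantics, where NaN never satisfies a comparison.

```python
import math
from typing import List, Optional


def pages_to_read_gt(
    min_values: List[float],
    max_values: List[float],
    null_pages: List[bool],
    nan_counts: Optional[List[int]],
    threshold: float,
) -> List[int]:
    """Return indexes of pages that may hold a value x with x > threshold."""
    keep = []
    for i, (lo, hi, all_null) in enumerate(zip(min_values, max_values, null_pages)):
        if all_null:
            continue  # only NULLs: nothing can match
        nan_count = nan_counts[i] if nan_counts is not None else None
        both_nan = math.isnan(lo) and math.isnan(hi)
        if both_nan and nan_count is not None and nan_count > 0:
            continue  # all non-NULL values are NaN: nothing can match
        if math.isnan(hi):
            keep.append(i)  # NaN bound without that guarantee: ignore it, cannot prune
        elif hi > threshold:
            keep.append(i)
    return keep


def pages_may_contain_nan(
    null_pages: List[bool], nan_counts: Optional[List[int]]
) -> List[int]:
    """Return indexes of pages that may hold NaN; min and max are never consulted."""
    result = []
    for i, all_null in enumerate(null_pages):
        if all_null:
            continue
        if nan_counts is None or nan_counts[i] > 0:
            result.append(i)  # unknown (legacy file) or known to contain NaNs
    return result
```

When `nan_counts` is absent, as in files from older writers, both functions fall back to the conservative pre-existing behavior: NaN bounds are ignored and no page is ruled out when looking for NaNs.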