diff --git a/README.md b/README.md
index df0ac73a..ae7272fb 100644
--- a/README.md
+++ b/README.md
@@ -155,40 +155,11 @@ documented in [LogicalTypes.md][logical-types].
 [logical-types]: LogicalTypes.md
 
 ### Sort Order
-
 Parquet stores min/max statistics at several levels (such as Column Chunk,
-Column Index and Data Page). Comparison for values of a type obey the
-following rules:
-
-1. Each logical type has a specified comparison order. If a column is
-   annotated with an unknown logical type, statistics may not be used
-   for pruning data. The sort order for logical types is documented in
-   the [LogicalTypes.md][logical-types] page.
-2. For primitive types, the following rules apply:
-
-   * BOOLEAN - false, true
-   * INT32, INT64 - Signed comparison.
-   * FLOAT, DOUBLE - Signed comparison with special handling of NaNs and
-     signed zeros. The details are documented in the
-     [Thrift definition](src/main/thrift/parquet.thrift) in the
-     `ColumnOrder` union. They are summarized here but the Thrift definition
-     is considered authoritative:
-     * NaNs should not be written to min or max statistics fields.
-     * If the computed max value is zero (whether negative or positive),
-       `+0.0` should be written into the max statistics field.
-     * If the computed min value is zero (whether negative or positive),
-       `-0.0` should be written into the min statistics field.
-
-     For backwards compatibility when reading files:
-     * If the min is a NaN, it should be ignored.
-     * If the max is a NaN, it should be ignored.
-     * If the min is +0, the row group may contain -0 values as well.
-     * If the max is -0, the row group may contain +0 values as well.
-     * When looking for NaN values, min and max should be ignored.
-
-   * BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY - Lexicographic unsigned byte-wise
-     comparison.
-
+Column Index, and Data Page). These statistics are according to a sort order,
+which is defined for each column in the file footer. Parquet supports common
+sort orders for logical and primitive types. The details are documented in the
+[Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union.
 ## Nested Encoding
 
 To encode nested columns, Parquet uses the Dremel encoding with definition and
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index ff327170..59ec5f17 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -313,12 +313,12 @@ struct Statistics {
 
 /** Empty structs to use as logical type annotations */
 struct StringType {}  // allowed for BYTE_ARRAY, must be encoded with UTF-8
-struct UUIDType {}    // allowed for FIXED[16], must encoded raw UUID bytes
+struct UUIDType {}    // allowed for FIXED[16], must be encoded as raw UUID bytes
 struct MapType {}     // see LogicalTypes.md
 struct ListType {}    // see LogicalTypes.md
 struct EnumType {}    // allowed for BYTE_ARRAY, must be encoded with UTF-8
 struct DateType {}    // allowed for INT32
-struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes
+struct Float16Type {} // allowed for FIXED[2], must be encoded as raw FLOAT16 bytes (see LogicalTypes.md)
 
 /**
  * Logical type to annotate a column that is always null.
@@ -1057,6 +1057,7 @@ union ColumnOrder {
    *   UINT64 - unsigned comparison
    *   DECIMAL - signed comparison of the represented value
    *   DATE - signed comparison
+   *   FLOAT16 - signed comparison of the represented value (*)
    *   TIME_MILLIS - signed comparison
    *   TIME_MICROS - signed comparison
    *   TIMESTAMP_MILLIS - signed comparison
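
Note for readers of this diff: the README hunk removes the summary of the FLOAT/DOUBLE writer rules and defers to the `ColumnOrder` comments in the Thrift definition, which remain authoritative. Those summarized rules can be sketched in Python as below. The function name `float_min_max_stats` is hypothetical and not part of any Parquet library; this is a minimal illustration of the removed summary, not a reference implementation.

```python
import math
from typing import Iterable, Optional, Tuple


def float_min_max_stats(values: Iterable[float]) -> Optional[Tuple[float, float]]:
    """Hypothetical helper: (min, max) statistics for FLOAT/DOUBLE values.

    Applies the writer rules summarized in the README text removed above:
    - NaNs are never written to min or max.
    - A computed min of zero (either sign) is written as -0.0.
    - A computed max of zero (either sign) is written as +0.0.
    Returns None when there is no non-NaN value to summarize.
    """
    non_nan = [v for v in values if not math.isnan(v)]
    if not non_nan:
        return None

    lo, hi = min(non_nan), max(non_nan)
    # `== 0.0` is true for both -0.0 and +0.0, so both signs are normalized.
    if lo == 0.0:
        lo = -0.0
    if hi == 0.0:
        hi = 0.0
    return lo, hi


if __name__ == "__main__":
    stats = float_min_max_stats([0.0, float("nan"), -0.0])
    # math.copysign exposes the sign bit that `==` cannot distinguish.
    print(stats, math.copysign(1.0, stats[0]), math.copysign(1.0, stats[1]))
```

Normalizing the zeros this way means a reader comparing by sign-aware order never excludes a row group that holds the other zero, which is the concern behind the backwards-compatibility reading rules in the removed text.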