feat: Add iceberg data file statistics#17388

Closed
mohsaka wants to merge 4 commits into facebookincubator:main from mohsaka:add-stats

Collaborator

@mohsaka mohsaka commented Apr 30, 2026

Adds Iceberg data file statistics collection (column sizes, value counts, null/NaN counts, min/max bounds) to the Parquet writer path. Statistics are aggregated from Parquet row group metadata after each file is closed and included in the Iceberg commit message.
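To illustrate the aggregation step described above, here is a minimal sketch of folding per-row-group statistics for a single BIGINT column into file-level statistics. The struct names, field names, and the `aggregateRowGroups` function are hypothetical and are not the types used in this PR. It also assumes that value counts include nulls and that column size is the sum of the on-disk sizes of the column chunks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical per-row-group statistics for one BIGINT column.
struct RowGroupColumnStats {
  int64_t compressedSize{0};
  int64_t valueCount{0}; // Assumed to include nulls.
  int64_t nullCount{0};
  std::optional<int64_t> min; // Absent when the row group has no non-null values.
  std::optional<int64_t> max;
};

// Hypothetical file-level statistics for the same column.
struct FileColumnStats {
  int64_t columnSize{0};
  int64_t valueCount{0};
  int64_t nullCount{0};
  std::optional<int64_t> lowerBound;
  std::optional<int64_t> upperBound;
};

// Sums sizes and counts, and keeps the smallest min and largest max seen.
FileColumnStats aggregateRowGroups(
    const std::vector<RowGroupColumnStats>& rowGroups) {
  FileColumnStats file;
  for (const auto& rowGroup : rowGroups) {
    assert(rowGroup.nullCount <= rowGroup.valueCount);
    file.columnSize += rowGroup.compressedSize;
    file.valueCount += rowGroup.valueCount;
    file.nullCount += rowGroup.nullCount;
    if (rowGroup.min.has_value()) {
      file.lowerBound = file.lowerBound.has_value()
          ? std::min(*file.lowerBound, *rowGroup.min)
          : *rowGroup.min;
    }
    if (rowGroup.max.has_value()) {
      file.upperBound = file.upperBound.has_value()
          ? std::max(*file.upperBound, *rowGroup.max)
          : *rowGroup.max;
    }
  }
  return file;
}
```

A row group that contains only nulls contributes to the counts and size but leaves the bounds untouched, which is why the bounds are optional in this sketch.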

Re-lands #16062 and #16867, which were reverted in #16999 due to roundUpUtf8 entering an infinite loop on non-UTF8 varbinary data. The fix adds roundUpBinary for computing upper bounds on raw binary data (follows Iceberg's BinaryUtil.truncateBinaryMax()). Statistics.cpp now dispatches to roundUpUtf8 for STRING or roundUpBinary for BINARY based on the Parquet logical type.

Additional fixes:

  • Guard Parquet stats collector initialization with a format check (initialize only for Parquet tables).
  • Use dynamic_pointer_cast instead of checkedPointerCast for writer options.
  • Use toManifestFormatString() instead of hardcoded "PARQUET".
  • Emit empty stats for non-Parquet formats.
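The first two fixes can be pictured with the sketch below. It assumes a checked cast fails hard on a type mismatch, whereas `std::dynamic_pointer_cast` returns a null pointer that the caller can test. The option classes, the `statsCollectorEnabled` field, and `maybeEnableParquetStats` are hypothetical stand-ins, not the Velox definitions.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-ins for the writer option hierarchy.
struct WriterOptions {
  virtual ~WriterOptions() = default;
};

struct ParquetWriterOptions : WriterOptions {
  bool statsCollectorEnabled{false};
};

struct DwrfWriterOptions : WriterOptions {};

// Enables the stats collector only when the options are Parquet options.
// Returns true if the collector was enabled, false for any other format.
bool maybeEnableParquetStats(const std::shared_ptr<WriterOptions>& options) {
  assert(options != nullptr);
  // dynamic_pointer_cast yields nullptr on a type mismatch instead of failing.
  auto parquetOptions = std::dynamic_pointer_cast<ParquetWriterOptions>(options);
  if (parquetOptions == nullptr) {
    return false;
  }
  parquetOptions->statsCollectorEnabled = true;
  return true;
}
```

With this shape, a non-Parquet table simply skips collector setup, which matches the intent of emitting empty stats for other formats.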

Added a roundUpBinary function based on Iceberg's truncateBinaryMax, taken from:
https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/util/BinaryUtil.java
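For readers unfamiliar with the Iceberg routine, the following is a rough sketch of the idea as understood from `truncateBinaryMax`: truncate to the requested length, then increment the last byte that is not 0xFF and drop everything after it. The name `roundUpBinarySketch` and its signature are assumptions for illustration and may differ from the function added in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical sketch: computes an upper bound of at most `length` bytes that
// compares greater than or equal to `input` under unsigned byte-wise ordering.
// Returns std::nullopt when no such bound exists (all kept bytes are 0xFF).
std::optional<std::string> roundUpBinarySketch(
    std::string_view input,
    size_t length) {
  assert(length > 0);
  if (length >= input.size()) {
    // Nothing is truncated, so the input is its own upper bound.
    return std::string(input);
  }
  std::string truncated(input.substr(0, length));
  // Walk backwards looking for a byte that can be incremented without overflow.
  for (size_t i = length; i-- > 0;) {
    const auto byte = static_cast<unsigned char>(truncated[i]);
    if (byte != 0xFF) {
      truncated[i] = static_cast<char>(byte + 1);
      truncated.resize(i + 1);
      return truncated;
    }
  }
  return std::nullopt;
}
```

Because every byte is treated as an unsigned integer, the loop visits each position at most once and terminates regardless of whether the data is valid UTF-8.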

Prestissimo CI/CD:
prestodb/presto#27702

Some additional comments from the original backed out PRs:
The writer's virtual close function now returns an optional std::unique_ptr<FileMetadata>. At the time of the revert, close already returned FileMetadata (see #16801), but the revert switched it back to void. This PR switches it to FileMetadata once again.

netlify Bot commented Apr 30, 2026

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit b39bf81
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/6a068803993e810008bd4048

@meta-cla meta-cla Bot added the CLA Signed label Apr 30, 2026

github-actions Bot commented Apr 30, 2026

Build Impact Analysis

Selective Build Targets (building these covers all 511 affected)

cmake --build _build/release --target aggregate_companion_functions_test bias_vector_test cached_factory_test copy_benchmark file_utils_test physical_size_aggregator_test presto_sql_test simple_lru_cache_test simple_vector_test spark_aggregation_fuzzer_test spark_expression_fuzzer_test velox_abfs_test velox_aggregates_GeometryAggregateTest velox_aggregates_reduce_agg_bm velox_aggregates_simple_aggregates_bm velox_aggregates_string_keys_bm velox_aggregates_test_group0 velox_aggregates_test_group1 velox_aggregates_test_group2 velox_aggregates_test_group3 velox_aggregates_test_group4 velox_aggregation_fuzzer_test velox_aggregation_runner_test velox_arrow_bridge_test velox_base_test velox_benchmark_array_writer_no_nulls velox_benchmark_array_writer_with_nulls velox_benchmark_basic_comparison_conjunct velox_benchmark_basic_decoded_vector velox_benchmark_basic_preproc velox_benchmark_basic_selectivity_vector velox_benchmark_basic_simple_arithmetic velox_benchmark_basic_simple_cast velox_benchmark_basic_vector_compare velox_benchmark_basic_vector_fuzzer velox_benchmark_basic_vector_slice velox_benchmark_estimate_flat_size velox_benchmark_expr_flat_no_nulls velox_benchmark_feature_normalization velox_benchmark_map_writer_no_nulls velox_benchmark_map_writer_with_nulls velox_benchmark_nested_array_writer_no_nulls velox_benchmark_nested_array_writer_with_nulls velox_buffer_test velox_cache_fuzzer velox_cache_test_group0 velox_cast_benchmark velox_common_base_benchmarks velox_common_compression_test velox_common_encode_test velox_common_geospatial_serde_test velox_common_hyperloglog_dense_hll_bm velox_common_hyperloglog_test velox_common_indexed_priority_queue_benchmark velox_common_sorting_network_benchmark velox_common_split_block_bloom_filter_benchmark velox_common_stringsearch_benchmarks velox_common_test velox_concurrent_allocation_benchmark velox_config_test velox_connector_test velox_constrained_input_generators_test velox_constrained_vector_generator_test 
velox_core_plan_consistency_checker_test velox_core_test velox_date_conversion_benchmark velox_date_extract_benchmark velox_demo_rpc_function_test velox_driver_test velox_duckdb_conversion_test velox_dwio_arrow_parquet_writer_test velox_dwio_cache_test velox_dwio_common_bitpack_decoder_benchmark velox_dwio_common_data_buffer_benchmark velox_dwio_common_int_decoder_benchmark velox_dwio_common_test velox_dwio_dwrf_buffered_output_stream_test velox_dwio_dwrf_byte_rle_encoder_test velox_dwio_dwrf_byte_rle_test velox_dwio_dwrf_checksum_test velox_dwio_dwrf_column_reader_test velox_dwio_dwrf_column_statistics_test velox_dwio_dwrf_compression_test velox_dwio_dwrf_config_test velox_dwio_dwrf_data_buffer_holder_test velox_dwio_dwrf_decompression_test velox_dwio_dwrf_decryption_test velox_dwio_dwrf_dictionary_encoder_test velox_dwio_dwrf_dictionary_encoding_utils_test velox_dwio_dwrf_encoding_selector_test velox_dwio_dwrf_encryption_test velox_dwio_dwrf_flush_policy_test velox_dwio_dwrf_index_builder_test velox_dwio_dwrf_int_direct_test velox_dwio_dwrf_int_encoder_test velox_dwio_dwrf_layout_planner_test velox_dwio_dwrf_ratio_checker_test velox_dwio_dwrf_reader_base_test velox_dwio_dwrf_reader_test velox_dwio_dwrf_rle_test velox_dwio_dwrf_rlev1_encoder_test velox_dwio_dwrf_stream_labels_test velox_dwio_dwrf_stripe_dictionary_cache_test velox_dwio_dwrf_stripe_reader_base_test velox_dwio_dwrf_stripe_stream_test velox_dwio_dwrf_utils_test velox_dwio_dwrf_writer_context_test velox_dwio_dwrf_writer_encoding_manager_test velox_dwio_dwrf_writer_sink_test velox_dwio_dwrf_writer_test velox_dwio_iceberg_reader_benchmark velox_dwio_orc_column_statistics_test velox_dwio_orc_reader_filter_test velox_dwio_orc_reader_test velox_dwio_parquet_common_test velox_dwio_parquet_page_reader_test velox_dwio_parquet_reader_benchmark velox_dwio_parquet_reader_test velox_dwio_parquet_rlebp_decoder_test velox_dwio_parquet_structure_decoder_benchmark velox_dwio_parquet_structure_decoder_test 
velox_dwio_parquet_table_scan_test velox_dwio_parquet_thrift_test velox_dwio_parquet_tpch_test velox_dwrf_column_writer_index_test velox_dwrf_column_writer_stats_test velox_dwrf_column_writer_test velox_dwrf_e2e_filter_test velox_dwrf_e2e_reader_test velox_dwrf_e2e_writer_test velox_dwrf_float_column_writer_benchmark velox_dwrf_int_encoder_benchmark velox_dwrf_statistics_builder_utils_test velox_dwrf_writer_extended_test velox_dwrf_writer_flush_test velox_example_expression_eval velox_example_opaque_type velox_example_operator_extensibility velox_example_scan_orc velox_example_simple_functions velox_example_vector_reader_writer velox_exchange_benchmark velox_exchange_fuzzer velox_exec_SpatialJoinTest velox_exec_bm_duplicate_project velox_exec_infra_test velox_exec_prefixsort_test velox_exec_test_group0 velox_exec_test_group1 velox_exec_test_group2 velox_exec_test_group3 velox_exec_test_group4 velox_exec_test_group5 velox_exec_test_group6 velox_exec_test_group7 velox_exec_util_test_group0 velox_exec_vector_hasher_benchmark velox_expression_fuzzer_test velox_expression_fuzzer_unit_test velox_expression_runner_test velox_expression_runner_unit_test velox_expression_test velox_expression_verifier_unit_test velox_file_test velox_filemetadata_test velox_filter_benchmark velox_filter_project_benchmark velox_format_datetime_benchmark velox_fragmentation_benchmark velox_function_dynamic_link_test velox_function_registry_test velox_functions_aggregates_test velox_functions_benchmarks_compare velox_functions_benchmarks_row_writer_no_nulls velox_functions_benchmarks_simdjson_function_with_expr velox_functions_benchmarks_string_writer_no_nulls velox_functions_benchmarks_url velox_functions_iceberg_test velox_functions_json_test velox_functions_lib_test velox_functions_prestosql_benchmarks_array_contains velox_functions_prestosql_benchmarks_array_min_max velox_functions_prestosql_benchmarks_array_position velox_functions_prestosql_benchmarks_array_sum 
velox_functions_prestosql_benchmarks_bitwise velox_functions_prestosql_benchmarks_cardinality velox_functions_prestosql_benchmarks_comparisons velox_functions_prestosql_benchmarks_concat velox_functions_prestosql_benchmarks_date_time velox_functions_prestosql_benchmarks_field_reference velox_functions_prestosql_benchmarks_generic velox_functions_prestosql_benchmarks_in velox_functions_prestosql_benchmarks_map_concat velox_functions_prestosql_benchmarks_map_except velox_functions_prestosql_benchmarks_map_input velox_functions_prestosql_benchmarks_map_intersect velox_functions_prestosql_benchmarks_map_subscript velox_functions_prestosql_benchmarks_map_zip_with velox_functions_prestosql_benchmarks_not velox_functions_prestosql_benchmarks_regexp_replace velox_functions_prestosql_benchmarks_row velox_functions_prestosql_benchmarks_string_ascii_utf_functions velox_functions_prestosql_benchmarks_uuid_cast velox_functions_prestosql_benchmarks_width_bucket velox_functions_prestosql_benchmarks_zip velox_functions_prestosql_benchmarks_zip_with velox_functions_sfm_test velox_functions_spark_aggregates_test velox_functions_spark_test velox_functions_string_test velox_functions_test velox_fuzzer_connector_test velox_gcs_file_test velox_gcs_insert_test velox_gcs_multiendpoints_test velox_gcsfile_example velox_hash_benchmark velox_hash_join_build_benchmark velox_hash_join_list_result_benchmark velox_hash_join_prepare_join_table_benchmark velox_hdfs_file_test velox_hdfs_insert_test velox_hierarchical_timer_test velox_hive_connector_test velox_hive_iceberg_deletion_vector_test velox_hive_iceberg_deletion_vector_writer_test velox_hive_iceberg_dwrf_insert_test velox_hive_iceberg_equality_delete_test velox_hive_iceberg_insert_test velox_hive_iceberg_test velox_hive_paimon_connector velox_hive_paimon_data_file_meta_test velox_hive_paimon_deletion_file_test velox_hive_paimon_row_kind_test velox_hive_paimon_split_test velox_hive_partition_function_benchmark 
velox_hive_writer_options_adapter_test velox_id_map_test velox_in_10_min_demo velox_join_fuzzer velox_key_encoder_test velox_like_benchmark velox_like_tpch_benchmark velox_mark_distinct_fuzzer velox_mark_sorted_benchmark velox_memcpy_meter velox_memory_arbitration_fuzzer velox_memory_test velox_merge_benchmark velox_mock_rpc_client_test velox_negated_bigint_range_benchmark velox_negated_bytes_range_benchmark velox_negated_bytes_values_benchmark velox_negated_values_filter_benchmark velox_numeric_upcast_benchmark velox_orderby_benchmark velox_parquet_e2e_filter_test velox_parquet_writer_sink_test velox_parquet_writer_test velox_parse_test velox_prefix_sort_algorithm_benchmark velox_prefixsort_benchmark velox_presto_type_parser_test velox_presto_types_fuzzer_utils_test velox_presto_types_test velox_prestosql_coverage velox_process_test velox_query_config_provider_test velox_query_replayer velox_re2_functions_benchmarks velox_read_benchmark velox_row_number_fuzzer velox_row_serializer_benchmark velox_row_test velox_rpc_node_test velox_rpc_operator_test velox_rpc_state_test velox_s3config_test velox_s3file_test velox_s3finalize_test velox_s3insert_test velox_s3metrics_test velox_s3multiendpoints_test velox_s3read_test velox_s3registration_test velox_scoped_registry_test velox_serialization_test velox_serializer_benchmark velox_serializer_test_group0 velox_signature_parser_test velox_simple_aggregate_test velox_sort_benchmark velox_spark_function_registry_test velox_spark_query_runner_test velox_spark_windows_test velox_sparksql_benchmarks_cast velox_sparksql_benchmarks_compare velox_sparksql_benchmarks_from_json velox_sparksql_benchmarks_get_funcs velox_sparksql_benchmarks_hash velox_sparksql_benchmarks_in velox_sparksql_benchmarks_simd_compare velox_sparksql_benchmarks_split velox_sparksql_coverage velox_spatial_join_benchmark velox_spatial_join_fuzzer velox_spiller_aggregate_benchmark velox_spiller_join_benchmark velox_streaming_aggregation_benchmark 
velox_string_core_benchmark velox_string_view_benchmark velox_table_evolution_fuzzer_test velox_test_util_test velox_text_reader_test velox_text_writer_test velox_time_test velox_tool_trace_test velox_topn_row_number_fuzzer velox_tpcds_benchmark velox_tpcds_connector_test velox_tpcds_gen_test velox_tpch_benchmark velox_tpch_connector_test velox_tpch_gen_test velox_tpch_speed_test velox_trace_file_tool velox_type_fbhive_test velox_type_serializer_fbhive_test velox_type_test velox_type_tz_ext_invalid_test velox_type_tz_test velox_unsafe_row_serialize_benchmark velox_vector_fuzzer_test velox_vector_hash_all_benchmark velox_vector_map_update_benchmark velox_vector_selectivity_vector_benchmark velox_vector_test velox_wave_benchmark velox_wave_common_test velox_wave_decode_test velox_wave_exec_test velox_window_fuzzer_test velox_window_prefixsort_benchmark velox_window_sub_partitioned_sort_benchmark velox_windows_agg_test velox_windows_rank_test velox_windows_value_test velox_writer_fuzzer_test

Total affected: 511/575 targets

Warning: 4 file(s) could not be mapped to any target. A full build may be needed.

  • velox/connectors/hive/iceberg/CMakeLists.txt
  • velox/connectors/hive/iceberg/tests/CMakeLists.txt
  • velox/dwio/parquet/writer/arrow/CMakeLists.txt
  • velox/dwio/parquet/writer/arrow/tests/CMakeLists.txt
Affected targets (511)

Directly changed (18)

| Target | Changed Files |
| --- | --- |
| velox_dwio_arrow_parquet_writer_lib | Statistics.cpp, StringImpl.h, StringTruncation.cpp, StringTruncation.h |
| velox_dwio_arrow_parquet_writer_test | StatisticsTest.cpp, StringTruncation.h, StringTruncationTest.cpp |
| velox_dwio_dwrf_reader | StringImpl.h |
| velox_dwio_native_parquet_reader | StringImpl.h |
| velox_expression | StringImpl.h |
| velox_functions_iceberg | StringImpl.h |
| velox_functions_lib | StringImpl.h |
| velox_functions_prestosql | StringImpl.h |
| velox_functions_prestosql_impl | StringImpl.h |
| velox_functions_spark | StringImpl.h |
| velox_functions_spark_impl | StringImpl.h |
| velox_functions_spark_specialforms | StringImpl.h |
| velox_functions_string_test | StringImpl.h, StringImplTest.cpp |
| velox_functions_test | StringImpl.h |
| velox_hive_iceberg_dwrf_insert_test | IcebergDataFileStatistics.h, IcebergDataSink.h, IcebergParquetStatsCollector.h |
| velox_hive_iceberg_insert_test | IcebergDataFileStatistics.h, IcebergDataSink.h, IcebergParquetStatsCollector.h, IcebergParquetStatsTest.cpp |
| velox_hive_iceberg_splitreader | IcebergDataFileStatistics.cpp, IcebergDataFileStatistics.h, IcebergDataSink.cpp, IcebergDataSink.h, IcebergParquetStatsCollector.cpp, ... (+1 more) |
| velox_type | StringImpl.h |

Transitively affected (493)

  • aggregate_companion_functions_test
  • bias_vector_test
  • cached_factory_test
  • copy_benchmark
  • file_utils_test
  • physical_size_aggregator_test
  • presto_sql_test
  • simple_lru_cache_test
  • simple_vector_test
  • spark_aggregation_fuzzer_test
  • spark_expression_fuzzer_test
  • velox_abfs
  • velox_abfs_test
  • velox_aggregates
  • velox_aggregates_GeometryAggregateTest
  • velox_aggregates_reduce_agg_bm
  • velox_aggregates_simple_aggregates_bm
  • velox_aggregates_string_keys_bm
  • velox_aggregates_test_group0
  • velox_aggregates_test_group1
  • velox_aggregates_test_group2
  • velox_aggregates_test_group3
  • velox_aggregates_test_group4
  • velox_aggregation_fuzzer
  • velox_aggregation_fuzzer_base
  • velox_aggregation_fuzzer_test
  • velox_aggregation_result_verifier
  • velox_aggregation_runner_test
  • velox_arrow_bridge
  • velox_arrow_bridge_test
  • velox_async_rpc_function_registry
  • velox_base_test
  • velox_benchmark_array_writer_no_nulls
  • velox_benchmark_array_writer_with_nulls
  • velox_benchmark_basic_comparison_conjunct
  • velox_benchmark_basic_decoded_vector
  • velox_benchmark_basic_preproc
  • velox_benchmark_basic_selectivity_vector
  • velox_benchmark_basic_simple_arithmetic
  • velox_benchmark_basic_simple_cast
  • velox_benchmark_basic_vector_compare
  • velox_benchmark_basic_vector_fuzzer
  • velox_benchmark_basic_vector_slice
  • velox_benchmark_builder
  • velox_benchmark_estimate_flat_size
  • velox_benchmark_expr_flat_no_nulls
  • velox_benchmark_feature_normalization
  • velox_benchmark_map_writer_no_nulls
  • velox_benchmark_map_writer_with_nulls
  • velox_benchmark_nested_array_writer_no_nulls
  • velox_benchmark_nested_array_writer_with_nulls
  • velox_buffer
  • velox_buffer_test
  • velox_cache_fuzzer
  • velox_cache_fuzzer_lib
  • velox_cache_test_group0
  • velox_caching
  • velox_cast_benchmark
  • velox_common_base
  • velox_common_base_benchmarks
  • velox_common_compression
  • velox_common_compression_test
  • velox_common_config
  • velox_common_encode_test
  • velox_common_fuzzer_util
  • velox_common_geospatial_serde
  • velox_common_geospatial_serde_test
  • velox_common_hyperloglog
  • velox_common_hyperloglog_dense_hll_bm
  • velox_common_hyperloglog_test
  • velox_common_indexed_priority_queue_benchmark
  • velox_common_sorting_network_benchmark
  • velox_common_split_block_bloom_filter_benchmark
  • velox_common_stringsearch_benchmarks
  • velox_common_test
  • velox_concurrent_allocation_benchmark
  • velox_config_property
  • velox_config_test
  • velox_connector
  • velox_connector_registry
  • velox_connector_test
  • velox_constrained_input_generators
  • velox_constrained_input_generators_test
  • velox_constrained_vector_generator
  • velox_constrained_vector_generator_test
  • velox_core
  • velox_core_plan_consistency_checker_test
  • velox_core_test
  • velox_coverage_util
  • velox_cursor
  • velox_date_conversion_benchmark
  • velox_date_extract_benchmark
  • velox_demo_rpc_function
  • velox_demo_rpc_function_test
  • velox_driver_test
  • velox_duckdb_conversion
  • velox_duckdb_conversion_test
  • velox_duckdb_parser
  • velox_dwio_arrow_parquet_writer
  • velox_dwio_arrow_parquet_writer_test_lib
  • velox_dwio_arrow_parquet_writer_util_lib
  • velox_dwio_cache_test
  • velox_dwio_catalog_fbhive
  • velox_dwio_common
  • velox_dwio_common_bitpack_decoder_benchmark
  • velox_dwio_common_compression
  • velox_dwio_common_data_buffer_benchmark
  • velox_dwio_common_exception
  • velox_dwio_common_int_decoder_benchmark
  • velox_dwio_common_test
  • velox_dwio_common_test_utils
  • velox_dwio_dwrf_buffered_output_stream_test
  • velox_dwio_dwrf_byte_rle_encoder_test
  • velox_dwio_dwrf_byte_rle_test
  • velox_dwio_dwrf_checksum_test
  • velox_dwio_dwrf_column_reader_test
  • velox_dwio_dwrf_column_statistics_test
  • velox_dwio_dwrf_common
  • velox_dwio_dwrf_compression_test
  • velox_dwio_dwrf_config_test
  • velox_dwio_dwrf_data_buffer_holder_test
  • velox_dwio_dwrf_decompression_test
  • velox_dwio_dwrf_decryption_test
  • velox_dwio_dwrf_dictionary_encoder_test
  • velox_dwio_dwrf_dictionary_encoding_utils_test
  • velox_dwio_dwrf_encoding_selector_test
  • velox_dwio_dwrf_encryption_test
  • velox_dwio_dwrf_flush_policy_test
  • velox_dwio_dwrf_index_builder_test
  • velox_dwio_dwrf_int_direct_test
  • velox_dwio_dwrf_int_encoder_test
  • velox_dwio_dwrf_layout_planner_test
  • velox_dwio_dwrf_ratio_checker_test
  • velox_dwio_dwrf_reader_base_test
  • velox_dwio_dwrf_reader_test
  • velox_dwio_dwrf_rle_test
  • velox_dwio_dwrf_rlev1_encoder_test
  • velox_dwio_dwrf_stream_labels_test
  • velox_dwio_dwrf_stripe_dictionary_cache_test
  • velox_dwio_dwrf_stripe_reader_base_test
  • velox_dwio_dwrf_stripe_stream_test
  • velox_dwio_dwrf_utils
  • velox_dwio_dwrf_utils_test
  • velox_dwio_dwrf_writer
  • velox_dwio_dwrf_writer_context_test
  • velox_dwio_dwrf_writer_encoding_manager_test
  • velox_dwio_dwrf_writer_sink_test
  • velox_dwio_dwrf_writer_test
  • velox_dwio_faulty_file_sink
  • velox_dwio_iceberg_reader_benchmark
  • velox_dwio_iceberg_reader_benchmark_lib
  • velox_dwio_orc_column_statistics_test
  • velox_dwio_orc_reader
  • velox_dwio_orc_reader_filter_test
  • velox_dwio_orc_reader_test
  • velox_dwio_parquet_common
  • velox_dwio_parquet_common_test
  • velox_dwio_parquet_page_reader_test
  • velox_dwio_parquet_reader
  • velox_dwio_parquet_reader_benchmark
  • velox_dwio_parquet_reader_benchmark_lib
  • velox_dwio_parquet_reader_test
  • velox_dwio_parquet_rlebp_decoder_test
  • velox_dwio_parquet_structure_decoder_benchmark
  • velox_dwio_parquet_structure_decoder_test
  • velox_dwio_parquet_table_scan_test
  • velox_dwio_parquet_thrift_test
  • velox_dwio_parquet_tpch_test
  • velox_dwio_parquet_writer
  • velox_dwio_text_reader
  • velox_dwio_text_reader_register
  • velox_dwio_text_writer
  • velox_dwio_text_writer_register
  • velox_dwrf_column_writer_index_test
  • velox_dwrf_column_writer_stats_test
  • velox_dwrf_column_writer_test
  • velox_dwrf_e2e_filter_test
  • velox_dwrf_e2e_reader_test
  • velox_dwrf_e2e_writer_test
  • velox_dwrf_float_column_writer_benchmark
  • velox_dwrf_int_encoder_benchmark
  • velox_dwrf_statistics_builder_utils_test
  • velox_dwrf_test_utils
  • velox_dwrf_writer_extended_test
  • velox_dwrf_writer_flush_test
  • velox_dynamic_library_loader
  • velox_example_expression_eval
  • velox_example_opaque_type
  • velox_example_operator_extensibility
  • velox_example_scan_orc
  • velox_example_simple_functions
  • velox_example_vector_reader_writer
  • velox_exception
  • velox_exchange_benchmark
  • velox_exchange_fuzzer
  • velox_exec
  • velox_exec_SpatialJoinTest
  • velox_exec_bm_duplicate_project
  • velox_exec_infra_test
  • velox_exec_prefixsort_test
  • velox_exec_prefixsort_test_lib
  • velox_exec_spill_stats
  • velox_exec_test_group0
  • velox_exec_test_group1
  • velox_exec_test_group2
  • velox_exec_test_group3
  • velox_exec_test_group4
  • velox_exec_test_group5
  • velox_exec_test_group6
  • velox_exec_test_group7
  • velox_exec_test_lib
  • velox_exec_util_test_group0
  • velox_exec_vector_hasher_benchmark
  • velox_expression_functions
  • velox_expression_fuzzer
  • velox_expression_fuzzer_test
  • velox_expression_fuzzer_test_utility
  • velox_expression_fuzzer_unit_test
  • velox_expression_runner
  • velox_expression_runner_test
  • velox_expression_runner_unit_test
  • velox_expression_test
  • velox_expression_test_utility
  • velox_expression_verifier
  • velox_expression_verifier_unit_test
  • velox_file
  • velox_file_test
  • velox_file_test_utils
  • velox_filemetadata_test
  • velox_filter_benchmark
  • velox_filter_project_benchmark
  • velox_format_datetime_benchmark
  • velox_fragmentation_benchmark
  • velox_function_dynamic_link_test
  • velox_function_registry
  • velox_function_registry_test
  • velox_functions_aggregates
  • velox_functions_aggregates_test
  • velox_functions_aggregates_test_lib
  • velox_functions_benchmarks_compare
  • velox_functions_benchmarks_row_writer_no_nulls
  • velox_functions_benchmarks_simdjson_function_with_expr
  • velox_functions_benchmarks_string_writer_no_nulls
  • velox_functions_benchmarks_url
  • velox_functions_geo
  • velox_functions_iceberg_hash
  • velox_functions_iceberg_test
  • velox_functions_json
  • velox_functions_json_test
  • velox_functions_lib_date_time_formatter
  • velox_functions_lib_test
  • velox_functions_prestosql_benchmarks_array_contains
  • velox_functions_prestosql_benchmarks_array_min_max
  • velox_functions_prestosql_benchmarks_array_position
  • velox_functions_prestosql_benchmarks_array_sum
  • velox_functions_prestosql_benchmarks_bitwise
  • velox_functions_prestosql_benchmarks_cardinality
  • velox_functions_prestosql_benchmarks_comparisons
  • velox_functions_prestosql_benchmarks_concat
  • velox_functions_prestosql_benchmarks_date_time
  • velox_functions_prestosql_benchmarks_field_reference
  • velox_functions_prestosql_benchmarks_generic
  • velox_functions_prestosql_benchmarks_in
  • velox_functions_prestosql_benchmarks_map_concat
  • velox_functions_prestosql_benchmarks_map_except
  • velox_functions_prestosql_benchmarks_map_input
  • velox_functions_prestosql_benchmarks_map_intersect
  • velox_functions_prestosql_benchmarks_map_subscript
  • velox_functions_prestosql_benchmarks_map_zip_with
  • velox_functions_prestosql_benchmarks_not
  • velox_functions_prestosql_benchmarks_regexp_replace
  • velox_functions_prestosql_benchmarks_row
  • velox_functions_prestosql_benchmarks_string_ascii_utf_functions
  • velox_functions_prestosql_benchmarks_uuid_cast
  • velox_functions_prestosql_benchmarks_width_bucket
  • velox_functions_prestosql_benchmarks_zip
  • velox_functions_prestosql_benchmarks_zip_with
  • velox_functions_sfm
  • velox_functions_sfm_test
  • velox_functions_spark_aggregates
  • velox_functions_spark_aggregates_test
  • velox_functions_spark_test
  • velox_functions_spark_window
  • velox_functions_test_lib
  • velox_functions_util
  • velox_functions_window
  • velox_functions_window_test_lib
  • velox_fuzzer_connector
  • velox_fuzzer_connector_test
  • velox_fuzzer_util
  • velox_gcs
  • velox_gcs_file_test
  • velox_gcs_insert_test
  • velox_gcs_multiendpoints_test
  • velox_gcsfile_example
  • velox_hash_benchmark
  • velox_hash_join_build_benchmark
  • velox_hash_join_list_result_benchmark
  • velox_hash_join_prepare_join_table_benchmark
  • velox_hdfs
  • velox_hdfs_file_test
  • velox_hdfs_insert_test
  • velox_hierarchical_timer
  • velox_hierarchical_timer_test
  • velox_hive_config
  • velox_hive_connector
  • velox_hive_connector_test
  • velox_hive_iceberg_deletion_vector_test
  • velox_hive_iceberg_deletion_vector_writer_test
  • velox_hive_iceberg_equality_delete_test
  • velox_hive_iceberg_test
  • velox_hive_paimon_connector
  • velox_hive_paimon_data_file_meta_test
  • velox_hive_paimon_deletion_file_test
  • velox_hive_paimon_row_kind_test
  • velox_hive_paimon_split
  • velox_hive_paimon_split_test
  • velox_hive_partition_function
  • velox_hive_partition_function_benchmark
  • velox_hive_writer_options_adapter_test
  • velox_id_map
  • velox_id_map_test
  • velox_in_10_min_demo
  • velox_is_null_functions
  • velox_join_fuzzer
  • velox_key_encoder
  • velox_key_encoder_test
  • velox_like_benchmark
  • velox_like_tpch_benchmark
  • velox_mark_distinct_fuzzer
  • velox_mark_distinct_fuzzer_lib
  • velox_mark_sorted_benchmark
  • velox_memcpy_meter
  • velox_memory
  • velox_memory_arbitration_fuzzer
  • velox_memory_test
  • velox_merge_benchmark
  • velox_mock_rpc_client
  • velox_mock_rpc_client_test
  • velox_negated_bigint_range_benchmark
  • velox_negated_bytes_range_benchmark
  • velox_negated_bytes_values_benchmark
  • velox_negated_values_filter_benchmark
  • velox_numeric_upcast_benchmark
  • velox_orderby_benchmark
  • velox_orderby_benchmark_util
  • velox_parquet_e2e_filter_test
  • velox_parquet_writer_sink_test
  • velox_parquet_writer_test
  • velox_parse_expression
  • velox_parse_parser
  • velox_parse_test
  • velox_parse_utils
  • velox_prefix_sort_algorithm_benchmark
  • velox_prefixsort_benchmark
  • velox_presto_serializer
  • velox_presto_type_parser
  • velox_presto_type_parser_test
  • velox_presto_types
  • velox_presto_types_fuzzer_utils
  • velox_presto_types_fuzzer_utils_test
  • velox_presto_types_test
  • velox_prestosql_coverage
  • velox_process
  • velox_process_test
  • velox_query_benchmark
  • velox_query_config_provider
  • velox_query_config_provider_test
  • velox_query_replayer
  • velox_query_trace_replayer_base
  • velox_re2_functions_benchmarks
  • velox_read_benchmark
  • velox_read_benchmark_lib
  • velox_row_fast
  • velox_row_number_fuzzer
  • velox_row_number_fuzzer_lib
  • velox_row_serializer_benchmark
  • velox_row_test
  • velox_rpc_function_stubs
  • velox_rpc_node_test
  • velox_rpc_operator
  • velox_rpc_operator_test
  • velox_rpc_plan_node_translator
  • velox_rpc_state
  • velox_rpc_state_test
  • velox_s3config_test
  • velox_s3file_test
  • velox_s3finalize_test
  • velox_s3fs
  • velox_s3insert_test
  • velox_s3metrics_test
  • velox_s3multiendpoints_test
  • velox_s3read_test
  • velox_s3registration_test
  • velox_scoped_registry_test
  • velox_serialization
  • velox_serialization_test
  • velox_serializer_benchmark
  • velox_serializer_test_group0
  • velox_signature_parser
  • velox_signature_parser_test
  • velox_simple_aggregate
  • velox_simple_aggregate_test
  • velox_sort_benchmark
  • velox_spark_function_registry_test
  • velox_spark_query_runner
  • velox_spark_query_runner_test
  • velox_spark_windows_test
  • velox_sparksql_benchmarks_cast
  • velox_sparksql_benchmarks_compare
  • velox_sparksql_benchmarks_from_json
  • velox_sparksql_benchmarks_get_funcs
  • velox_sparksql_benchmarks_hash
  • velox_sparksql_benchmarks_in
  • velox_sparksql_benchmarks_simd_compare
  • velox_sparksql_benchmarks_split
  • velox_sparksql_coverage
  • velox_spatial_join_benchmark
  • velox_spatial_join_fuzzer
  • velox_spill_fuzzer_base_lib
  • velox_spiller_aggregate_benchmark
  • velox_spiller_aggregate_benchmark_base
  • velox_spiller_join_benchmark
  • velox_spiller_join_benchmark_base
  • velox_streaming_aggregation_benchmark
  • velox_string_core_benchmark
  • velox_string_view_benchmark
  • velox_table_evolution_fuzzer_test
  • velox_test_util
  • velox_test_util_test
  • velox_text_reader_test
  • velox_text_writer_test
  • velox_time
  • velox_time_test
  • velox_tool_trace_test
  • velox_topn_row_number_fuzzer
  • velox_topn_row_number_fuzzer_lib
  • velox_tpcds_append_info
  • velox_tpcds_benchmark
  • velox_tpcds_benchmark_lib
  • velox_tpcds_connector
  • velox_tpcds_connector_test
  • velox_tpcds_dsdgen
  • velox_tpcds_gen
  • velox_tpcds_gen_test
  • velox_tpch_benchmark
  • velox_tpch_benchmark_lib
  • velox_tpch_connector
  • velox_tpch_connector_test
  • velox_tpch_gen
  • velox_tpch_gen_test
  • velox_tpch_speed_test
  • velox_trace
  • velox_trace_file_tool
  • velox_trace_file_tool_base
  • velox_type_calculation
  • velox_type_fbhive
  • velox_type_fbhive_test
  • velox_type_serializer_fbhive_test
  • velox_type_signature
  • velox_type_test
  • velox_type_tz_ext_invalid_test
  • velox_type_tz_test
  • velox_unsafe_row_serialize_benchmark
  • velox_vector
  • velox_vector_fuzzer
  • velox_vector_fuzzer_test
  • velox_vector_fuzzer_util
  • velox_vector_hash_all_benchmark
  • velox_vector_map_update_benchmark
  • velox_vector_selectivity_vector_benchmark
  • velox_vector_test
  • velox_vector_test_lib
  • velox_wave_benchmark
  • velox_wave_common
  • velox_wave_common_test
  • velox_wave_decode
  • velox_wave_decode_test
  • velox_wave_exec
  • velox_wave_exec_test
  • velox_wave_mock_file
  • velox_wave_mock_reader
  • velox_wave_vector
  • velox_window
  • velox_window_fuzzer
  • velox_window_fuzzer_test
  • velox_window_prefixsort_benchmark
  • velox_window_sub_partitioned_sort_benchmark
  • velox_windows_agg_test
  • velox_windows_rank_test
  • velox_windows_value_test
  • velox_writer_fuzzer
  • velox_writer_fuzzer_test

Slow path • Graph generated from PR branch

@mohsaka mohsaka changed the title from [Revert "refactor: Revert iceberg data file statistics changes and fixes.] to [feat: Revert "refactor: Revert iceberg data file statistics changes" and fixes.] Apr 30, 2026
@mohsaka mohsaka changed the title from [feat: Revert "refactor: Revert iceberg data file statistics changes" and fixes.] to [feat: Revert refactor: Revert iceberg data file statistics changes and additional fixes.] Apr 30, 2026
@mohsaka mohsaka changed the title from [feat: Revert refactor: Revert iceberg data file statistics changes and additional fixes.] to [feat: Revert refactor Revert iceberg data file statistics changes and additional fixes.] Apr 30, 2026
@mohsaka mohsaka changed the title from [feat: Revert refactor Revert iceberg data file statistics changes and additional fixes.] to [feat: Revert "refactor: Revert iceberg data file statistics changes" and additional fixes.] Apr 30, 2026
@mohsaka mohsaka changed the title from [feat: Revert "refactor: Revert iceberg data file statistics changes" and additional fixes.] to [feat: Revert "refactor: Revert iceberg data file statistics changes" and additional fixes] Apr 30, 2026
@mohsaka mohsaka force-pushed the add-stats branch 3 times, most recently from 25e85e8 to 434217a Compare May 1, 2026 00:11
@github-actions

github-actions Bot commented May 1, 2026

CI Failure Analysis

Auto-generated by the CI Failure Analysis workflow. This comment is updated in place each time CI fails on a new commit, so it always reflects the latest run — re-pushing or re-running CI will refresh the analysis below. Last updated 2026-05-15 03:20:31 UTC from workflow run 25897353838.

🔴 Expression Fuzzer with Presto SOT — FUZZER Failure

Failed instances: All 4 instances failed (seeds: 59238549, 932524077, 400533294, 254235930)

All instances failed with Velox and reference DB results don't match:

ExpressionVerifier.cpp:475, Function:verify
Expression: exec::test::assertEqualResults(
    referenceEvalResult.value(), projectionPlan->outputType(), {commonEvalResultRow})
Reason: Velox and reference DB results don't match

The mismatch data involves map columns with key-value pairs (BIGINT keys, DOUBLE values), indicating a result comparison failure between Velox evaluation and the Presto reference server.


🔴 Join Fuzzer — FUZZER Failure

Failed instances: 2 of 4 instances failed (seeds: 226857074, 292938254); instances 2 and 3 passed.

JoinFuzzer.cpp:740, Function:verify
Expression: test::assertEqualResults(
    referenceResult.value(), defaultPlan.plan->outputType(), {expected})
Reason: Velox and Reference results don't match

Instance 1 (seed=226857074): Row mismatch in map column — a single key value differs by a small amount (7560112571951994724 vs 7560112586697594724), suggesting a subtle computation or data generation issue.

Instance 4 (seed=292938254): Similar assertEqualResults failure, also preceded by HashJoinBridge.cpp:372 errors ("Getting spill input after join is aborted").


🔴 Window Fuzzer with Presto as source of truth — FUZZER Failure

Failed instances: 2 of 4 instances failed (seeds: 233822625, 432741263); instances 1 and 3 passed.

WindowFuzzer.cpp:802, Function:verifyWindow
Expression: assertEqualResults(
    expectedResult.value(), plan->outputType(), {resultOrError.result})
Reason: Velox and reference DB results don't match

Instance 2 (seed=233822625): max(ROW["c0"],ROW["c1"]) with RANGE between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING, partition columns include ARRAY<TIMESTAMP WITH TIME ZONE> and ROW<..., TIMESTAMP WITH TIME ZONE, ...>.

Instance 4 (seed=432741263): min(ROW["c0"],ROW["c1"]) with RANGE between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING, partition columns also include TIMESTAMP WITH TIME ZONE variants.

Both involve parse_datetime with timezone conversion in the plan, consistent with the known TIMESTAMP WITH TIME ZONE mismatch tracked in #17522.


Correlation with PR changes:

The PR (#17388) modifies only:

  • velox/connectors/hive/iceberg/ — Iceberg data sink statistics collection (new IcebergDataFileStatistics, IcebergParquetStatsCollector, Parquet stats integration)
  • velox/dwio/parquet/writer/arrow/ — Parquet writer statistics and string truncation
  • velox/functions/lib/string/StringImpl.h — String utility changes

None of these files are related to expression evaluation, join execution, window functions, or TIMESTAMP WITH TIME ZONE handling. The fuzzer failures are in the expression evaluation layer (ExpressionVerifier), join execution (JoinFuzzer), and window execution (WindowFuzzer), which are entirely different subsystems from Iceberg/Parquet write path.

Known issues:

  • ⚠️ The same three fuzzer jobs (Expression, Join, Window) also fail on the main branch. The last 5 consecutive Fuzzer Jobs runs on main all have failure conclusion (run 25892475646, run 25892374216, etc.), with the same fuzzer jobs failing.
  • 🐛 Issue #17522 — "Fuzzer mismatch on TIMESTAMP WITH TIME ZONE: Velox value off by 1 hour vs Presto" — directly matches the Window Fuzzer failures involving TIMESTAMP WITH TIME ZONE partition columns and parse_datetime timezone conversions.

These are pre-existing/flaky failures unrelated to this PR.

Reproduce locally:

# Expression Fuzzer (any seed from: 59238549, 932524077, 400533294, 254235930)
./_build/debug/velox/expression/fuzzer/velox_expression_fuzzer_test \
    --seed 59238549 --duration_sec 60 --logtostderr --minloglevel=0

# Join Fuzzer (seeds: 226857074, 292938254)
./_build/debug/velox/exec/fuzzer/velox_join_fuzzer_test \
    --seed 226857074 --duration_sec 60 --logtostderr --minloglevel=0

# Window Fuzzer (seeds: 233822625, 432741263)
./_build/debug/velox/functions/prestosql/fuzzer/velox_window_fuzzer_test \
    --seed 233822625 --duration_sec 60 --logtostderr --minloglevel=0

Note: Reproducing these failures locally requires a running Presto server as the reference source of truth (the fuzzers compare Velox results against Presto query results).

Recommended fix: No action needed from this PR. These are pre-existing fuzzer failures on main — tracked in issue #17522 and other known fuzzer flakiness.

@mohsaka mohsaka force-pushed the add-stats branch 6 times, most recently from fd6409c to 5a2a746 Compare May 1, 2026 21:03
@mohsaka mohsaka marked this pull request as ready for review May 4, 2026 15:26
@mohsaka mohsaka requested a review from majetideepak as a code owner May 4, 2026 15:26
@aditi-pandit aditi-pandit requested a review from Copilot May 4, 2026 18:43

Copilot AI left a comment


Pull request overview

This PR restores and extends Iceberg Parquet data-file statistics collection and fixes failures caused by applying UTF-8 “round up” logic to non-UTF8 VARBINARY values. It also introduces a cross-format FileMetadata return from dwio::common::Writer::close() so Iceberg can aggregate Parquet row-group metadata into Iceberg commit-task metrics.

Changes:

  • Add roundUpBinary() and use it for Parquet BYTE_ARRAY upper-bound computation when the column is logical BINARY/VARBINARY (avoids UTF-8 validation/infinite-loop behavior).
  • Introduce dwio::common::FileMetadata and change writer close() APIs to return format-specific metadata (implemented for Parquet/DWRF/Text and plumbed through SortingWriter).
  • Add Iceberg stats plumbing (IcebergDataFileStatistics, IcebergParquetStatsCollector) and comprehensive Parquet stats tests; update Iceberg commit message metrics to include full Iceberg metrics JSON.
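
The binary round-up in the first bullet can be sketched as follows. This is a minimal illustration modeled on the truncate-then-increment approach of Iceberg's BinaryUtil.truncateBinaryMax() cited in the PR description, not the code added by this PR. The name roundUpBinarySketch, its signature, and the choice to return std::nullopt for a zero length are assumptions made for the example.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical sketch, not the Velox implementation.
// Returns a value of at most maxLength bytes that compares >= input
// byte-wise, or std::nullopt when no such value exists.
std::optional<std::string> roundUpBinarySketch(
    std::string_view input,
    size_t maxLength) {
  if (maxLength == 0) {
    return std::nullopt; // Assumption: treat a zero length as "no bound".
  }
  if (input.size() <= maxLength) {
    return std::string(input); // Short enough; the value is its own bound.
  }
  std::string prefix(input.substr(0, maxLength));
  // Walk backwards to the last byte that can be incremented without overflow.
  for (size_t i = prefix.size(); i > 0; --i) {
    const auto byte = static_cast<unsigned char>(prefix[i - 1]);
    if (byte != 0xFF) {
      prefix[i - 1] = static_cast<char>(byte + 1);
      prefix.resize(i); // Drop the trailing 0xFF bytes.
      return prefix;
    }
  }
  // Every byte in the prefix is 0xFF, so no shorter upper bound exists.
  return std::nullopt;
}
```

The loop only inspects raw bytes and shortens the prefix on every iteration, so it terminates on any input, including data that is not valid UTF-8.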

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| velox/functions/lib/string/StringImpl.h | Adds roundUpBinary() for Iceberg-style binary upper bounds. |
| velox/dwio/text/writer/TextWriter.h | Updates writer close() to return FileMetadata; adds placeholder metadata type. |
| velox/dwio/text/writer/TextWriter.cpp | Returns TextFileMetadata from close(). |
| velox/dwio/parquet/writer/WriterConfig.h | Adds Parquet writer config constants split out from Arrow-heavy headers. |
| velox/dwio/parquet/writer/Writer.h | Adds ParquetFileMetadata, makes close() return metadata, and wires WriterOptions to WriterConfig. |
| velox/dwio/parquet/writer/Writer.cpp | Returns ParquetFileMetadata from close() after flushing/closing Arrow writer. |
| velox/dwio/parquet/writer/CMakeLists.txt | Exposes WriterConfig.h as interface headers. |
| velox/dwio/parquet/writer/arrow/Statistics.cpp | Chooses UTF-8 vs binary rounding for BYTE_ARRAY upper bounds based on logical type. |
| velox/dwio/dwrf/writer/Writer.h | Updates close() signature and adds placeholder metadata type. |
| velox/dwio/dwrf/writer/Writer.cpp | Returns DwrfFileMetadata from close(). |
| velox/dwio/common/Writer.h | Changes Writer::close() to return std::unique_ptr<FileMetadata>. |
| velox/dwio/common/tests/WriterTest.cpp | Updates mock writer to match new close() signature. |
| velox/dwio/common/tests/SortingWriterTest.cpp | Updates mock writer to match new close() signature. |
| velox/dwio/common/SortingWriter.h | Updates close() signature and documents metadata return. |
| velox/dwio/common/SortingWriter.cpp | Propagates metadata return from wrapped writer. |
| velox/dwio/common/FileMetadata.h | Adds new base class for format-specific file metadata. |
| velox/dwio/common/CMakeLists.txt | Adds FileMetadata.h to the common library headers. |
| velox/connectors/hive/iceberg/tests/IcebergParquetStatsTest.cpp | Adds end-to-end tests for Parquet Iceberg stats (bounds/counts/nulls/nans/nested). |
| velox/connectors/hive/iceberg/tests/CMakeLists.txt | Registers the new Parquet stats test. |
| velox/connectors/hive/iceberg/IcebergParquetStatsCollector.h | Declares aggregator from Parquet metadata to Iceberg metrics. |
| velox/connectors/hive/iceberg/IcebergParquetStatsCollector.cpp | Implements stats aggregation across row groups and bounds handling rules. |
| velox/connectors/hive/iceberg/IcebergDataSink.h | Adds stats storage, Parquet stats collector, and overrides rotation/close to capture metadata. |
| velox/connectors/hive/iceberg/IcebergDataSink.cpp | Populates commit-task metrics from collected file stats and captures writer metadata on rotate/close. |
| velox/connectors/hive/iceberg/IcebergDataFileStatistics.h | Introduces an Iceberg metrics struct with JSON serialization. |
| velox/connectors/hive/iceberg/IcebergDataFileStatistics.cpp | Implements toJson() for commit-task metrics. |
| velox/connectors/hive/iceberg/CMakeLists.txt | Builds/links the new Iceberg stats sources (conditionally for Parquet). |


Comment thread velox/functions/lib/string/StringImpl.h Outdated
Comment thread velox/connectors/hive/iceberg/IcebergDataSink.cpp Outdated
Comment thread velox/connectors/hive/iceberg/IcebergDataSink.cpp
Comment thread velox/connectors/hive/iceberg/IcebergDataSink.cpp Outdated
@mohsaka mohsaka force-pushed the add-stats branch 2 times, most recently from e0e8e95 to cd8ed81 Compare May 4, 2026 19:07
Contributor

@mbasmanova mbasmanova left a comment


@mohsaka Thank you for the updates!

Re roundUpBinary: StringImpl.h is a general-purpose string utilities header. roundUpBinary is Iceberg-specific (the docstring even says "used in Apache Iceberg"). Placing it there pollutes a general header with domain-specific logic. The only caller is Statistics.cpp — please move roundUpBinary there as a free function in an anonymous namespace. The concern about Iceberg code in the Parquet area is moot since the dependency already exists. You can still have unit tests — just place both the functions and the tests next to Statistics.cpp, not in a different module. Same applies to roundUpUtf8 — its only caller is also Statistics.cpp. Since you're already touching this code, please move both functions there.

Re WriterConfig: I checked D93449945 (PR #16062) — WriterConfig exists because Gluten's WholeStageResultIterator.cc needs the config constants but can't include Writer.h due to Arrow header conflicts. That's a legitimate reason to keep it. However, WriterOptions should not inherit from WriterConfig — please drop the multiple inheritance. All callers should reference the constants as WriterConfig::kFoo (there are ~31 usages, all in ParquetWriterTest.cpp — a straightforward find-and-replace). Please also add a comment in WriterConfig.h explaining why these constants must stay in a separate header.

Landing risk: This PR changes the Writer::close() signature from virtual void close() to virtual std::unique_ptr<FileMetadata> close() in dwio/common/Writer.h. PR #16062 attempted the same change and it broke Gluten builds and caused vtable mismatches (heap-buffer-overflow in TextWriter::close()), blocked a release, and required VELOX_ENABLE_BACKWARD_COMPATIBILITY macro workarounds across dozens of internal BUCK targets. PR #16062 was merged and then reverted (PR #16999), so none of those workarounds are in place.

Please split out the API changes into a small prep PR: the Writer::close() signature change and WriterConfig.h. We'll land it internally and work through any build issues (Gluten, Axiom, etc.) before adding the Iceberg stats on top.

@kgpai
Contributor

kgpai commented May 13, 2026

FYI, folks: I am working on fixing the Netlify errors and hope to have a PR up soon (within a day).

@mohsaka
Collaborator Author

mohsaka commented May 14, 2026

Prep-PR opened here:
#17509

meta-codesync Bot pushed a commit that referenced this pull request May 14, 2026
…Config constants (#17509)

Summary:
This PR makes two improvements to the Velox writer infrastructure:

**1. Return FileMetadata from Writer::close()**
- Modified `Writer::close()` to return `std::unique_ptr<FileMetadata>` instead of void
- Added `FileMetadata` base class and format-specific implementations (`ParquetFileMetadata`, `TextFileMetadata`)
- Enables callers to access file-level statistics and metadata after writing
- Returns nullptr for empty files
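
A minimal sketch of the pattern described in this item, assuming simplified stand-in types: the names FileMetadataSketch, ParquetFileMetadataSketch, WriterSketch, ParquetWriterSketch, and rowsWritten are hypothetical and only mirror the shape of the API discussed here, not the actual Velox classes or their fields.

```cpp
#include <cstdint>
#include <memory>

// Hypothetical base type for format-specific metadata.
struct FileMetadataSketch {
  virtual ~FileMetadataSketch() = default;
};

// Hypothetical Parquet-flavored metadata carrying a single statistic.
struct ParquetFileMetadataSketch : FileMetadataSketch {
  int64_t numRows{0};
};

class WriterSketch {
 public:
  virtual ~WriterSketch() = default;
  // close() hands metadata back to the caller instead of returning void.
  virtual std::unique_ptr<FileMetadataSketch> close() = 0;
};

class ParquetWriterSketch : public WriterSketch {
 public:
  void write(int64_t rows) {
    numRows_ += rows;
  }

  std::unique_ptr<FileMetadataSketch> close() override {
    if (numRows_ == 0) {
      return nullptr; // Empty file: nothing to report.
    }
    auto metadata = std::make_unique<ParquetFileMetadataSketch>();
    metadata->numRows = numRows_;
    return metadata;
  }

 private:
  int64_t numRows_{0};
};

// A caller downcasts to the format it understands and tolerates both a null
// result and an unrelated metadata type. Returns -1 when no Parquet metadata
// is available.
int64_t rowsWritten(WriterSketch& writer) {
  auto metadata = writer.close();
  auto* parquet = dynamic_cast<ParquetFileMetadataSketch*>(metadata.get());
  return parquet != nullptr ? parquet->numRows : -1;
}
```

Because the return type of a virtual function is part of the interface every subclass overrides, a change like this has to be picked up by all writer implementations and their callers at once, which is the landing risk raised in the review.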

**2. Add WriterConfig constants**
- Created new `WriterConfig.h` header with Parquet writer configuration constants
- Allows external projects (e.g., Gluten) to access config constants without Arrow dependencies
- Updated all test references to use new constants

All existing tests updated and passing. Prep-PR for #17388.

Pull Request resolved: #17509

Reviewed By: apurva-meta

Differential Revision: D105173381

Pulled By: mbasmanova

fbshipit-source-id: 39bfbbd2445ea73291c8551e142e79120b8cefde
@mohsaka
Collaborator Author

mohsaka commented May 14, 2026

@mbasmanova Thank you for merging in the other two comments. I've addressed the last one in the newest commit of this PR, though I couldn't quite wrap my head around putting both the functions and the test cases into Statistics.cpp.

That said, I've moved the Iceberg statistics functions out of StringImpl.h and into a dedicated header, velox/dwio/parquet/writer/arrow/IcebergStatisticsStringUtils.h, to avoid polluting the general-purpose string utilities header.

I did try putting these functions in an anonymous namespace in Statistics.cpp, but that made them accessible only from within that .cpp file. As a result, I could not call them from StatisticsTest.cpp, which is where I moved the test cases from StringImplTest.cpp.

Could you let me know if this is okay, or whether you had something else in mind? Thank you!

@mohsaka mohsaka force-pushed the add-stats branch 2 times, most recently from 95e10de to 1f5df19 Compare May 14, 2026 22:24
@mohsaka mohsaka requested a review from mbasmanova May 14, 2026 22:27
@mbasmanova
Contributor

@mohsaka Thank you for the update!

A few issues with the current placement:

  • roundUpUtf8 is still in StringImpl.h (only roundUpBinary was moved). Please move both.
  • IcebergStatisticsStringUtils.h — the name violates the *Utils naming rule, and the functions aren't Iceberg-specific. They're general-purpose string truncation operations. Rename to something like StringTruncation.h.
  • The header has an anonymous namespace — this is a bug in a header file (every translation unit gets its own copy). Move implementations to a .cpp file.
  • The roundUp/truncate tests are in StatisticsTest.cpp but they test the truncation functions, not statistics. Create a separate test file (e.g., StringTruncationTest.cpp).

@mohsaka
Collaborator Author

mohsaka commented May 15, 2026

@mbasmanova Thank you for the feedback again! I couldn't find roundUpUtf8 in StringImpl.h at commit 384e88e. Maybe the review was using an old commit?

Addressed all of the other comments. Please take another look when you have the chance!

Contributor

@mbasmanova mbasmanova left a comment


Thank you for addressing all the feedback!

@mbasmanova mbasmanova added the ready-to-merge label May 15, 2026
@mbasmanova
Contributor

Hi @mohsaka,

I'm landing fixes on top of your PR (D104861208 / #17388) to clear the import-side CI failures and address a few inline review items. Heads-up so you're not surprised when the merged commit shows additional OSS-visible changes.

velox/dwio/parquet/writer/arrow/Statistics.cpp

  • icebergLowerBoundInclusive now dispatches on descr_->logicalType()->isString(), mirroring the existing dispatch in icebergUpperBoundExclusive. Non-string ByteArray takes a raw-byte string_view::substr prefix instead of truncateUtf8. Closes the asymmetry where the lower bound used truncateUtf8 even on raw binary.
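
The dispatch described above can be illustrated with a small sketch. It assumes a code-point-based truncation for strings and a raw-byte prefix for everything else; lowerBoundSketch and truncateUtf8Sketch are hypothetical names, the boolean parameter stands in for the logical-type check, and the UTF-8 helper assumes well-formed input.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: keeps at most maxCodePoints UTF-8 code points.
// Assumes the input is well-formed UTF-8.
std::string_view truncateUtf8Sketch(
    std::string_view input,
    size_t maxCodePoints) {
  size_t pos = 0;
  size_t count = 0;
  while (pos < input.size() && count < maxCodePoints) {
    ++pos; // Consume the lead byte.
    // Consume continuation bytes of the form 10xxxxxx.
    while (pos < input.size() &&
           (static_cast<unsigned char>(input[pos]) & 0xC0) == 0x80) {
      ++pos;
    }
    ++count;
  }
  return input.substr(0, pos);
}

// Hypothetical sketch of an inclusive lower bound. Any prefix of a value
// compares <= the value byte-wise, so a prefix is always a valid lower bound.
std::string_view lowerBoundSketch(
    std::string_view value,
    bool isString,
    size_t length) {
  if (isString) {
    return truncateUtf8Sketch(value, length);
  }
  // Raw binary: take the first `length` bytes without decoding anything.
  return value.substr(0, length);
}
```

For the string path the cut never lands inside a multi-byte character, while the binary path may cut anywhere, which is harmless because the bytes are not interpreted as text.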

velox/dwio/parquet/writer/arrow/tests/StatisticsTest.cpp

  • Added TEST(IcebergStatistics, byteArrayBoundsBinary) covering the lower/upper bound paths for non-string ByteArray (BINARY/VARBINARY) — verifies raw-byte prefix on the lower bound and binary round-up on the upper bound, with inputs containing 0xff bytes that would not round-trip through the UTF-8 paths.

velox/connectors/hive/iceberg/IcebergDataSink.{h,cpp}

  • createWriterOptions now dispatches through createWriterOptionsAdapter(format) and calls applyPreConfigs / applyPostConfigs. The previous code only handled Parquet inline and silently dropped the DWRF applyPostConfigs step (which sets adjustTimestampToTimezone=false and sessionTimezone=nullptr per the Iceberg spec).
  • Extracted a closeWriterAndCollectStats helper to remove ~15 lines of duplicated close + finalize + aggregate logic between rotateWriter() and closeInternal().
  • Added doc comments on the rotateWriter and closeInternal overrides.

velox/connectors/hive/iceberg/WriterOptionsAdapter.cpp

  • ParquetWriterOptionsAdapter::applyPreConfigs uses parquet::WriterConfig::kParquetSerdeTimestampUnit / kParquetSerdeTimestampTimezone constants instead of the literal strings.

velox/connectors/hive/iceberg/IcebergParquetStatsCollector.h

  • Added a class-level doc comment.

velox/connectors/hive/iceberg/tests/WriterOptionsAdapterTest.cpp

  • Added parquetPreConfigsSetsTimestampSerdeParameters and dwrfPostConfigsOverridesTimestampFields to lock the adapter's pre/post-config contracts.

velox/connectors/hive/iceberg/tests/IcebergDwrfInsertTest.cpp

  • Updated the docstring on timestampRoundTrip (originally added by @apurvak) with a TODO noting that the symmetric Velox-only round-trip cannot, on its own, detect a regression of the DWRF timezone override — the new adapter unit test fills that gap. A true cross-engine validation (e.g., a Java Spark reader) is still needed for end-to-end on-disk verification.

Let me know if any of these don't sit right.

@mohsaka
Collaborator Author

mohsaka commented May 15, 2026

@mbasmanova Everything looks fine to me. Thank you!

@meta-codesync

meta-codesync Bot commented May 15, 2026

@mbasmanova merged this pull request in fd130f4.


Labels

CLA Signed (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed), Merged, ready-to-merge (PRs that have been reviewed and are ready for merging; PRs with this tag notify the Velox Meta oncall)


8 participants