Parquet: Add opt-in uncompressed row group size tracking #16327
Open
nssalian wants to merge 2 commits into
Conversation
nssalian (Author, Contributor):
CC: @pvary @steveloughran @huaxingao @aihuaxu PTAL
steveloughran (Contributor) left a comment:
LGTM, good test coverage.
Closes: #16325
Rationale for this Change
Adds `write.parquet.row-group-size-check-uncompressed` (default: `false`) to accurately enforce `write.parquet.row-group-size-bytes` when using compressing codecs (GZIP, ZSTD, etc.).

`ParquetWriter.checkSize()` uses `writeStore.getBufferedSize()`, which reports compressed bytes for flushed pages. With effective compression, the writer never sees the target exceeded because it compares compressed data against an uncompressed limit, so row groups grow unbounded.

What changes are included in this PR?
When `write.parquet.row-group-size-check-uncompressed=true`:

- The writer calls `getBufferedSize()` before and after `model.write()` for each record. Between these two points the data sits in uncompressed column buffers (no page flush occurs during `model.write()`), so the delta is the exact uncompressed size of the record.
- The delta is accumulated into `rowGroupUncompressedSize`; the row group is flushed when it reaches the target.
- Disabled by default.
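The per-record delta check described above can be sketched with a simplified, self-contained mock. This is a hedged illustration only: `MockWriteStore`, `SizeCheckingWriter`, the 10:1 compression ratio, and the byte sizes are hypothetical and not taken from the PR's actual code.

```java
// Hypothetical mock of the size-check logic; not Iceberg's actual classes.
class MockWriteStore {
  static final long PAGE_SIZE = 1024;  // assumed page size for the mock
  static final double RATIO = 10.0;    // assumed effective compression ratio

  long rawInPage = 0;           // uncompressed bytes in the current page
  long compressedFlushed = 0;   // compressed bytes of already-flushed pages

  // Mirrors writeStore.getBufferedSize(): flushed pages report compressed bytes.
  long getBufferedSize() {
    return compressedFlushed + rawInPage;
  }

  void write(long recordBytes) {
    rawInPage += recordBytes;   // no page flush happens inside write()
  }

  // Pages are flushed (and "compressed") between records, never inside write().
  void maybeFlushPage() {
    if (rawInPage >= PAGE_SIZE) {
      compressedFlushed += (long) (rawInPage / RATIO);
      rawInPage = 0;
    }
  }
}

class SizeCheckingWriter {
  final MockWriteStore store = new MockWriteStore();
  final long targetRowGroupSize;
  final boolean checkUncompressed;   // the new opt-in behavior
  long rowGroupUncompressedSize = 0;
  long recordsInGroup = 0;
  long recordsAtFirstFlush = -1;     // instrumentation for the demo

  SizeCheckingWriter(long targetRowGroupSize, boolean checkUncompressed) {
    this.targetRowGroupSize = targetRowGroupSize;
    this.checkUncompressed = checkUncompressed;
  }

  void writeRecord(long recordBytes) {
    // Measure getBufferedSize() before and after the write; the delta is the
    // exact uncompressed size of the record, since nothing flushes in between.
    long before = store.getBufferedSize();
    store.write(recordBytes);
    rowGroupUncompressedSize += store.getBufferedSize() - before;
    recordsInGroup++;
    store.maybeFlushPage();

    long measured = checkUncompressed
        ? rowGroupUncompressedSize       // new: compare raw bytes to the target
        : store.getBufferedSize();       // old: mostly-compressed bytes
    if (measured >= targetRowGroupSize) {
      if (recordsAtFirstFlush < 0) {
        recordsAtFirstFlush = recordsInGroup;
      }
      // Flush the row group and reset all counters.
      store.compressedFlushed = 0;
      store.rawInPage = 0;
      rowGroupUncompressedSize = 0;
      recordsInGroup = 0;
    }
  }
}
```

With a 100 KB target and 100-byte records, the uncompressed check flushes after 1,000 records, while the compressed check lets the row group grow to roughly 10x that before the compressed bytes reach the target.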
When enabled, there are two extra `getBufferedSize()` calls per record. Each call iterates the column writers, adding field reads. This is the same pattern parquet-mr uses in `ColumnWriteStoreBase.sizeCheck()`. The check is kept optional so as not to cause an immediate change to the default behavior.

Are these changes tested?
Are there any user-facing changes?
Yes. New configuration, but it is set to `false` by default.
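For reference, if merged, the property would presumably be enabled like any other table write property. A hypothetical Spark SQL sketch (the table name and the 128 MB target are illustrative, not from the PR):

```sql
ALTER TABLE db.events SET TBLPROPERTIES (
  'write.parquet.row-group-size-check-uncompressed' = 'true',
  'write.parquet.row-group-size-bytes' = '134217728'
);
```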