Skip to content

Parquet: Add opt-in uncompressed row group size tracking#16327

Open
nssalian wants to merge 2 commits into
apache:mainfrom
nssalian:fix-parquet-row-group-size
Open

Parquet: Add opt-in uncompressed row group size tracking#16327
nssalian wants to merge 2 commits into
apache:mainfrom
nssalian:fix-parquet-row-group-size

Conversation

@nssalian
Copy link
Copy Markdown
Contributor

@nssalian nssalian commented May 14, 2026

Closes: #16325

Rationale for this Change

Adds write.parquet.row-group-size-check-uncompressed (default false) to accurately enforce write.parquet.row-group-size-bytes when using compressing codecs (GZIP, ZSTD, etc.).

ParquetWriter.checkSize() uses writeStore.getBufferedSize() which reports compressed bytes for flushed pages. With effective compression, the writer never sees the target exceeded because it's comparing compressed data against an uncompressed limit. Row groups grow unbounded.

What changes are included in this PR?

When write.parquet.row-group-size-check-uncompressed=true:

  1. Measures getBufferedSize() before and after model.write() per record. Between these points, data is in uncompressed column buffers (no page flush occurs during model.write()). The delta is the exact uncompressed record size.
  2. Accumulates into rowGroupUncompressedSize. Flushes when it hits the target.
  3. Removes the 100-record minimum check interval floor for the uncompressed path.

Disabled by default.

When enabled getBufferedSize() calls per record. Each call iterates column writers adding field reads. It's the same pattern parquet-mr uses in ColumnWriteStoreBase.sizeCheck(). Kept it optional so as to not cause an immediate change to the default behavior.

Are these changes tested?

  • Parameterized test across all codecs (gzip, snappy, zstd, uncompressed)
  • Existing parquet tests pass locally

Are there any user-facing changes?

Yes. New configuration but set to false by default.

@nssalian
Copy link
Copy Markdown
Contributor Author

nssalian commented May 15, 2026

CC: @pvary @steveloughran @huaxingao @aihuaxu PTAL

Comment thread parquet/src/main/java/org/apache/iceberg/parquet/ParquetWriter.java Outdated
@nssalian nssalian requested a review from steveloughran May 15, 2026 18:42
Copy link
Copy Markdown
Contributor

@steveloughran steveloughran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, good test coverage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parquet: Row group size limit not enforced when using GZIP or ZSTD compression

2 participants