
Add support for deletion vector in Iceberg #24882

Merged
ebyhr merged 1 commit into trinodb:master from ebyhr:ebi/iceberg-v3-max
Mar 31, 2025

Conversation

Member

@ebyhr ebyhr commented Feb 3, 2025

Description

Add support for deletion vectors (DV) in Iceberg.

The main difference from V2 positional deletes is that only a single DV file is allowed per data file.
We therefore need to merge the previous DV into the new DV.
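
A minimal sketch of that merge, assuming the deleted row positions are held in a bitmap. `java.util.BitSet` stands in here for the roaring bitmap that the Iceberg spec uses for deletion vectors, and the class and method names are hypothetical, not taken from this PR:

```java
import java.util.BitSet;

// Hypothetical sketch: BitSet is a stand-in for the roaring bitmap of a real deletion vector.
final class DeletionVectorMergeSketch
{
    private DeletionVectorMergeSketch() {}

    // Returns the union of previously deleted positions and newly deleted positions.
    // Neither input is modified.
    static BitSet merge(BitSet previous, BitSet added)
    {
        BitSet merged = (BitSet) previous.clone();
        merged.or(added);
        return merged;
    }

    public static void main(String[] args)
    {
        BitSet previous = new BitSet();
        previous.set(1);
        previous.set(5);

        BitSet added = new BitSet();
        added.set(5);
        added.set(9);

        // The merged vector replaces the previous one, so the data file still has a single DV.
        System.out.println(merge(previous, added)); // prints {1, 5, 9}
    }
}
```

Because the new DV is a superset of the old one, the old DV can be dropped from the table metadata in the same commit that adds the merged one.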

Iceberg says "Position delete files are deprecated in v3", but they are actually prohibited.
The Iceberg library throws an exception if we add legacy position delete files to V3 tables.

The default format version stays at 2 for compatibility with Spark.

Fixes #24457

Release notes

## Iceberg
* Add support for [deletion vectors](https://iceberg.apache.org/spec/#deletion-vectors) of the V3 spec. ({issue}`24457`)

@cla-bot cla-bot Bot added the cla-signed label Feb 3, 2025
@github-actions github-actions Bot added docs iceberg Iceberg connector labels Feb 3, 2025
@ebyhr ebyhr force-pushed the ebi/iceberg-v3-max branch 9 times, most recently from 3104a24 to 6ff0af1 Compare February 7, 2025 05:21
@ebyhr ebyhr force-pushed the ebi/iceberg-v3-max branch 3 times, most recently from 2c25635 to 8f37aff Compare February 15, 2025 08:53
@ebyhr ebyhr force-pushed the ebi/iceberg-v3-max branch 8 times, most recently from 0df8334 to 38a1b97 Compare March 3, 2025 06:27
@ebyhr ebyhr force-pushed the ebi/iceberg-v3-max branch 2 times, most recently from 4d53e64 to 29e196e Compare March 5, 2025 06:53
@ebyhr ebyhr marked this pull request as ready for review March 5, 2025 10:56
@ebyhr ebyhr force-pushed the ebi/iceberg-v3-max branch 2 times, most recently from d2912f1 to f7d4b20 Compare March 5, 2025 11:03
@ebyhr ebyhr requested review from raunaqmorarka and wendigo March 5, 2025 11:05
@ebyhr ebyhr force-pushed the ebi/iceberg-v3-max branch from 68d4564 to f3f8642 Compare March 7, 2025 11:12
Contributor

@findinpath findinpath left a comment

I see that a slight code refactoring is needed around the way we pass task data when creating data files and merging position deletes.

Very nice to see the functionality already working.

I see we're boldly going with format version 3 for the testing, while the default is v2. Maybe we should duplicate the tests for now to be on the safe side.

Looking forward to releasing this functionality to the public ❤️

Comment thread plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java Outdated
Comment on lines 31 to 34
Contributor

This method feels somewhat artificial for the generic IcebergFileWriter.

Contributor

I see it used in IcebergMetadata#finishWrite for PUFFIN only, from within the CommitTaskData.

I don't have a solution to suggest right now, but I feel we can also do better with CommitTaskData.
It currently contains a set of seemingly unrelated fields.

Did we explore the avenue of having DataCommitTaskData & PositionDeleteCommitTaskData (& DeletionVectorCommitTaskData)?
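
One way such a split could look, as a rough sketch only. The three type names come from the question above; the sealed interface, the fields, and the `describe` helper are hypothetical and are not what the PR implements:

```java
// Hypothetical shapes: field names are illustrative and not taken from the PR.
sealed interface CommitTaskDataSketch
        permits DataCommitTaskData, PositionDeleteCommitTaskData, DeletionVectorCommitTaskData
{
    String path();
}

// A newly written data file.
record DataCommitTaskData(String path, long recordCount)
        implements CommitTaskDataSketch {}

// A V2 position delete file that references one data file.
record PositionDeleteCommitTaskData(String path, String referencedDataFile)
        implements CommitTaskDataSketch {}

// A V3 deletion vector: a blob inside a Puffin file, located by offset and size.
record DeletionVectorCommitTaskData(String path, String referencedDataFile, long contentOffset, long contentSizeInBytes)
        implements CommitTaskDataSketch {}

final class CommitTaskDescriber
{
    private CommitTaskDescriber() {}

    // Each variant only carries the fields that are meaningful for it.
    static String describe(CommitTaskDataSketch task)
    {
        if (task instanceof DeletionVectorCommitTaskData dv) {
            return "deletion vector for " + dv.referencedDataFile() + " at offset " + dv.contentOffset();
        }
        if (task instanceof PositionDeleteCommitTaskData delete) {
            return "position delete for " + delete.referencedDataFile();
        }
        return "data file " + task.path();
    }
}
```

With this shape the deletion-vector-only fields no longer need to be optional on a shared type.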

@findinpath

This comment was marked as outdated.

@ebyhr ebyhr force-pushed the ebi/iceberg-v3-max branch 3 times, most recently from 9684b95 to 1c09d52 Compare March 11, 2025 00:25
@ebyhr ebyhr requested a review from anusudarsan March 14, 2025 02:57
Comment thread plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergConfig.java Outdated
Comment thread plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMergeSink.java Outdated
Comment thread plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/CommitTaskData.java Outdated
Comment thread plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/CommitTaskData.java Outdated
Comment thread plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergFileWriter.java Outdated
@ebyhr ebyhr force-pushed the ebi/iceberg-v3-max branch from 1c09d52 to e5ed0cf Compare March 18, 2025 02:12
Member Author

ebyhr commented Mar 18, 2025

Addressed comments.

@ebyhr ebyhr force-pushed the ebi/iceberg-v3-max branch from e5ed0cf to d7d71c3 Compare March 18, 2025 07:31
Comment thread plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java Outdated
Member Author

ebyhr commented Mar 18, 2025

Addressed comments.

Member

@anusudarsan anusudarsan left a comment

LGTM. Thanks

Comment thread plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/TestIcebergV2.java Outdated
@ebyhr ebyhr force-pushed the ebi/iceberg-v3-max branch from 0303555 to 8971cf6 Compare March 20, 2025 23:40
Member Author

ebyhr commented Mar 20, 2025

Squashed commits into one and added TestIcebergV2.testPositionDeleteAndDeletionVector, which executes DELETE statements with both V2 and V3 on the same table.

Comment thread plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java Outdated
@ebyhr ebyhr requested a review from wendigo March 24, 2025 14:00
@wendigo wendigo requested a review from Copilot March 27, 2025 10:12

Copilot AI left a comment

Pull Request Overview

This PR adds support for deletion vectors in Iceberg by introducing new file writers and updating related components to work with the V3 deletion vector specification. Key changes include:

  • Adding new classes DeletionVectorWriter and DeletionVectorFileWriter to handle deletion vector file creation and merging.
  • Updating methods and constructors across multiple files (e.g. DeleteManager, IcebergWritableTableHandle, IcebergFileWriterFactory) to propagate and work with deletion vector related data.
  • Adjusting error messages, configuration defaults, and file format handling to support the new V3 spec while maintaining compatibility with format version 2 for Spark.

Reviewed Changes

Copilot reviewed 43 out of 44 changed files in this pull request and generated no comments.

Show a summary per file
| File | Description |
| --- | --- |
| plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DeletionVectorWriter.java | Introduces writer for deletion vector files using PUFFIN format. |
| plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DeletionVectorFileWriter.java | Implements file writing logic for deletion vectors including merging previous deletes. |
| plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/delete/DeleteManager.java | Updates delete predicate creation to support deletion vectors. |
| Various Iceberg* files | Update writers, metadata, and configuration to include deletion vector support in the merge/commit paths. |
| docs/src/main/sphinx/connector/iceberg.md | Updates documentation for the supported Iceberg format versions. |

Files not reviewed (1)
  • plugin/trino-iceberg/pom.xml: Language not supported
Comments suppressed due to low confidence (3)

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergFileWriterFactory.java:175

  • Consider validating that DELETE_FILE_POS exists in POSITION_DELETE_SCHEMA.columns() before using its index to avoid a -1 result, which could lead to unexpected behavior.
int positionChannel = POSITION_DELETE_SCHEMA.columns().indexOf(DELETE_FILE_POS);
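
The kind of guard this comment asks for could look roughly like the sketch below. The helper class and method are hypothetical and use a plain `List`, not the actual Iceberg schema types:

```java
import java.util.List;

// Hypothetical helper: fails fast instead of letting indexOf's -1 propagate as a channel number.
final class ColumnIndexSketch
{
    private ColumnIndexSketch() {}

    static <T> int requiredIndexOf(List<T> columns, T column)
    {
        int index = columns.indexOf(column);
        if (index < 0) {
            throw new IllegalArgumentException("Column not found: " + column);
        }
        return index;
    }
}
```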

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/CommitTaskData.java:28

  • [nitpick] Ensure that the transition from IcebergFileFormat to FileFormat is coordinated across the codebase to maintain consistent file format handling.
FileFormat fileFormat,

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergMetadata.java:3166

  • [nitpick] Consider including additional contextual information (e.g., the associated data file path) in the error message to improve debuggability when deletion vector metadata is missing.
deleteBuilder.withContentOffset(task.deletionVectorContentOffset().orElseThrow(() -> new IllegalArgumentException("deletionVectorContentOffset is required for deletion vector")));
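
A sketch of what a more contextual message might look like. The helper below is hypothetical, and it assumes the offset is an `OptionalLong` and that the data file path is available as a `String` at this point, neither of which is confirmed by the snippet above:

```java
import java.util.OptionalLong;

// Hypothetical helper: includes the data file path in the failure message.
final class DeletionVectorOffsetSketch
{
    private DeletionVectorOffsetSketch() {}

    static long requireContentOffset(OptionalLong contentOffset, String dataFilePath)
    {
        return contentOffset.orElseThrow(() -> new IllegalArgumentException(
                "deletionVectorContentOffset is required for deletion vector of data file: " + dataFilePath));
    }
}
```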

@ebyhr ebyhr requested a review from electrum March 27, 2025 20:31
@ebyhr ebyhr force-pushed the ebi/iceberg-v3-max branch from 8971cf6 to 94f838c Compare March 31, 2025 00:04
Contributor

@wendigo wendigo left a comment

LGTM

Comment thread plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/CommitTaskData.java Outdated
Comment thread plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergConfig.java Outdated
Comment thread plugin/trino-iceberg/src/main/java/org/apache/iceberg/ContentFileParsers.java Outdated
@ebyhr ebyhr force-pushed the ebi/iceberg-v3-max branch from 94f838c to fbb5780 Compare March 31, 2025 11:06
@ebyhr ebyhr force-pushed the ebi/iceberg-v3-max branch from fbb5780 to 4d11dca Compare March 31, 2025 21:44
@ebyhr ebyhr merged commit 2645d51 into trinodb:master Mar 31, 2025
97 checks passed
@ebyhr ebyhr deleted the ebi/iceberg-v3-max branch March 31, 2025 22:39
@github-actions github-actions Bot added this to the 475 milestone Mar 31, 2025

Development

Successfully merging this pull request may close these issues.

Add support for deletion vector in Iceberg connector

5 participants