
Core: Introduce interface for column files#16285

Open
gaborkaszab wants to merge 1 commit into apache:main from gaborkaszab:main_column_file_interface
Conversation

@gaborkaszab
Contributor

This change introduces the interface for column files and integrates it into the schema for TrackedFile.

@github-actions github-actions Bot added the core label May 11, 2026
Types.NestedField FILE_SIZE_IN_BYTES =
    Types.NestedField.required(
        162, "file_size_in_bytes", Types.LongType.get(), "Total column file size in bytes");
Types.NestedField SEQUENCE_NUMBER =
Contributor Author

The usage for this was meant to be for equality deletes. Since we agree not to support it together with column updates, I think we can drop seq_nums from this schema.

Contributor Author

Maybe we can use this for last-updated-sequence-number for the rows that we update:

  • The rows that are unchanged can have last-updated-seq-num written into the update file
  • The rows that are changed can have that field as null and derive the last-updated-seq-num from the metadata here

Contributor Author

Took another look: I think we need this field to provide _last_updated_sequence_number. For that, we either write the value to the file or, if it is null there, use this one here in the metadata. To emphasize: this is the file sequence number.
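
The fallback described in this comment can be sketched in a few lines of plain Java. This is only an illustration of the idea as it stood at this point in the thread, not code from the PR: the class name `LastUpdatedSeqNum`, the method `resolve`, and both parameter names are hypothetical, and the sketch assumes the per-row value read from the column file is a nullable `Long`.

```java
// Hypothetical helper for illustration; not part of the PR or of Iceberg.
final class LastUpdatedSeqNum {
  private LastUpdatedSeqNum() {}

  /**
   * Resolves _last_updated_sequence_number for a single row.
   *
   * @param valueInFile per-row value stored in the column file, or null if it was not written
   * @param fileSequenceNumber file sequence number of the column file, taken from metadata
   * @return the value from the file when present, otherwise the metadata value
   */
  static long resolve(Long valueInFile, long fileSequenceNumber) {
    // Prefer the value written into the file; fall back to the file sequence number.
    return valueInFile != null ? valueInFile : fileSequenceNumber;
  }
}
```

Under this reading, a row whose stored value is null would inherit the file sequence number of the column file, while a row with a stored value keeps it.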

Contributor Author

Okay, I have iterated on this a couple of times. I think there is a way to avoid storing sequence_number here. Since this gets complicated, I added a section to the design doc to explain.

@gaborkaszab
Contributor Author

First piece of the column update work: introducing the basic interface of the column update files, aka column files.
cc @anuragmantri @rdblue @pvary @RussellSpitzer @amogh-jahagirdar @anoopj @nastra

Comment on lines +26 to +37
Types.NestedField FIELD_IDS =
    Types.NestedField.required(
        159,
        "field_ids",
        Types.ListType.ofRequired(160, Types.IntegerType.get()),
        "Field IDs this column file contains");
Types.NestedField LOCATION =
    Types.NestedField.required(
        161, "location", Types.StringType.get(), "Location of the column file");
Types.NestedField FILE_SIZE_IN_BYTES =
    Types.NestedField.required(
        162, "file_size_in_bytes", Types.LongType.get(), "Total column file size in bytes");
Contributor

Should we wait for finalizing the spec? I will create a spec PR when it has been finalized and then we can continue this.

Contributor Author

Sure, the conversation on the exact layout could go in parallel. I just sketched in this PR what it would look like in terms of code. Hopefully we can finalize this in the next sync.
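
For readers without the diff open, the three required fields quoted in this thread (`field_ids`, `location`, `file_size_in_bytes`) could map to accessors roughly as follows. This is a dependency-free sketch, not the interface from the PR: the names `ColumnFileSketch` and `SimpleColumnFile` and the accessor names are assumptions, and the real layout is still subject to the spec discussion.

```java
import java.util.List;
import java.util.Objects;

// Hypothetical shape only; the body of the PR's interface is not shown in this thread.
interface ColumnFileSketch {
  // Field IDs this column file contains (schema field 159, list element 160 in the excerpt).
  List<Integer> fieldIds();

  // Location of the column file (schema field 161 in the excerpt).
  String location();

  // Total column file size in bytes (schema field 162 in the excerpt).
  long fileSizeInBytes();
}

// Minimal immutable implementation, assumed for illustration (requires Java 16+ for records).
record SimpleColumnFile(List<Integer> fieldIds, String location, long fileSizeInBytes)
    implements ColumnFileSketch {
  SimpleColumnFile {
    // All three fields are declared as required in the excerpt, so reject nulls.
    Objects.requireNonNull(fieldIds, "fieldIds");
    Objects.requireNonNull(location, "location");
    // Defensive copy so later changes to the caller's list do not leak in.
    fieldIds = List.copyOf(fieldIds);
  }
}
```

A record keeps the sketch short; the sequence number discussed earlier is left out on purpose because the thread has not settled whether it belongs in this schema.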

Comment thread core/src/main/java/org/apache/iceberg/ColumnFileInfo.java Outdated
Comment thread core/src/main/java/org/apache/iceberg/TrackedFileStruct.java
import org.apache.iceberg.types.Types;

/** Information about a column file. */
interface ColumnFileInfo {
Contributor

@anuragmantri anuragmantri May 11, 2026

Can we always assume the column file uses the same file_format as base file? If not, should we explicitly add file_format here?

Contributor Author

This also crossed my mind, but ATM I don't think we want to add that extra complexity and it is somewhat speculative now. We could introduce that later if there is a clear need.

@gaborkaszab gaborkaszab force-pushed the main_column_file_interface branch from ca3259e to e6f7cf6 Compare May 12, 2026 09:35
import org.apache.iceberg.types.Types;

/** Information about a column file. */
interface ColumnFileInfo {
Contributor Author

Maybe simply ColumnFile instead of ColumnFileInfo?

@gaborkaszab gaborkaszab force-pushed the main_column_file_interface branch 2 times, most recently from 5599a15 to 681633b Compare May 13, 2026 12:20
@gaborkaszab gaborkaszab force-pushed the main_column_file_interface branch from 681633b to 813d5c0 Compare May 13, 2026 12:52
Status: In progress

2 participants