60 changes: 60 additions & 0 deletions core/src/main/java/org/apache/iceberg/ColumnFileInfo.java
@@ -0,0 +1,60 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/
package org.apache.iceberg;

import java.util.List;
import org.apache.iceberg.types.Types;

/** Information about a column file. */
interface ColumnFileInfo {
Contributor

@anuragmantri, May 11, 2026

Can we always assume the column file uses the same file_format as base file? If not, should we explicitly add file_format here?

Contributor Author

This also crossed my mind, but at the moment I don't think we want to add that extra complexity, and it is somewhat speculative for now. We could introduce it later if there is a clear need.

Contributor Author

Maybe simply ColumnFile instead of ColumnFileInfo?

  Types.NestedField FIELD_IDS =
      Types.NestedField.required(
          159,
          "field_ids",
          Types.ListType.ofRequired(160, Types.IntegerType.get()),
          "Field IDs this column file contains");
  Types.NestedField LOCATION =
      Types.NestedField.required(
          161, "location", Types.StringType.get(), "Location of the column file");
  Types.NestedField FILE_SIZE_IN_BYTES =
      Types.NestedField.required(
          162, "file_size_in_bytes", Types.LongType.get(), "Total column file size in bytes");
Comment on lines +26 to +37
Contributor

Should we wait until the spec is finalized? I will create a spec PR once it has been finalized, and then we can continue this.

Contributor Author

Sure, the conversation on the exact layout could go on in parallel. I just sketched in this PR what it would look like in terms of code. Hopefully we can finalize this in the next sync.

Types.NestedField SEQUENCE_NUMBER =
Contributor Author

This was meant to be used for equality deletes. Since we agree not to support those together with column updates, I think we can drop seq_nums from this schema.

Contributor Author

Maybe we can use this for last-updated-sequence-number for the rows that we update:

  • The rows that are unchanged can have last-updated-seq-num written into the update file
  • The rows that are changed can have that field as null and derive the last-updated-seq-num from the metadata here

Contributor Author

Took another look: I think we need this field to provide _last_updated_sequence_number. We either write that value into the file or, if it is null there, fall back to this one in the metadata. To emphasize, this is the file sequence number.
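
The fallback described in this comment can be sketched in a few lines. This is a minimal illustration, not code from the PR: the class `LastUpdatedSeqSketch` and its `resolve` method are hypothetical names, and the sketch assumes the per-row value read from the column file is a nullable `Long` while the file sequence number comes from the metadata.

```java
/** Hypothetical helper, not part of the PR: resolves _last_updated_sequence_number for one row. */
final class LastUpdatedSeqSketch {
  private LastUpdatedSeqSketch() {}

  /**
   * @param writtenInFile value stored in the column file for this row, or null if none was written
   * @param fileSequenceNumber file sequence number of the column file, taken from metadata
   */
  static long resolve(Long writtenInFile, long fileSequenceNumber) {
    // Unchanged rows carry their earlier value in the file;
    // changed rows store null and inherit the file sequence number.
    return writtenInFile != null ? writtenInFile : fileSequenceNumber;
  }

  public static void main(String[] args) {
    System.out.println(resolve(7L, 12L)); // value written in the file wins
    System.out.println(resolve(null, 12L)); // null falls back to the metadata value
  }
}
```

The design choice here is that only the metadata needs to know the sequence number of newly written rows, which matters because that number may not be known when the file itself is written.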

Contributor Author

Okay, I have iterated on this a couple of times. I think there is a way to avoid storing sequence_number here. Since this gets complicated, I added a section to the design doc to explain it.

      Types.NestedField.optional(
          163, "sequence_number", Types.LongType.get(), "Sequence number of the column file");

  static Types.StructType schema() {
    return Types.StructType.of(FIELD_IDS, LOCATION, FILE_SIZE_IN_BYTES, SEQUENCE_NUMBER);
  }

  /** Returns the field IDs contained in this column file. */
  List<Integer> fieldIds();

  /** Returns the location of the column file. */
  String location();

  /** Returns the total size of the column file in bytes. */
  long fileSizeInBytes();

  /** Returns the sequence number of the column file, or null if not set. */
  Long sequenceNumber();

  /** Copies this column file info. */
  ColumnFileInfo copy();
}
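
For illustration only, here is one way an immutable implementation of these accessors might look. The PR does not include an implementation, so `ColumnFileInfoSketch` (a local stand-in for the interface, without the schema constants) and `BaseColumnFileInfoSketch` are hypothetical names, and the defensive copying is an assumption about the intended semantics of `copy()`.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Local stand-in mirroring the accessors of the PR's ColumnFileInfo (schema constants omitted). */
interface ColumnFileInfoSketch {
  List<Integer> fieldIds();

  String location();

  long fileSizeInBytes();

  Long sequenceNumber();

  ColumnFileInfoSketch copy();
}

/** Hypothetical value class; the PR itself only defines the interface. */
final class BaseColumnFileInfoSketch implements ColumnFileInfoSketch {
  private final List<Integer> fieldIds;
  private final String location;
  private final long fileSizeInBytes;
  private final Long sequenceNumber;

  BaseColumnFileInfoSketch(
      List<Integer> fieldIds, String location, long fileSizeInBytes, Long sequenceNumber) {
    // Defensive copy so later changes to the caller's list do not leak into this object.
    this.fieldIds = Collections.unmodifiableList(new ArrayList<>(fieldIds));
    this.location = location;
    this.fileSizeInBytes = fileSizeInBytes;
    this.sequenceNumber = sequenceNumber;
  }

  @Override
  public List<Integer> fieldIds() {
    return fieldIds;
  }

  @Override
  public String location() {
    return location;
  }

  @Override
  public long fileSizeInBytes() {
    return fileSizeInBytes;
  }

  @Override
  public Long sequenceNumber() {
    return sequenceNumber;
  }

  @Override
  public ColumnFileInfoSketch copy() {
    // The constructor copies the list again, so the copy shares no mutable state.
    return new BaseColumnFileInfoSketch(fieldIds, location, fileSizeInBytes, sequenceNumber);
  }
}
```

Because every field is final and the list is copied on construction, `copy()` yields an independent object, which is what the stream-based copies in `TrackedFileStruct` appear to rely on.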
12 changes: 11 additions & 1 deletion core/src/main/java/org/apache/iceberg/TrackedFile.java
@@ -91,6 +91,12 @@ interface TrackedFile {
"equality_ids",
Types.ListType.ofRequired(136, Types.IntegerType.get()),
"Field ids used to determine row equality in equality delete files");
Types.NestedField COLUMN_FILES =
Types.NestedField.optional(
157,
"column_files",
Types.ListType.ofRequired(158, ColumnFileInfo.schema()),
"Column update files");

static Types.StructType schemaWithContentStats(
Types.StructType partitionType, Types.StructType contentStatsType) {
@@ -110,7 +116,8 @@ static Types.StructType schemaWithContentStats(
MANIFEST_INFO,
KEY_METADATA,
SPLIT_OFFSETS,
- EQUALITY_IDS);
+ EQUALITY_IDS,
+ COLUMN_FILES);
}

/** Returns the tracking information for this entry. */
@@ -158,6 +165,9 @@ static Types.StructType schemaWithContentStats(
/** Returns the set of field IDs used for equality comparison in equality delete files. */
List<Integer> equalityIds();

/** Returns the column files for this file. */
List<ColumnFileInfo> columnFiles();

/** Copies this tracked file. */
TrackedFile copy();

23 changes: 22 additions & 1 deletion core/src/main/java/org/apache/iceberg/TrackedFileStruct.java
@@ -21,8 +21,10 @@
import java.io.Serializable;
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.iceberg.avro.SupportsIndexProjection;
import org.apache.iceberg.relocated.com.google.common.base.MoreObjects;
import org.apache.iceberg.types.Type;
@@ -65,7 +67,8 @@ public PartitionData copy() {
TrackedFile.MANIFEST_INFO,
TrackedFile.KEY_METADATA,
TrackedFile.SPLIT_OFFSETS,
- TrackedFile.EQUALITY_IDS);
+ TrackedFile.EQUALITY_IDS,
+ TrackedFile.COLUMN_FILES);

private FileContent contentType = null;
private String location = null;
@@ -81,6 +84,7 @@ public PartitionData copy() {
private Integer sortOrderId = null;
private DeletionVector deletionVector = null;
private ManifestInfo manifestInfo = null;
private List<ColumnFileInfo> columnFiles = null;
private byte[] keyMetadata = null;
private long[] splitOffsets = null;
private int[] equalityIds = null;
@@ -155,6 +159,10 @@ private TrackedFileStruct(TrackedFileStruct toCopy, boolean withStats, Set<Integ
toCopy.equalityIds != null
? Arrays.copyOf(toCopy.equalityIds, toCopy.equalityIds.length)
: null;
this.columnFiles =
toCopy.columnFiles != null
? toCopy.columnFiles.stream().map(ColumnFileInfo::copy).collect(Collectors.toList())
: null;
}

@Override
@@ -232,6 +240,11 @@ public List<Integer> equalityIds() {
return equalityIds != null ? ArrayUtil.toUnmodifiableIntList(equalityIds) : null;
}

@Override
public List<ColumnFileInfo> columnFiles() {
return columnFiles != null ? Collections.unmodifiableList(columnFiles) : null;
}
Comment thread
gaborkaszab marked this conversation as resolved.
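
As a reminder of what `Collections.unmodifiableList` does and does not guarantee, the sketch below shows that the returned list rejects writes but remains a live view of the backing list. `UnmodifiableViewSketch` is a hypothetical name used only for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical demo class, not part of the PR. */
final class UnmodifiableViewSketch {
  private UnmodifiableViewSketch() {}

  /** Returns true if the given list refuses an add() call. */
  static boolean rejectsWrites(List<String> list) {
    try {
      list.add("x");
      return false;
    } catch (UnsupportedOperationException e) {
      return true;
    }
  }

  public static void main(String[] args) {
    List<String> backing = new ArrayList<>(List.of("a"));
    List<String> view = Collections.unmodifiableList(backing);

    System.out.println(rejectsWrites(view)); // callers cannot mutate through the view

    backing.add("b"); // the owner can still change the backing list
    System.out.println(view.size()); // and the view reflects that change
  }
}
```

This is why copying the elements when the field is set, as the PR does, complements the read-only view returned by the accessor.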

@Override
public TrackedFile copy() {
return new TrackedFileStruct(this, true, null);
@@ -279,6 +292,8 @@ private Object getByPos(int pos) {
return splitOffsets();
case 14:
return equalityIds();
case 15:
return columnFiles;
default:
throw new UnsupportedOperationException("Unknown field ordinal: " + pos);
}
@@ -333,6 +348,11 @@ protected <T> void internalSet(int pos, T value) {
case 14:
this.equalityIds = ArrayUtil.toIntArray((List<Integer>) value);
break;
case 15:
this.columnFiles =
((List<ColumnFileInfo>) value)
.stream().map(ColumnFileInfo::copy).collect(Collectors.toList());
break;
default:
// ignore the object, it must be from a newer version of the format
}
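
`column_files` is declared optional, so it seems possible for `internalSet` to receive a null value at position 15, in which case the stream call above would throw a `NullPointerException`. Whether that path is reachable here is an assumption. A null-tolerant copy could be sketched as follows; `NullSafeCopySketch` and `copyAll` are hypothetical names, not part of the PR or of Iceberg.

```java
import java.util.List;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

/** Hypothetical helper, not part of the PR: copies each element of a possibly-null list. */
final class NullSafeCopySketch {
  private NullSafeCopySketch() {}

  static <T> List<T> copyAll(List<T> values, UnaryOperator<T> copier) {
    if (values == null) {
      // An absent optional field stays null instead of failing on stream().
      return null;
    }

    return values.stream().map(copier).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(NullSafeCopySketch.<String>copyAll(null, s -> s));
    System.out.println(copyAll(List.of("a", "b"), s -> s + "!"));
  }
}
```

The same helper would also cover the copy constructor, which currently guards against null with an inline ternary.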
@@ -356,6 +376,7 @@ public String toString() {
.add("key_metadata", keyMetadata == null ? "null" : "(redacted)")
.add("split_offsets", splitOffsets == null ? "null" : splitOffsets())
.add("equality_ids", equalityIds == null ? "null" : equalityIds())
.add("column_files", columnFiles == null ? "null" : columnFiles)
.toString();
}
}
6 changes: 4 additions & 2 deletions core/src/test/java/org/apache/iceberg/TestTrackedFile.java
@@ -59,7 +59,8 @@ public void schemaWithContentStatsFieldOrder() {
"manifest_info",
"key_metadata",
"split_offsets",
- "equality_ids");
+ "equality_ids",
+ "column_files");
}

@Test
@@ -69,7 +70,8 @@ public void schemaWithContentStatsFieldIds() {

assertThat(fields)
.extracting(Types.NestedField::fieldId)
- .containsExactly(147, 134, 100, 101, 103, 104, 141, 102, 146, 140, 148, 150, 131, 132, 135);
+ .containsExactly(
+ 147, 134, 100, 101, 103, 104, 141, 102, 146, 140, 148, 150, 131, 132, 135, 157);
}

@Test
@@ -215,7 +215,7 @@ void testCopyIsDeep() {
@Test
void testStructLikeSize() {
TrackedFileStruct file = new TrackedFileStruct();
- assertThat(file.size()).isEqualTo(15);
+ assertThat(file.size()).isEqualTo(16);
}

@Test