Support retrieving clustering information of Delta Lake tables. #27052
yangshangqing95 wants to merge 1 commit into
Conversation
Reviewer's Guide
This PR adds support for retrieving clustering information from Delta Lake tables behind a new feature flag. It introduces a session/catalog property to toggle visibility, extends the metadata layer and table handles to carry an optional list of clustered columns, implements logic to parse clustering info from the transaction log (via a new utility and an Operation enum), and updates tests and documentation accordingly.

Sequence diagram for retrieving clustered columns from Delta Lake table
sequenceDiagram
participant Session
participant DeltaLakeMetadata
participant TransactionLogAccess
participant TableSnapshot
participant ClusteringMetadataUtil
Session->>DeltaLakeMetadata: getTableHandle(session, ...)
DeltaLakeMetadata->>TransactionLogAccess: getClusteredColumns(fileSystem, tableSnapshot)
TransactionLogAccess->>TableSnapshot: getCachedClusteredColumns()
alt Not cached
TransactionLogAccess->>ClusteringMetadataUtil: getLatestClusteredColumns(fileSystem, tableSnapshot)
ClusteringMetadataUtil-->>TransactionLogAccess: clusteredColumns
TransactionLogAccess->>TableSnapshot: setCachedClusteredColumns(clusteredColumns)
end
TransactionLogAccess-->>DeltaLakeMetadata: clusteredColumns
DeltaLakeMetadata-->>Session: LocatedTableHandle(clusteredColumns)
ER diagram for Delta Lake table properties with clustering info
erDiagram
DELTA_LAKE_TABLE_PROPERTIES {
string location
list partitioned_by
list clustered_by
long checkpoint_interval
string change_data_feed_enabled
string column_mapping_mode
}
DELTA_LAKE_TABLE_HANDLE {
string location
object metadata_entry
object protocol_entry
list clustered_columns
}
DELTA_LAKE_TABLE_PROPERTIES ||--|| DELTA_LAKE_TABLE_HANDLE : "table handle for properties"
Class diagram for Delta Lake clustering metadata support
classDiagram
class DeltaLakeConfig {
- boolean showClusteredColumns
+ boolean isShowClusteredColumns()
+ DeltaLakeConfig setShowClusteredColumns(boolean)
}
class DeltaLakeSessionProperties {
+ static boolean ifShowClusteredColumns(ConnectorSession)
}
class DeltaLakeTableProperties {
+ static final String CLUSTER_BY_PROPERTY
+ static List<String> getClusteredBy(Map<String, Object>)
}
class DeltaLakeTableHandle {
- Optional<List<String>> clusteredColumns
+ Optional<List<String>> getClusteredColumns()
}
class TableSnapshot {
- Optional<List<String>> cachedClusteredColumns
+ Optional<List<String>> getCachedClusteredColumns()
+ void setCachedClusteredColumns(Optional<List<String>>)
}
class TransactionLogAccess {
+ Optional<List<String>> getClusteredColumns(TrinoFileSystem, TableSnapshot)
}
class ClusteringMetadataUtil {
+ static Optional<List<String>> getLatestClusteredColumns(TrinoFileSystem, TableSnapshot)
}
class Operation {
<<enum>>
+ static Operation fromString(String)
}
DeltaLakeConfig --> DeltaLakeSessionProperties
DeltaLakeSessionProperties --> DeltaLakeTableProperties
DeltaLakeTableProperties --> DeltaLakeTableHandle
DeltaLakeTableHandle --> TableSnapshot
TableSnapshot --> TransactionLogAccess
TransactionLogAccess --> ClusteringMetadataUtil
ClusteringMetadataUtil --> Operation
File-Level Changes
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/clustering/ClusteringMetadataUtil.java:94` </location>
<code_context>
+ REPLACE_TABLE_KEYWORD, CLUSTERING_PARAMETER_KEY,
+ CLUSTER_BY, NEW_CLUSTERING_PARAMETER_KEY);
+
+ private static final ThreadLocal<Map<String, String>> OLD_TO_NEW_RENAMED_COLUMNS = ThreadLocal.withInitial(HashMap::new);
+
+ private ClusteringMetadataUtil()
</code_context>
<issue_to_address>
**issue (bug_risk):** ThreadLocal usage for OLD_TO_NEW_RENAMED_COLUMNS may leak memory if not cleared in all code paths.
If an exception occurs before ThreadLocal removal, it may not be cleared. Use a try-finally block to guarantee cleanup and prevent memory leaks.
</issue_to_address>
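
A standalone sketch of the cleanup pattern the comment asks for, assuming a thread-local map like the PR's OLD_TO_NEW_RENAMED_COLUMNS; the class and method names here are illustrative, not the actual ones in ClusteringMetadataUtil:

```java
import java.util.HashMap;
import java.util.Map;

final class RenamedColumnsTracker
{
    // thread-local scratch map, mirroring OLD_TO_NEW_RENAMED_COLUMNS in the PR
    private static final ThreadLocal<Map<String, String>> OLD_TO_NEW_RENAMED_COLUMNS =
            ThreadLocal.withInitial(HashMap::new);

    private RenamedColumnsTracker() {}

    static Map<String, String> currentRenames()
    {
        return OLD_TO_NEW_RENAMED_COLUMNS.get();
    }

    static void withRenameTracking(Runnable parseTransactionLog)
    {
        try {
            // the caller populates the thread-local map while walking commit entries
            parseTransactionLog.run();
        }
        finally {
            // always clear the thread-local, even if parsing throws, so pooled
            // worker threads do not retain stale rename mappings
            OLD_TO_NEW_RENAMED_COLUMNS.remove();
        }
    }
}
```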
### Comment 2
<location> `plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/clustering/ClusteringMetadataUtil.java:249` </location>
<code_context>
+ }
+
+ @VisibleForTesting
+ static void recordRenamedColumns(CommitInfoEntry commitInfoEntry)
+ {
+ String oldName = commitInfoEntry.operationParameters().get(RENAMED_OLD_COLUMN_KEY);
</code_context>
<issue_to_address>
**suggestion:** The logic for updating OLD_TO_NEW_RENAMED_COLUMNS may not handle multiple renames correctly.
The current approach may lose information if a column is renamed multiple times. Please consider tracking all previous names to ensure the mapping remains accurate.
</issue_to_address>
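
A small self-contained sketch of one way to keep chained renames consistent, in the spirit of the suggestion above (a -> b followed by b -> c should still resolve the original name to c); the class and method names are hypothetical and this is not the PR's implementation:

```java
import java.util.HashMap;
import java.util.Map;

final class RenameChainExample
{
    private RenameChainExample() {}

    // Record a rename and rewrite earlier mappings so chained renames
    // (a -> b, then b -> c) resolve every historical name to the current one
    static void recordRename(Map<String, String> oldToNew, String oldName, String newName)
    {
        for (Map.Entry<String, String> entry : oldToNew.entrySet()) {
            if (entry.getValue().equals(oldName)) {
                entry.setValue(newName);
            }
        }
        oldToNew.put(oldName, newName);
    }

    public static void main(String[] args)
    {
        Map<String, String> mapping = new HashMap<>();
        recordRename(mapping, "a", "b");
        recordRename(mapping, "b", "c");
        System.out.println(mapping); // prints {a=c, b=c}
    }
}
```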
### Comment 3
<location> `plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/clustering/Operation.java:96` </location>
<code_context>
+ CREATE_TABLE_KEYWORD.getOperationName().toLowerCase(), CREATE_TABLE_KEYWORD,
+ REPLACE_TABLE_KEYWORD.getOperationName().toLowerCase(), REPLACE_TABLE_KEYWORD);
+
+ public static Operation fromString(String operationName)
+ {
+ Operation operation = LOWERCASE_NAME_TO_OPERATION.get(operationName.toLowerCase());
</code_context>
<issue_to_address>
**suggestion:** fromString may return UNKNOW_OPERATION for valid but differently-cased or formatted operation names.
Currently, only exact matches are supported, so inputs with extra whitespace or formatting may not be recognized. Consider trimming whitespace or using regex to improve matching robustness.
</issue_to_address>
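
A minimal sketch of the suggested normalization before the lookup; the constant names and operation strings below are assumptions for illustration and differ from the enum in the PR:

```java
import java.util.Locale;
import java.util.Map;

enum Operation
{
    CREATE_TABLE("CREATE TABLE"),
    REPLACE_TABLE("REPLACE TABLE"),
    UNKNOWN_OPERATION("UNKNOWN");

    private static final Map<String, Operation> LOWERCASE_NAME_TO_OPERATION = Map.of(
            CREATE_TABLE.operationName.toLowerCase(Locale.ENGLISH), CREATE_TABLE,
            REPLACE_TABLE.operationName.toLowerCase(Locale.ENGLISH), REPLACE_TABLE);

    private final String operationName;

    Operation(String operationName)
    {
        this.operationName = operationName;
    }

    static Operation fromString(String operationName)
    {
        // Trim and collapse internal whitespace before the case-insensitive lookup,
        // so values such as " Create  Table " still resolve to CREATE_TABLE
        String normalized = operationName.trim()
                .replaceAll("\\s+", " ")
                .toLowerCase(Locale.ENGLISH);
        return LOWERCASE_NAME_TO_OPERATION.getOrDefault(normalized, UNKNOWN_OPERATION);
    }
}
```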
### Comment 4
<location> `plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeBasic.java:2735-2736` </location>
<code_context>
assertQuery("SELECT * FROM " + sourceTable, sourceTableValues);
}
+ @Test
+ void testShowCreateTableWithClusteredInfo()
+ {
+ Session session = Session.builder(getSession())
</code_context>
<issue_to_address>
**suggestion (testing):** Test for clustered columns in SHOW CREATE TABLE covers both enabled and disabled states.
Please also add a test case for the default configuration (without setting the session property) to confirm expected default behavior.
</issue_to_address>
9163402 to 660241a
Is this PR preparatory work for future performance improvements, with no immediate benefit?
Hi @ebyhr |
Could you please share the final solution (= how to improve performance eventually)?
Sure @ebyhr
In simple terms, compared with partitions, Liquid Clustering provides:
Compared with Parquet column statistics, which only store per-file min/max values without any global organization, Liquid Clustering provides a higher-level layout strategy. It ensures that values of clustering keys are physically localized across files, making Parquet’s per-file statistics far more effective for data skipping and reducing the number of files scanned. Once Trino integrates with this metadata, it will be able to perform more accurate file pruning and data skipping, enabling more flexible data organization and clustering, thereby significantly improving the performance of queries filtering on clustering keys.
Thanks for explaining the details. I already know about liquid clustering. I wanted to know the actual follow-up plan (especially the read part) for the Delta Lake connector.
The connector already uses stats from the transaction logs. Are you planning to read different metadata in the future? If so, could you elaborate on that?
@yangshangqing95 Would you mind sharing the plan for supporting the write path of liquid clustering? I suspect the read path won’t change much, since Trino already supports pruning with statistics, so reading clustering information doesn’t seem to provide additional benefit, in my opinion.
Haha, please ignore my long-winded message above.
Coming back to this PR itself: it is mostly about test data and test code, and I hope it covers enough cases. My idea is to break down the larger system into smaller modules or features, which makes it easier to review and identify issues during testing. Open to any discussions or suggestions.
660241a to f0d770b
Hi @chenjian2664, any advice on this? Looking forward to proceeding ~
@yangshangqing95 I thought liquid clustering mainly affects the write side; the read part shouldn't need too many changes if we already support pruning scan files using field stats. Please help educate me: does liquid clustering write a different form of stats compared to non-clustered tables (for both primitive and struct columns)? If we already support reading complex type stats, do we actually need the clustering info (since we don't write it)? Why would that be needed?
Hi @chenjian2664, good questions. What I mentioned earlier was too general, so let me share some ideas and make it clearer with a simple example.
I can set the cluster key for this table in Databricks by using
Then, during a query, when we use person.age as a filter condition, Databricks can leverage the clustered column statistics it maintains internally to perform pruning. But when the same query is executed in Trino, things are different: because the subfield cannot be pushed down to the table-scan stage, the statistics of clustered fields cannot be used for pruning, and the result is a full table scan. This is also an issue I have encountered in practice. For the same query, Trino takes significantly longer to execute, and this performance degradation becomes more pronounced as the data volume grows. Of course, in practice, the actual table structure and data volume are much more complex than in the example above. Now back to your questions:
No, these stats are still stored in the AddFile entry, for both primitive and struct columns.
We support reading complex-type stats, but we are not actually using them yet, since we don't know whether a subfield is a clustered key or whether its stats have been recorded. My thought is that we can extract the clustered key separately for filtering, elevating it to the level of a partition, which then allows us to make full use of all statistical information for more thorough pruning. The first step is to identify which field/column is the clustered key.
{
public static final String LOCATION_PROPERTY = "location";
public static final String PARTITIONED_BY_PROPERTY = "partitioned_by";
public static final String CLUSTER_BY_PROPERTY = "clustered_by";
We shouldn't expose this property until we support creating a table with this property. Otherwise, the SHOW CREATE TABLE result becomes non-reusable (an exception would happen). The current implementation ignores the property in CREATE TABLE - this is also a no-go.
Makes sense, removed.
.collect(toImmutableList());
}
else {
LOG.error(String.format("Unknown clustering key: %s", clusteredKey));
Avoid using String.format in logger.
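
A minimal illustration of the suggested change, assuming the io.airlift.log.Logger that Trino plugins typically use, which accepts a format string plus arguments; the class and method names below are hypothetical:

```java
import io.airlift.log.Logger;

final class ClusteringLogExample
{
    private static final Logger LOG = Logger.get(ClusteringLogExample.class);

    private ClusteringLogExample() {}

    static void reportUnknownClusteringKey(String clusteredKey)
    {
        // Pass the format string and argument directly; the logger formats the
        // message itself instead of eagerly building it with String.format
        LOG.error("Unknown clustering key: %s", clusteredKey);
    }
}
```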
public static final String VARIANT_TYPE_FEATURE_NAME = "variantType";
public static final String VARIANT_TYPE_PREVIEW_FEATURE_NAME = "variantType-preview";
public static final String V2_CHECKPOINT_FEATURE_NAME = "v2Checkpoint";
public static final String CLUSTERED_TABLES_FEATURE_NAME = "clustering";
Move this constant under CHECK_CONSTRAINTS_FEATURE_NAME to keep the list ordered alphabetically.
public static int getTemporalTimeTravelLinearSearchMaxSize()
{
    return TEMPORAL_TIME_TRAVEL_LINEAR_SEARCH_MAX_SIZE;
}
Make TEMPORAL_TIME_TRAVEL_LINEAR_SEARCH_MAX_SIZE public, and remove this method.
import static org.assertj.core.api.AssertionsForClassTypes.assertThat;

public class OperationTest
assertThat(result.isPresent()).isTrue();
assertThat(ImmutableList.of()).isEqualTo(result.get());
AssertJ provides a contains method for optional types:
assertThat(result).contains(ImmutableList.of());
Also, the order of parameters is wrong: assertThat() should take the actual value, not the expected one.
Optional<CommitInfoEntry> commitInfo = Optional.of(commitInfoEntry);
List<String> result = ClusteringMetadataUtil.extractClusteredColumns(commitInfo);

assertThat(result.isEmpty()).isTrue();

assertThat(2).isEqualTo(result.size());
assertThat(result.containsAll(ImmutableList.of("col1", "col2"))).isTrue();
Please use helper methods in AssertJ as much as possible:
assertThat(result).containsExactly("col1", "col2");

@@ -0,0 +1 @@
| {"commitInfo":{"timestamp":1760712838006,"userId":"user1","userName":"user1","operation":"OPTIMIZE","operationParameters":{"clusterBy":"[]","zOrderBy":"[]","batchId":"0","predicate":"[]","auto":true},"notebook":{"notebookId":"xxxxxx"},"clusterId":"cluster","readVersion":9,"isolationLevel":"SnapshotIsolation","isBlindAppend":false,"operationMetrics":{"numRemovedFiles":"2","numRemovedBytes":"5398","p25FileSize":"3647","numDeletionVectorsRemoved":"1","minFileSize":"3647","p75FileSize":"3647","p50FileSize":"3647","numAddedBytes":"3647","numAddedFiles":"1","maxFileSize":"3647"},"tags":{"delta.rowTracking.preserved":"true"},"engineInfo":"Databricks-Runtime/17.2.x-photon-scala2.13","txnId":"xxxxxx"}} | |||
Please add a README.md on how the data was created - see the other resource directories in the module as a point of reference.
5a7a547 to 199ad1c
In that example above, does the Parquet file have min/max statistics on the struct subfields?
Hi @raunaqmorarka I don’t think there’s any relationship between them.
Hi @ebyhr @findinpath, I've fixed/updated all the above comments. Any new thoughts on this? I’m very much looking forward to starting the next phase of work. Thank you all!
This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack.
Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time. |
795de15 to f5c8230
f3ae6e2 to f5c8230
f5c8230 to d2911cf
This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack. |






Description
Support retrieving clustering information from Delta Lake tables, controlled by session and configuration settings.
By default, retrieving clustering information is disabled (false).
This serves as a foundation for future integration with the Delta Lake Liquid Clustering feature.
The work to fully support Delta Lake’s Liquid Clustering capability is already planned and in progress.
Additional context and related issues
About Delta Lake Liquid Clustering: https://delta.io/blog/liquid-clustering/
Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:
Summary by Sourcery
Enable optional retrieval of clustering information for Delta Lake tables and expose it as a new table property to support future Liquid Clustering integration.
New Features:
Enhancements:
Documentation:
Tests: