Read metadata and protocol information from Delta checksum files#28381
Conversation
Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

I emailed my signed CLA to cla@trino.io moments ago
Force-pushed 08bb9c5 to fd9787c
Force-pushed fd9787c to de2bcd7
Pull request overview
This pull request adds support for reading Delta table metadata and protocol information from checksum files (.crc files) when available, providing a significant performance optimization for tables with large v1 checkpoints. The feature is controlled by a new configuration property delta.load_metadata_from_checksum_file (defaulting to true) and corresponding session property load_metadata_from_checksum_file.
Changes:
- Added support for reading metadata and protocol information from Delta checksum files, falling back gracefully to transaction log scanning when checksum files are unavailable or incomplete
- Introduced configuration and session properties to control the new checksum file loading behavior
- Enhanced test coverage with comprehensive unit and integration tests for checksum file parsing, fallback behavior, and error handling
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| DeltaLakeConfig.java | Added new configuration property load_metadata_from_checksum_file (defaults to true) |
| DeltaLakeSessionProperties.java | Added corresponding session property for checksum metadata loading |
| DeltaLakeVersionChecksum.java | New class representing the structure of Delta checksum files with metadata and protocol entries |
| TransactionLogParser.java | Added methods getLatestCommitVersion and readVersionChecksumFile to support checksum file operations |
| DeltaLakeMetadata.java | Refactored getTableHandle to attempt loading from checksum files first before falling back to transaction log |
| DeltaLakeTableMetadataScheduler.java | Refactored isSameTransactionVersion method to accept version directly, supporting both snapshot and version checks |
| TestTransactionLogParser.java | Added comprehensive tests for checksum file reading and parsing edge cases |
| TestDeltaLakeMetadata.java | Added integration tests for checksum loading, fallback behavior, and error handling scenarios |
| TestDeltaLakeConfig.java | Updated test to validate default value of new configuration property |
| TestDeltaLakeFileOperations.java | Updated file operation tracking to account for checksum file reads |
| TestDeltaLakeBasic.java | Updated error message assertions to accommodate new error messages from checksum loading path |
| TestDeltaLakeAlluxio*.java | Updated Alluxio cache operation tests to include checksum file interactions |
| TestDeltaLakeActiveFilesCache.java | Updated to disable checksum loading for reproducing specific cache staleness issues |
Just for clarity/posterity -- I force-pushed this branch a couple of times with additional changes to address test failures, to avoid thrashing the commit history and since there had been no ongoing review. Now that reviewers are engaged, I'll put subsequent fixes in their own commits!
Based on Copilot's feedback, I switched the configuration property from snake_case to kebab-case. I have updated the PR description to reflect this change, but have not yet updated the original commit, to avoid thrashing the history. The commit message must be updated prior to merge.
findinpath
left a comment
Great observation @adam-richardson-openai
Looking forward to you addressing the comments
Force-pushed 5e4421e to aab2cf1
I substantially reworked the new tests in aab2cf1. Summary:
@coderabbitai review

✅ Actions performed: Review triggered.
@Test
void testListFilesStartingFromHierarchicalLocationNormalization()
The test also succeeds without the production code changes in EmulatedListFilesStartingFromIterator.java
I would have assumed that it was supposed to fail.
Verified by reverting the iterator change locally: the test fails with IllegalStateException: Expected listed file to start with directory path 'dir//sub/_delta_log/': abfs://container@account.dfs.core.windows.net/dir/sub/_delta_log/00000000000000000000.json.
/test-with-secrets sha=a3512a0497e6f4e2745d80e36eba30a7d8ba3727

/test-with-secrets sha=a3512a0497e94828bb318e6880b1d0ab34ef5c3b

The CI workflow run with tests that require additional secrets has been started: https://github.com/trinodb/trino/actions/runs/25065483517

/test-with-secrets sha=15a408446e015f4394da0776d8f6517c5904143f

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/25074846482

https://github.com/trinodb/trino/actions/runs/25074846482/job/73464777568 is this related to your changes?

Should be unrelated, I'll trigger a re-run to confirm

/test-with-secrets sha=15a408446e015f4394da0776d8f6517c5904143f

The CI workflow run with tests that require additional secrets has been started: https://github.com/trinodb/trino/actions/runs/25130600688

/test-with-secrets sha=b23837e2232f5e815bdd30b113508049b53b15af

The CI workflow run with tests that require additional secrets has been started: https://github.com/trinodb/trino/actions/runs/25153395021
findinpath
left a comment
Great effort @adam-richardson-openai & @raunaqmorarka 🎉
Thank you for providing Trino with this exciting new capability. ❤️
The iterator's strict `entryPath.startsWith(locationPath)` invariant breaks when the underlying file system canonicalizes runs of slashes: listing a directory location ending in `//` returns entries with a single slash and the check fires with `IllegalStateException`. ADLS Gen2 (hierarchical), Java NIO's `LocalFileSystem`, and `AlluxioFileSystem` canonicalize; Hadoop's `HdfsFileSystem` and `S3FileSystem` preserve `//` as a distinct path component. Try the original prefix first (preserves blob-store keys with literal `//` components), fall back to the slash-collapsed form, and compute `entryTail` from whichever matched. Surfaced by `TestDeltaLakeAdlsStorage.testQuery`.
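A minimal sketch of the tolerant matching described above. The class and method names here are illustrative, not the actual Trino code: the original prefix is tried first (preserving blob-store keys with literal `//` components), then the slash-collapsed form, and the entry tail is computed from whichever matched.

```java
// Illustrative sketch of prefix matching that tolerates file systems
// canonicalizing runs of slashes (e.g. ADLS Gen2 hierarchical namespace).
public class PrefixMatch {
    // Collapse runs of slashes into one, e.g. "dir//sub/" -> "dir/sub/".
    static String collapseSlashes(String path) {
        return path.replaceAll("/{2,}", "/");
    }

    // Returns the listed entry's path relative to the directory prefix.
    // Tries the original prefix first (a blob store may treat "//" as a
    // real key component), then the slash-collapsed form; null if neither
    // matches, in which case the caller raises IllegalStateException.
    static String tailAfterPrefix(String entryPath, String locationPath) {
        if (entryPath.startsWith(locationPath)) {
            return entryPath.substring(locationPath.length());
        }
        String collapsed = collapseSlashes(locationPath);
        if (entryPath.startsWith(collapsed)) {
            return entryPath.substring(collapsed.length());
        }
        return null;
    }

    public static void main(String[] args) {
        // Canonicalized listing: the location ends in "//" but entries
        // come back with a single slash, as in the ADLS failure above.
        System.out.println(tailAfterPrefix(
                "dir/sub/_delta_log/00000000000000000000.json",
                "dir//sub/_delta_log/")); // 00000000000000000000.json
    }
}
```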
Compliant Delta writers may emit optional checksum files alongside commits containing metadata and protocol information. Instead of loading the latest checkpoint and replaying intervening commits (which can be expensive, especially for large v1 checkpoints), Trino can read the latest commit's checksum file to obtain this information with a single listing and a small JSON read. Ref. https://github.com/delta-io/delta/blob/master/PROTOCOL.md#version-checksum-file

If the checksum file is missing or does not contain both metadata and protocol, we fall back to the existing Delta log scanning approach.

Behavior is gated by session property load_metadata_from_checksum_file (defaulting to config delta.load_metadata_from_checksum_file, which defaults to true). Internal testing reduced analysis time for large v1-checkpoint tables from ~10s to <500ms.

Within a transaction, the resolved commit version and _last_checkpoint contents are reused across loadDescriptor and getSnapshot calls so the descriptor and snapshot paths don't each re-read _last_checkpoint.

Co-authored-by: Eric Hwang <eh@openai.com>
Co-authored-by: Fred Liu <fredliu@openai.com>
The checksum fast path in getTableHandle bypasses the TableSnapshot cache and therefore re-parses the .crc file on every query for an unchanged table. Add a cross-query cache on TransactionLogAccess keyed by (schema.table, location, version), populated by the checksum loader, so repeated queries reuse the parsed metadata and protocol.

Cache Optional<DeltaLakeTableDescriptor> so a missing or malformed checksum is remembered too; subsequent calls fall through to the transaction-log path without re-reading the .crc.

The cache is bounded to 1000 entries (descriptors are small) and invalidated alongside tableSnapshots in flushCache and invalidateCache.
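The caching scheme in that commit message can be sketched as follows. This is not the Trino implementation (which lives on `TransactionLogAccess` and uses its own cache machinery); the `Key` and `Descriptor` types here are hypothetical stand-ins, and a bounded access-order `LinkedHashMap` stands in for the real cache. The key point shown is that `Optional.empty()` is cached too, so a missing or malformed checksum does not trigger a `.crc` re-read on the next query.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Sketch of a bounded cross-query descriptor cache keyed by
// (schema.table, location, version); types are illustrative.
public class DescriptorCache {
    public record Key(String schemaTableName, String location, long version) {}
    public record Descriptor(String metadataJson, String protocolJson) {}

    private static final int MAX_ENTRIES = 1000;

    // Access-order LinkedHashMap: evicts the least-recently-used entry
    // once the bound is exceeded (descriptors are small, so 1000 is cheap).
    private final Map<Key, Optional<Descriptor>> cache =
            new LinkedHashMap<>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Key, Optional<Descriptor>> eldest) {
                    return size() > MAX_ENTRIES;
                }
            };

    // Caches Optional.empty() as well, so a missing/malformed checksum is
    // remembered and later calls go straight to the transaction-log path.
    public synchronized Optional<Descriptor> get(Key key, Function<Key, Optional<Descriptor>> loader) {
        return cache.computeIfAbsent(key, loader);
    }

    // Mirrors invalidation alongside tableSnapshots in flushCache.
    public synchronized void invalidateAll() {
        cache.clear();
    }

    public static void main(String[] args) {
        DescriptorCache descriptors = new DescriptorCache();
        Key key = new Key("analytics.events", "s3://bucket/events", 42);
        int[] loads = {0};
        descriptors.get(key, k -> { loads[0]++; return Optional.empty(); });
        descriptors.get(key, k -> { loads[0]++; return Optional.empty(); });
        System.out.println(loads[0]); // 1 -- the second call hit the cache
    }
}
```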
First commit LGTM

This is an exciting improvement! Thanks @adam-richardson-openai
Description
Read metadata and protocol information from Delta checksum files, when configured and where available
Compliant writers of Delta tables may optionally write "checksum" files alongside each commit. These checksum files contain a variety of (optional) useful information, including the Delta table metadata and protocol information. See https://github.com/delta-io/delta/blob/488c916931ca9d210f4cadd2d5520e0274d26b04/PROTOCOL.md#version-checksum-file for the full checksum file spec.

Trino needs to load the table metadata and protocol information at planning time. Today, this is done by identifying and loading the latest table checkpoint, as well as replaying all intervening commits up to the latest. This can be extremely slow and expensive, as checkpoints can be enormous and there may be many intervening commits.

Instead, we can simply determine the latest commit in the table, load the corresponding checksum file (if it exists), and parse the metadata and protocol information (if available in the checksum file). This takes only a single listing operation and a single load of a small JSON file, as opposed to potentially loading many files in the Delta-log-based approach (some of which may be extremely large depending on the size and configuration of the table).
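A minimal sketch of the fast-path decision, under two assumptions worth flagging: Delta log entries (including `.crc` files) use 20-digit zero-padded version numbers, and the spec's checksum fields are named `metadata` and `protocol`. The `contains()` check stands in for real JSON parsing; the class and method names are illustrative, not Trino's.

```java
import java.util.Optional;

// Illustrative sketch: derive the checksum file name for a commit version
// and decide whether the checksum fast path applies.
public class ChecksumFastPath {
    // Delta log file names use 20-digit zero-padded versions,
    // e.g. version 7 -> 00000000000000000007.crc
    static String checksumFileName(long version) {
        return String.format("%020d.crc", version);
    }

    // The fast path applies only when the checksum JSON carries BOTH the
    // metadata and protocol entries; every field is optional per the spec,
    // so either may be absent. (contains() stands in for JSON parsing.)
    static boolean canUseChecksum(Optional<String> checksumJson) {
        return checksumJson
                .map(json -> json.contains("\"metadata\"") && json.contains("\"protocol\""))
                .orElse(false);
    }

    public static void main(String[] args) {
        System.out.println(checksumFileName(7)); // 00000000000000000007.crc
        // Complete checksum: use it. Partial checksum: fall back to the log.
        System.out.println(canUseChecksum(Optional.of("{\"metadata\":{},\"protocol\":{}}"))); // true
        System.out.println(canUseChecksum(Optional.of("{\"tableSizeBytes\":123}"))); // false
    }
}
```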
The listing starts from `_last_checkpoint` (when present) using the `listFilesStartingFrom` filesystem primitive, so the scan only covers commits since the last checkpoint rather than the entire `_delta_log`.

If there is no checksum file for the latest eligible commit in the table, or if the checksum file doesn't capture both the metadata and the protocol information for the table, we fall back to the existing approach of scanning the Delta log. (Checksum files are considered optional under the Delta spec, as are all fields therein.)
This new behavior is gated behind a session property, `load_metadata_from_checksum_file`, which in turn defaults to the value of the `delta.load-metadata-from-checksum-file` configuration. The config value itself defaults to true, since we expect this change to be a straightforward performance optimization in the overwhelming majority of cases.

Repeated queries against the same table version reuse the parsed descriptor via a cross-query cache on `TransactionLogAccess` keyed by `(schema.table, location, version)`, so the `.crc` parse is skipped on subsequent calls. The cache is bounded to 1000 entries since each parsed descriptor is small. This matches the hot-cache behavior of the transaction-log path, which already benefits from the `TableSnapshot` cache.

This optimization is particularly effective for tables using the v1 checkpoint spec, since v1 checkpoint files may be very large and heavy.
We drove internal performance testing using queries like

where `<table>` is a large table using the v1 checkpoint spec. We observed that time spent in analysis fell from 10s on average to well under 500ms.

Additional context and related issues
Builds on #28549, which added the `listFilesStartingFrom` primitive to `TrinoFileSystem` and is now in master. That primitive lets the checksum lookup scan only the tail of `_delta_log` after the most recent checkpoint, which is what makes this optimization worthwhile on tables with large logs.

The prep commit `Tolerate path normalization in EmulatedListFilesStartingFromIterator` was added after observing the following failure when running the Databricks-credentialed ADLS test suite (`TestDeltaLakeAdlsStorage.testQuery`):

Release notes
( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text: