Add support for data skipping using column stats #25601

Closed
codope wants to merge 4 commits into trinodb:master from codope:hudi-colstats-skip

@codope
Contributor

@codope codope commented Apr 16, 2025

Description

This PR is stacked on #25599

  • Use column stats stored in Hudi's metadata table to skip files (implemented in HudiFileSkippingManager).
  • Update HudiSplitSource and related classes to support the above.
  • Add tests that verify both partition pruning and stats-based file skipping.
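
To illustrate the general idea behind the first bullet, here is a minimal sketch of min/max-based file skipping. It is not the PR's `HudiFileSkippingManager`: the class, the `FileColumnRange` record, and the `filesToRead` method are hypothetical names, and the sketch assumes a single `long` column with one min/max pair recorded per data file.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical, simplified model; none of these names come from the PR.
final class ColumnStatsSkippingSketch
{
    // Min and max of one column within one data file (assumed shape of a stats entry).
    record FileColumnRange(String fileName, long min, long max) {}

    private ColumnStatsSkippingSketch() {}

    // Keep only files whose [min, max] range overlaps the predicate range [low, high].
    // A file whose range lies entirely outside the predicate cannot contain matching rows.
    static List<String> filesToRead(List<FileColumnRange> stats, long low, long high)
    {
        return stats.stream()
                .filter(range -> range.max() >= low && range.min() <= high)
                .map(FileColumnRange::fileName)
                .collect(Collectors.toList());
    }
}
```

For a predicate such as `col BETWEEN 150 AND 220`, a file whose recorded range is `[0, 99]` would be dropped from the list, while files with ranges `[100, 199]` and `[200, 299]` would still be read.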

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Hudi
* Support data skipping using statistics in Hudi metadata table.

@cla-bot cla-bot Bot added the cla-signed label Apr 16, 2025
@github-actions github-actions Bot added hudi Hudi connector hive Hive connector labels Apr 16, 2025
@codope codope requested review from ebyhr and yihua April 16, 2025 16:34
Member

@ebyhr ebyhr left a comment

Please fix CI failures.

@ksoullpwk

I haven’t gone through the entire diff yet, but I have a question about the hudi.metadata-enabled setting. If it's enabled, and I run a query that joins two Hudi tables—one with column stats enabled during inserts and the other without—will the query fail?

@codope
Contributor Author

codope commented Apr 23, 2025

> I haven’t gone through the entire diff yet, but I have a question about the hudi.metadata-enabled setting. If it's enabled, and I run a query that joins two Hudi tables—one with column stats enabled during inserts and the other without—will the query fail?

@ksoullpwk The query won't fail. It will still return correct results; the planner just won't skip any data for the table without stats.
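
The behavior described above can be pictured with a small sketch. This is a hypothetical illustration and not code from the PR: `MissingStatsFallbackSketch`, `Range`, and `filesToRead` are made-up names, and the stats are modeled as a plain map from file name to a min/max pair. The point is only that a file with no recorded stats cannot be ruled out, so it is always kept.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration; none of these names come from the PR.
final class MissingStatsFallbackSketch
{
    // Min and max of the filtered column for one file (assumed shape).
    record Range(long min, long max) {}

    private MissingStatsFallbackSketch() {}

    // Files without an entry in statsByFile are kept, so results stay correct
    // even when a table was written without column stats.
    static List<String> filesToRead(List<String> allFiles, Map<String, Range> statsByFile, long low, long high)
    {
        return allFiles.stream()
                .filter(file -> {
                    Range range = statsByFile.get(file);
                    // No stats recorded: the file cannot be excluded, so read it.
                    return range == null || (range.max() >= low && range.min() <= high);
                })
                .collect(Collectors.toList());
    }
}
```

With an empty stats map, every file is returned, which corresponds to a table without column stats being scanned in full while the other table in the join can still be pruned.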

@codope
Contributor Author

codope commented Apr 23, 2025

Many of the CI failures in this PR are the same as those in #25599, since this PR is stacked on top of it. I am fixing the base PR. Once all CI checks pass there, I will rebase this PR.

@github-actions

This pull request has gone a while without any activity. Ask for help on #core-dev on Trino slack.

@github-actions

github-actions Bot commented Jun 4, 2025

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

@github-actions github-actions Bot closed this Jun 4, 2025
Labels

cla-signed hive Hive connector hudi Hudi connector stale

3 participants