Fix common CI failures#29402
Conversation
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
📝 WalkthroughWalkthroughThis PR centralizes retry and transient-failure detection, tightens test synchronization and isolation, and improves cleanup in integration tests. It adds Maven Wrapper caching and verification in CI, refactors Google Sheets backoff helpers and test locking, requires QueryRunner for Cassandra visibility checks, extracts shared failure-verification helpers for Iceberg, wraps BigQuery execution with a retry policy, hardens Redshift/SQLServer retry predicates, and introduces Databricks failure classifiers with unit tests. Possibly related PRs
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 ast-grep (0.42.1)testing/trino-testing/src/main/java/io/trino/testing/AbstractTestEngineOnlyQueries.javatesting/trino-testing/src/main/java/io/trino/testing/BaseConnectorTest.javaTip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This PR targets common CI flakiness by tightening retry/failure classification for Databricks-related product tests, improving test isolation/serialization for resource-constrained or externally dependent tests, and stabilizing CI setup performance via improved caching.
Changes:
- Refactors Databricks transient/communication failure detection into reusable helpers and adds focused unit tests for the classification.
- Serializes/selectively isolates flaky tests (Google Sheets live API tests, memory-heavy SPI test) and improves cleanup/uniqueness in Glue metastore tests.
- Improves CI reliability/performance: caches Maven wrapper dists; adjusts Google Sheets HTTP backoff behavior; improves Cassandra visibility waiting via Trino.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| testing/trino-product-tests/src/test/java/io/trino/tests/product/utils/TestQueryExecutors.java | Adds unit coverage for Databricks transient-failure classification. |
| testing/trino-product-tests/src/test/java/io/trino/tests/product/deltalake/util/TestDeltaLakeTestUtils.java | Adds unit coverage for Databricks communication-failure classification. |
| testing/trino-product-tests/src/main/java/io/trino/tests/product/utils/QueryExecutors.java | Extracts Databricks retry predicate into a helper and expands transient conditions. |
| testing/trino-product-tests/src/main/java/io/trino/tests/product/deltalake/util/DeltaLakeTestUtils.java | Centralizes regex compilation and extracts retry predicate for Databricks communication failures. |
| plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/catalog/nessie/TestIcebergNessieCatalogConnectorSmokeTest.java | Special-cases permissible concurrent-delete failures for Nessie; adds causal-chain assertions. |
| plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/BaseIcebergConnectorSmokeTest.java | Extracts concurrent-delete failure assertion into an overridable hook. |
| plugin/trino-google-sheets/src/test/java/io/trino/plugin/google/sheets/TestGoogleSheetsWithoutMetadataSheetId.java | Forces same-thread execution and resource locking for live API stability. |
| plugin/trino-google-sheets/src/test/java/io/trino/plugin/google/sheets/TestGoogleSheets.java | Uses new backoff/handler helpers and serializes execution with a shared lock. |
| plugin/trino-google-sheets/src/main/java/io/trino/plugin/google/sheets/SheetsClient.java | Makes backoff/handler creation per-request; broadens unsuccessful-response backoff to include 429 and 5xx. |
| plugin/trino-delta-lake/src/test/java/io/trino/plugin/deltalake/TestDeltaLakeSharedGlueMetastoreWithTableRedirections.java | Serializes test, avoids name collisions, and ensures Glue table cleanup even on failure. |
| plugin/trino-cassandra/src/test/java/io/trino/plugin/cassandra/TestCassandraTypeMapping.java | Passes QueryRunner into Cassandra test data setup for improved visibility checks. |
| plugin/trino-cassandra/src/test/java/io/trino/plugin/cassandra/CassandraCreateAndInsertDataSetup.java | Waits for table visibility via Trino query instead of Cassandra system tables. |
| core/trino-spi/src/test/java/io/trino/spi/variant/TestMetadata.java | Marks the test as isolated due to high memory usage. |
| .github/actions/setup/action.yml | Adds caching/restoring of Maven wrapper distributions and verifies wrapper availability. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
7026e10 to
1851a9b
Compare
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
plugin/trino-bigquery/src/test/java/io/trino/plugin/bigquery/BaseBigQueryConnectorTest.java (1)
1584-1587: 💤 Low valueConsider adding a comment explaining the error message being matched.
The string literal
"Visibility check was unavailable"appears to be a specific BigQuery error message related to eventual consistency. While string matching on error messages is inherently fragile (BigQuery could change the message format), this is likely the only way to detect this transient failure type.📝 Optional: Add explanatory comment
private static boolean isTransientBigQueryFailure(Throwable throwable) { + // BigQuery returns "Visibility check was unavailable" during eventual consistency windows. + // This is a transient error that typically resolves within seconds. return nullToEmpty(throwable.getMessage()).contains("Visibility check was unavailable"); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@plugin/trino-bigquery/src/test/java/io/trino/plugin/bigquery/BaseBigQueryConnectorTest.java` around lines 1584 - 1587, The isTransientBigQueryFailure method currently matches the literal "Visibility check was unavailable" with no context; add a concise comment above isTransientBigQueryFailure explaining that this specific BigQuery error string indicates transient eventual-consistency/visibility-check failures (why we need to treat it as transient), note that string matching is fragile but necessary here, and mention any observed BigQuery behavior or versions that produced this message so future maintainers know the rationale for using this exact literal.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In
`@plugin/trino-bigquery/src/test/java/io/trino/plugin/bigquery/BaseBigQueryConnectorTest.java`:
- Around line 1584-1587: The isTransientBigQueryFailure method currently matches
the literal "Visibility check was unavailable" with no context; add a concise
comment above isTransientBigQueryFailure explaining that this specific BigQuery
error string indicates transient eventual-consistency/visibility-check failures
(why we need to treat it as transient), note that string matching is fragile but
necessary here, and mention any observed BigQuery behavior or versions that
produced this message so future maintainers know the rationale for using this
exact literal.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 86cf92b1-fb48-44c4-b61b-695b74458c87
📒 Files selected for processing (12)
plugin/trino-bigquery/src/test/java/io/trino/plugin/bigquery/BaseBigQueryConnectorTest.javaplugin/trino-cassandra/src/test/java/io/trino/plugin/cassandra/CassandraCreateAndInsertDataSetup.javaplugin/trino-cassandra/src/test/java/io/trino/plugin/cassandra/TestCassandraTable.javaplugin/trino-cassandra/src/test/java/io/trino/plugin/cassandra/TestCassandraTypeMapping.javaplugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/BaseIcebergConnectorSmokeTest.javaplugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/catalog/nessie/TestIcebergNessieCatalogConnectorSmokeTest.javaplugin/trino-redshift/src/test/java/io/trino/plugin/redshift/TestingRedshiftServer.javaplugin/trino-sqlserver/src/test/java/io/trino/plugin/sqlserver/TestingSqlServer.javatesting/trino-product-tests/src/main/java/io/trino/tests/product/deltalake/util/DeltaLakeTestUtils.javatesting/trino-product-tests/src/main/java/io/trino/tests/product/utils/QueryExecutors.javatesting/trino-product-tests/src/test/java/io/trino/tests/product/deltalake/util/TestDeltaLakeTestUtils.javatesting/trino-product-tests/src/test/java/io/trino/tests/product/utils/TestQueryExecutors.java
🚧 Files skipped from review as they are similar to previous changes (8)
- testing/trino-product-tests/src/test/java/io/trino/tests/product/deltalake/util/TestDeltaLakeTestUtils.java
- testing/trino-product-tests/src/main/java/io/trino/tests/product/utils/QueryExecutors.java
- testing/trino-product-tests/src/test/java/io/trino/tests/product/utils/TestQueryExecutors.java
- plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/catalog/nessie/TestIcebergNessieCatalogConnectorSmokeTest.java
- plugin/trino-iceberg/src/test/java/io/trino/plugin/iceberg/BaseIcebergConnectorSmokeTest.java
- plugin/trino-cassandra/src/test/java/io/trino/plugin/cassandra/TestCassandraTypeMapping.java
- plugin/trino-cassandra/src/test/java/io/trino/plugin/cassandra/CassandraCreateAndInsertDataSetup.java
- testing/trino-product-tests/src/main/java/io/trino/tests/product/deltalake/util/DeltaLakeTestUtils.java
fdbdd51 to
1a1502a
Compare
The shared ExponentialBackOff carried retry state across requests.
Tests were competing with each other while sharing API quota.
The Glue test could race inherited exact schema-listing checks and leave its extra table behind after failures.
Trino sometimes queried the table before connector metadata saw the Cassandra table created directly through CQL.
Nessie REST outages matched the same broad commit-failure prefix as the expected stale-reference conflict.
Databricks tests retried arbitrary SQLExceptions, hiding non-transient failures, while some cluster-state errors were missed.
SQL Server tests occasionally fail before startup when Testcontainers cannot pull the MCR image. Treat that startup failure as retryable.
BigQuery sometimes rejects setup statements with a transient visibility check error. Retry native setup calls so these service blips do not fail the suite.
Databricks Delta tests can fail when Glue throws an internal metastore NPE during table creation. Treat that startup-side failure as transient.
Redshift cloud tests sometimes get connection refused after the cluster is reported available. Retry that startup window like other Redshift transient failures.
Azure can reject the intentionally expired URI during creation when a one-second expiration rounds down to the service's current second.
HiveServer2 can reject Beeline connections immediately after the Hive data lake starts.
The invalid rename path left source schemas behind, polluting shared connector test databases.
The launcher mounted credentials at a path with a missing slash, so Trino could not find them during multinode-gcs startup.
Connector query tests could fail when concurrent schema changes made SHOW SCHEMAS and SHOW SCHEMAS LIKE observe different listings. The LIKE and ESCAPE behavior is engine-side, so exercise it in the engine suite instead.
The Kerberos smoke test can hide an early CTAS failure behind a cleanup-time kill_query race. Surface the async query result so CI shows whether the query failed early or completed before renewal was exercised.
HDFS placement failures currently show only the client exception, which hides why the NameNode rejected the live DataNode. Dump HDFS and container state on real test failures, and verify a startup write before running filesystem tests.
Product tests can fail during environment startup when Adoptium transiently returns 5xx while downloading the JDK tarball. Retry the download and remove partial archives so later attempts do not try to extract a corrupt file.
6d646e0 to
4a9ac4a
Compare
Release notes
(x) This is not user-visible or is docs only, and no release notes are required.